marginPhase error: Assertion `seqLen >= rProbs->length + rProbs->refStart' failed.

Hi @benedictpaten

do you know what would cause error Assertion seqLen >= rProbs->length + rProbs->refStart' failed. ?

here is the execution log:

aechchik@dee-serv07:/scratch/beegfs/monthly/aechchik/amphioxus/mp$ ../../build/marginPhase /scratch/beegfs/monthly/aechchik/amphioxus/mp/reads_to_haploADref.bam  /scratch/beegfs/monthly/aechchik/amphioxus/hm2/amphio_A_ref_D.fa marginPhase/params/params.pacbio.json
Set log level to INFO
> Parsing model parameters from file: marginPhase/params/params.pacbio.json
> Parsing input reads from file: /scratch/beegfs/monthly/aechchik/amphioxus/mp/reads_to_haploADref.bam
        Created 9724438 profile sequences
> Parsing prior probabilities on positions from reference sequences: /scratch/beegfs/monthly/aechchik/amphioxus/hm2/amphio_A_ref_D.fa
marginPhase: /scratch/beegfs/monthly/aechchik/amphioxus/mp/marginPhase/impl/referencePriorProbs.c:171: createReferencePriorProbabilities: Assertion `seqLen >= rProbs->length + rProbs->refStart' failed.
Aborted

thanks Amina

FYI, just discovered that another program that I run in parallel failed because I was running out of disk space. Maybe that can be the cause.

Will keep you updated

EDIT update: nope, the error is still there. thanks!

If it helps:

$ ./marginPhase/build/marginPhase -v
usage: marginPhase <BAM> <REFERENCE_FASTA> <PARAMS> [options]
Version: 1.0.0

$ uname -a
Linux dee-serv07.vital-it.ch 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Can confirm, got the same error using ONT data ./marginPhase/build/marginPhase /scratch/test/sizealleles/bams/Lib2_BC06.bam /scratch/test/sizealleles/ref/sizealleles.fasta ./marginPhase/params/params.nanopore.json Set log level to INFO > Parsing model parameters from file: ./marginPhase/params/params.nanopore.json > Parsing input reads from file: /scratch/test/sizealleles/bams/Lib2_BC06.bam Created 36512 profile sequences > Parsing prior probabilities on positions from reference sequences: /scratch/test/sizealleles/ref/sizealleles.fasta marginPhase: ./marginPhase/impl/referencePriorProbs.c:171: createReferencePriorProbabilities: AssertionseqLen >= rProbs->length + rProbs->refStart' failed.` Aborted`

was similarly run on CentOS

I found a workaround. The solution is to add about a hundred lines of just N's to the reference FASTA file and then marginPhase works. Note that already the example reference file of marginPhase, i.e. hg19.chr3.9mb.fa starts with such a block.

Sorry for the trouble, what you're seeing looks like a mismatch between the read coordinates (presumably coming from the bam) and the reference sequences, as provided by the input fasta. This could happen if you did not have the same reference assembly in the bam and reference fasta. Hope that helps.

PS, you might want to switch off debug flags once you've got it running fast and enable compile optimization. It will be a lot faster.

On Fri, Nov 23, 2018 at 4:03 AM klucek notifications@github.com wrote:

I found a workaround. The solution is to add about a hundred lines of just N's to the reference FASTA file and then marginPhase works. Note that already the example reference file of marginPhase, i.e. hg19.chr3.9mb.fa starts with such a block.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441224390, or mute the thread https://github.com/notifications/unsubscribe-auth/AAkOIIRRnNMJjWi-ny3l04rWXIEewS15ks5ux-QUgaJpZM4Yns-m .

-- Benedict (calendar invites: bpaten@ucsc.edu, appointments: Tina Bernard tibernar@ucsc.edu)

Hi, thanks for the reply, I was actually about to write you more about this

to add some background, the data that I have is Oxford Nanopore amplicon sequences of two highly variable genes that are linked to self-incompatibility in the plant Arabidopsis lyrata. I generated for each individual an assembly using CANU and identified for each gene 7 major haplotypes that differ in their length due to different deletions in the introns. I thus generated a fake reference genome using these major contigs and aligned my ONT data with bwa mem. Because each individual is expected to be heterozygous (similar to MHC), I was hoping to use marginPhase.

I have to note that the sequences in the BAM file still include the multiplex barcodes, which is not true for the reference. So even though I use the reference against which I have aligned my data, the reads in the BAM file are longer.

Another question - marginPhase seems to downsample reads to a coverage of 64 - given that I have amplicons, I basically have a read-depth of >6000x. Do you think it could change something allowing for a higher depth? Thanks a lot Best Kay

On 23 Nov 2018, at 17:24, Benedict Paten notifications@github.com wrote:

Sorry for the trouble, what you're seeing looks like a mismatch between the read coordinates (presumably coming from the bam) and the reference sequences, as provided by the input fasta. This could happen if you did not have the same reference assembly in the bam and reference fasta. Hope that helps.

PS, you might want to switch off debug flags once you've got it running fast and enable compile optimization. It will be a lot faster.

On Fri, Nov 23, 2018 at 4:03 AM klucek notifications@github.com wrote:

I found a workaround. The solution is to add about a hundred lines of just N's to the reference FASTA file and then marginPhase works. Note that already the example reference file of marginPhase, i.e. hg19.chr3.9mb.fa starts with such a block.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441224390, or mute the thread https://github.com/notifications/unsubscribe-auth/AAkOIIRRnNMJjWi-ny3l04rWXIEewS15ks5ux-QUgaJpZM4Yns-m .

-- Benedict (calendar invites: bpaten@ucsc.edu, appointments: Tina Bernard tibernar@ucsc.edu) — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441280138, or mute the thread https://github.com/notifications/unsubscribe-auth/AUNm5cRM19o1AVft9799Bk_48e8Gg2qWks5uyCEugaJpZM4Yns-m.

Hi @benedictpaten, thanks for the reply.

well, for my original run, I really believe that the reference fed as assembly (input fasta) and as mapping reference for the reads (input bam) was exactly the same (since I only have one).

I am then not sure where the problem comes from. The mapping was generated with minimap2 (mode long-reads to reference), but I will also try to feed the mapping by bwa-mem. I'll also see if @klucek's workaround (to add Ns to the contig sequence) turns out helpful in any way.

Keeping you updated.

Thanks,
Amina

Hi Amina, Kay -

Bit confused about who is doing what (sorry to not read in detail), but two points:

If the assembly provided and the bam that the reads were mapped to is different then all bets are off. Padding might be a solution to at least avoid a seg fault (this is C code - no bounds checking), but if the assembly and the ref differs in base composition, it will produce odd results.
For coverage beyond 64 depth, the thing to do is to generate the two haplotypes from the vcf and then map the reads to the haplotypes, picking for each read the haplotype that has the best score.

It is easy to do this within marginPhase too, and I'll add as a feature if I can.

On Mon, Nov 26, 2018 at 2:35 AM Amina Echchiki notifications@github.com wrote:

Hi @benedictpaten https://github.com/benedictpaten, thanks for the reply.

well, for my original run, I really believe that the reference fed as assembly (input fasta) and as mapping reference for the reads (input bam) was exactly the same (since I only have one).

I am then not sure where the problem comes from. The mapping was generated with minimap2 (mode long-reads to reference), but I will also try to feed the mapping by bwa-mem. I'll also see if @klucek https://github.com/klucek's workaround (to add Ns to the contig sequence) turns out helpful in any way.

Keeping you updated.

Thanks, Amina

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441592779, or mute the thread https://github.com/notifications/unsubscribe-auth/AAkOIBlIP0nykCdwwP93n-06YLPE0OB8ks5uy8PlgaJpZM4Yns-m .

-- Benedict (calendar invites: bpaten@ucsc.edu, appointments: Tina Bernard tibernar@ucsc.edu)

Hi Benedict, I finally got my hands on this again and I guess I have some additional insights - the problem was indeed that the reads were longer than my “reference genome”. This is because I work with amplicons for a region that is not covered in the real reference genome. My fake reference is a consensus of individual assemblies obtained with CANU. The solution to this was hard clipping of the initial alignment followed by a realignment and marginPhase works perfectly.

Attached is an example - the two haplotypes differ among other by a 500bp deletion which marginPhase gets mostly resolved. A question though - because I have amplicons sequenced, I have depths of up to 1000x, yet marginPhase has a hard limit of 64 reads. Do you think increasing this would improve things in my case? To do so, I suppose I can just increase the number in the partitions.c file (line 18). Thanks a lot Best Kay

On 27 Nov 2018, at 22:10, Benedict Paten <notifications@github.com mailto:notifications@github.com> wrote:

Hi Amina, Kay -

Bit confused about who is doing what (sorry to not read in detail), but two points:

If the assembly provided and the bam that the reads were mapped to is different then all bets are off. Padding might be a solution to at least avoid a seg fault (this is C code - no bounds checking), but if the assembly and the ref differs in base composition, it will produce odd results.

For coverage beyond 64 depth, the thing to do is to generate the two haplotypes from the vcf and then map the reads to the haplotypes, picking for each read the haplotype that has the best score.

It is easy to do this within marginPhase too, and I'll add as a feature if I can.

On Mon, Nov 26, 2018 at 2:35 AM Amina Echchiki <notifications@github.com mailto:notifications@github.com> wrote:

Hi @benedictpaten <https://github.com/benedictpaten https://github.com/benedictpaten>, thanks for the reply.

well, for my original run, I really believe that the reference fed as assembly (input fasta) and as mapping reference for the reads (input bam) was exactly the same (since I only have one).

I am then not sure where the problem comes from. The mapping was generated with minimap2 (mode long-reads to reference), but I will also try to feed the mapping by bwa-mem. I'll also see if @klucek <https://github.com/klucek https://github.com/klucek>'s workaround (to add Ns to the contig sequence) turns out helpful in any way.

Keeping you updated.

Thanks, Amina

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441592779 https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441592779>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAkOIBlIP0nykCdwwP93n-06YLPE0OB8ks5uy8PlgaJpZM4Yns-m https://github.com/notifications/unsubscribe-auth/AAkOIBlIP0nykCdwwP93n-06YLPE0OB8ks5uy8PlgaJpZM4Yns-m> .

-- Benedict (calendar invites: bpaten@ucsc.edu mailto:bpaten@ucsc.edu, appointments: Tina Bernard <tibernar@ucsc.edu mailto:tibernar@ucsc.edu>) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-442218191, or mute the thread https://github.com/notifications/unsubscribe-auth/AUNm5UfvfURaiD4BRwYqLlZzNTIC6MUvks5uzapdgaJpZM4Yns-m.

Great. It is not trivial to increase the max allowable depth (because we use 64 bit integers to model bipartitions), but the simple solution is to realign each read > 64x coverage to each haplotype and then pick the haplotype that fits best. Does that make sense?

On Fri, Jan 11, 2019 at 12:23 AM klucek notifications@github.com wrote:

Hi Benedict, I finally got my hands on this again and I guess I have some additional insights - the problem was indeed that the reads were longer than my “reference genome”. This is because I work with amplicons for a region that is not covered in the real reference genome. My fake reference is a consensus of individual assemblies obtained with CANU. The solution to this was hard clipping of the initial alignment followed by a realignment and marginPhase works perfectly.

Attached is an example - the two haplotypes differ among other by a 500bp deletion which marginPhase gets mostly resolved. A question though - because I have amplicons sequenced, I have depths of up to 1000x, yet marginPhase has a hard limit of 64 reads. Do you think increasing this would improve things in my case? To do so, I suppose I can just increase the number in the partitions.c file (line 18). Thanks a lot Best Kay

On 27 Nov 2018, at 22:10, Benedict Paten <notifications@github.com mailto:notifications@github.com> wrote:

Hi Amina, Kay -

Bit confused about who is doing what (sorry to not read in detail), but two points:

If the assembly provided and the bam that the reads were mapped to is different then all bets are off. Padding might be a solution to at least avoid a seg fault (this is C code - no bounds checking), but if the assembly and the ref differs in base composition, it will produce odd results.

For coverage beyond 64 depth, the thing to do is to generate the two haplotypes from the vcf and then map the reads to the haplotypes, picking for each read the haplotype that has the best score.

It is easy to do this within marginPhase too, and I'll add as a feature if I can.

On Mon, Nov 26, 2018 at 2:35 AM Amina Echchiki <notifications@github.com mailto:notifications@github.com> wrote:

Hi @benedictpaten <https://github.com/benedictpaten < https://github.com/benedictpaten>>, thanks for the reply.

well, for my original run, I really believe that the reference fed as assembly (input fasta) and as mapping reference for the reads (input bam) was exactly the same (since I only have one).

I am then not sure where the problem comes from. The mapping was generated with minimap2 (mode long-reads to reference), but I will also try to feed the mapping by bwa-mem. I'll also see if @klucek <https://github.com/klucek https://github.com/klucek>'s workaround (to add Ns to the contig sequence) turns out helpful in any way.

Keeping you updated.

Thanks, Amina

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441592779 < https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-441592779 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AAkOIBlIP0nykCdwwP93n-06YLPE0OB8ks5uy8PlgaJpZM4Yns-m < https://github.com/notifications/unsubscribe-auth/AAkOIBlIP0nykCdwwP93n-06YLPE0OB8ks5uy8PlgaJpZM4Yns-m

.

-- Benedict (calendar invites: bpaten@ucsc.edu mailto:bpaten@ucsc.edu, appointments: Tina Bernard <tibernar@ucsc.edu <mailto:tibernar@ucsc.edu

) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-442218191>, or mute the thread < https://github.com/notifications/unsubscribe-auth/AUNm5UfvfURaiD4BRwYqLlZzNTIC6MUvks5uzapdgaJpZM4Yns-m .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benedictpaten/marginPhase/issues/12#issuecomment-453425629, or mute the thread https://github.com/notifications/unsubscribe-auth/AAkOIG4micWAQBCwMpu8ctZyBcBanpbpks5vCEn7gaJpZM4Yns-m .

-- Benedict (calendar invites: bpaten@ucsc.edu, appointments: Tina Bernard tibernar@ucsc.edu)

Ive run into the same issue as described above using ONT data.

benedictpaten / marginPhase

marginPhase error: Assertion `seqLen >= rProbs->length + rProbs->refStart' failed. #12