PASSIONLab / BELLA

BELLA: a Computationally-Efficient and Highly-Accurate Long-Read to Long-Read Aligner and Overlapper

Problem with running evaluation test #6

Closed pspealman closed 5 years ago

pspealman commented 5 years ago

Hello,

I'm attempting to use BELLA but have encountered a problem in the alignment step. The installation is on CentOS 3.10, using gcc 6.3.0, with a virtualenv pip install of simplesam. When I try the test files using ./bella -i test.txt -o bella_output -d 30 the produced output file 'bella_output.out' is empty.

Looking over the printout I find this: 'Average length of successful alignment -nan bp'. Everything else seems fine.



paeruginosa30x_0001_5reads.fastq: 0 MB
K-mer counting: BELLA
Output filename: bella_output.out
K-mer length: 17
X-drop: 7
Depth: 30X
Compute alignment: true
Seeding: two-kmer

Running with up to 4 threads
Reading FASTQ file paeruginosa30x_0001_5reads.fastq
Initial parsing, error estimation, and k-mer loading took: 0.0749692s

Cardinality estimate is 27251
Table size is: 186907 bits, 0.0222811 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 0.0391129s
Second pass of k-mer counting took: 0.0153589s

Entries within reliable range: 231
Error rate estimate is -nan
Reliable lower bound: 2
Reliable upper bound: 30
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: -nan

Running with up to 4 threads
Reading FASTQ file paeruginosa30x_0001_5reads.fastq
Fastq(s) parsing fastq took: 0.0317728s
Total number of reads: 5

Old number of nonzeros before merging: 474
New number of nonzeros after merging: 474
Old number of nonzeros before merging: 474
New number of nonzeros after merging: 474
Sparse matrix construction took: 0.00283812s

Available RAM is assumed to be: 8000 MB
FLOPS is 255
nnz(output): 10 | free memory: 8.38861e+09 | required memory: 288
Stages: 1 | max nnz per stage: 291271111

Columns [0 - 5] overlap time: 0.00734909s
Creating or appending to output file with 0 MB

Columns [0 - 5] alignment time: 0.0329758s | alignment rate: 937294 bases/s | average read length: 5556.6 | read pairs aligned this stage: 10
Average length of successful alignment -nan bps
Average length of failed alignment 5788.6 bps

Outputted 0 lines in 0.0115177s
Total running time: 0.346414s
giuliaguidi commented 5 years ago

Hi Pieter,

BELLA does not output anything because the input file paeruginosa30x_0001_5reads.fastq contains only the five reads we use for debugging purposes.

Could you please run BELLA with this dataset (or a dataset of your choice with a meaningful number of sequences) and let me know if that was the problem? You can download that dataset and substitute its path/location in the test.txt file.

Thanks for pointing this out! I'm going to update the README to point to a meaningful input file for testing BELLA.

Best, Giulia

giuliaguidi commented 5 years ago

You would need to generate the ground truth file in order to run the evaluation script on BELLA's output. For convenience, here's a single-mapped ground truth associated with the input file I pointed you to in the previous comment: ecsample_singlemapped_q10.txt.

You can run the evaluation code located in the /bench folder as:

cd bench && make bench

./bench -g ecsample_singlemapped_q10.txt -b bella-output.out -o bella-output-bench.out

You can find more details about how we generate the ground truth files here.

Best, Giulia

pspealman commented 5 years ago

Giulia,

Thanks for your rapid reply. I reran bella using your supplied dataset and it worked without error! Buoyed by that success I tried it on my own data again. Unfortunately, I encountered the same 'Average length of successful alignment -nan bps' error. (output of the run is pasted below)

The dataset I'm using has 29,125 reads of Saccharomyces cerevisiae, generated using Oxford Nanopore's MinION system. I can supply you with this data if it would help.

K-mer counting: BELLA
Output filename: bc02.out
K-mer length: 17
X-drop: 7
Depth: 25X
Compute alignment: true
Seeding: two-kmer

Running with up to 4 threads
Reading FASTQ file /scratch/ps163/Project_erisapfel/BC02_1726/BC02_1726.fastq
Initial parsing, error estimation, and k-mer loading took: 36.34s

Cardinality estimate is 1.57995e+08
Table size is: 1083648980 bits, 129.181 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 118.994s
Second pass of k-mer counting took: 46.3084s

Entries within reliable range: 32968935
Error rate estimate is 7.22259e+10
Reliable lower bound: 2
Reliable upper bound: 24
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: 8.34654e+21

Running with up to 4 threads
Reading FASTQ file /scratch/ps163/Project_erisapfel/BC02_1726/BC02_1726.fastq
Fastq(s) parsing fastq took: 76.1973s
Total number of reads: 29125

Old number of nonzeros before merging: 223502632
New number of nonzeros after merging: 223151408
Old number of nonzeros before merging: 223502632
New number of nonzeros after merging: 223151408
Sparse matrix construction took: 157.177s

Available RAM is assumed to be: 8000 MB
FLOPS is 1315217713
nnz(output): 19250908 | free memory: 8.38861e+09 | required memory: 554426150
Stages: 1 | max nnz per stage: 291271111

Columns [0 - 29125] overlap time: 257.976s
Creating or appending to output file with 0 MB

Columns [0 - 29125] alignment time: 4325.86s | alignment rate: 7.00481e+06 bases/s | average read length: 30719.4 | read pairs aligned this stage: 19250908
Average length of successful alignment -nan bps
Average length of failed alignment 2201.56 bps

Outputted 0 lines in 0.018904s
Total running time: 5137.8s
giuliaguidi commented 5 years ago

Hi Pieter,

It looks like something went wrong during the error estimation from the fastq, which in turn affected the alignment threshold. That threshold ended up so high that no alignment passed it.

If you could share your data with me at gguidi@berkeley.edu I’ll look into this problem asap.

Best, Giulia

pspealman commented 5 years ago

Giulia,

Great - I'm getting the file into a shareable format and location presently and will get that information to you asap.

But I also want to add that this is not time-critical, so it shouldn't be a source of stress - especially on the weekend!

giuliaguidi commented 5 years ago

Hi Pieter,

I got your file and I'm looking at it. To estimate the error rate from the dataset, BELLA uses the Phred scores, which it computes as ASCII(symbol) - 33. The quality scores usually look like this (sample):

-..,,'/.,(..-.''$..)*.-..-./-.%-"--&*-,.).,..(!,"(.)."$,.(,$..-.--.$+-.$,)-,.-.-*.,(-(&.)...*,.&+,,'---.+%*/*+.*".&*.'..

while in your dataset they appear like:

&&-1,++,-7<4:8;=5::.(*#+3,1&.%%((*'..-33/6(+%$#()--/14192-,,,,#,).,-*0.,.&,*017;<>7<9<;:9:0/,*(()40)..++.24;:<7498-43
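The Phred decoding described above can be sketched in Python (a minimal illustration of the idea, not BELLA's actual C++ code; the helper name is my own):

```python
def mean_error_rate(qual: str) -> float:
    """Estimate the mean per-base error probability of a read from its
    Phred+33 quality string: Q = ASCII(symbol) - 33, p = 10^(-Q/10)."""
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
    return sum(probs) / len(probs)

# 'I' encodes Q = 40, i.e. an error probability of 1e-4 per base
print(mean_error_rate("IIII"))
```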

These cause a wrong error-rate estimation, which in turn affects the alignment threshold. The temporary workaround for running your ONT datasets with BELLA is to define the error rate yourself using the -e flag (e.g. -e 0.15). I ran it with e = 0.15 and depth = 30 (just for testing purposes) and it should be fine:

Running with up to 2 threads
Cardinality estimate is 1.57995e+08
Table size is: 1083648980 bits, 129.181 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 139.92s
Second pass of k-mer counting took: 83.0561s

Running with up to 2 threads
Entries within reliable range: 22426677
Error rate estimate is 0.15
Reliable lower bound: 2
Reliable upper bound: 7
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: 0.356
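For what it's worth, the 'Constant of adaptive threshold' 0.356 in the log above is consistent with the expected per-base alignment score scaled by the deviation parameter. A hedged reconstruction, inferred from the logged numbers rather than taken from BELLA's source (the +1/-1 match/mismatch scores are assumptions):

```python
def adaptive_threshold_constant(error_rate: float, deviation: float = 0.2,
                                match: int = 1, mismatch: int = -1) -> float:
    # Probability that both bases of an aligned pair are sequenced correctly.
    p_correct = (1 - error_rate) ** 2
    # Expected alignment-score contribution per aligned base pair.
    expected = p_correct * match + (1 - p_correct) * mismatch
    # Scale down by the allowed deviation from the expected score.
    return (1 - deviation) * expected

print(adaptive_threshold_constant(0.15))  # ~0.356, matching the log above
```

Under this reconstruction, an error-rate estimate of NaN (or 7.22e+10, as in the first run) propagates straight into the threshold constant, which is exactly the failure mode seen in the logs.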

ONT datasets do not use the regular Phred score encoding (https://www.biostars.org/p/226743/); we will investigate this issue in order to support error estimation from raw ONT datasets.

Thanks for pointing this out!

Best, Giulia

pspealman commented 5 years ago

Giulia,

Thanks for the suggestion - I reran BELLA on the sample and it completed without error! What is interesting is that I double-checked the Phred score specs on ONT's website: they don't have a formal technical note, but a post from a technician (Forrest Brennan) says that they use standard Sanger Phred+33 - so I'm not sure what the issue is. But ONT always makes things unintentionally interesting, so I shouldn't be surprised.

Anyways - thanks for the help!

Pieter


giuliaguidi commented 5 years ago

I will look more into this and will get back to you if I find an explanation/solution.

Best, Giulia

giuliaguidi commented 5 years ago

ONT does use Phred+33, as PacBio does.

Pieter and I figured out that some sequences in the fastq file trigger a wrong estimation of the error rate from raw data: when the dataset is subsampled, the error estimation is correct.

Pieter shed some light on this: those reads are produced by long-repeat inversions, which the ONT base-callers (both Albacore and Guppy 2+) cannot accurately resolve. The base-caller just gives up and labels everything with a low Phred score until the inversion is passed. The next generation of ONT base-callers may be able to handle it.
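One way to screen for such reads before estimating the error rate would be to flag long runs of low Phred scores; here is a sketch (the function name and the quality threshold are my own choices, not part of BELLA):

```python
def longest_low_quality_run(qual: str, q_threshold: int = 7) -> int:
    """Length of the longest stretch of consecutive bases whose Phred+33
    score falls below q_threshold (e.g. a 'give-up' region inside an
    unresolved inversion)."""
    best = run = 0
    for symbol in qual:
        if ord(symbol) - 33 < q_threshold:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best
```

Reads whose longest low-quality run exceeds some fraction of the read length could then be excluded from the error-rate estimate.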

References to structural variations that ONT currently appears unable to resolve:

ONT did just release the Guppy 3.0 base-caller, which might resolve this issue.

For ONT datasets with this kind of issue, I would suggest users run BELLA with the error rate explicitly defined via the -e flag.