Closed. pspealman closed this issue 5 years ago.
Hi Pieter,
BELLA does not output anything because the input file _paeruginosa30x_00015reads.fastq contains only five reads that we used for debugging purposes.
Could you please run BELLA with this dataset (or a dataset of your choice with a meaningful number of sequences) and let me know if that was the problem? You can download that dataset and substitute its path/location in the test.txt file.
Thanks for pointing this out! I'm going to update the README to point to a meaningful input file for testing BELLA.
Best, Giulia
You would need to generate the ground truth file in order to run the evaluation script on BELLA's output. For convenience, here's a single mapped ground truth associated with the input file I pointed you to in the previous comment: ecsample_singlemapped_q10.txt.
You can run the evaluation code located in the /bench folder as:
cd bench && make bench
./bench -g ecsample_singlemapped_q10.txt -b bella-output.out -o bella-output-bench.out
You can find more details about how we generate the ground truth files here.
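For a rough sense of what such an evaluation measures, here is a minimal sketch (Python, not the actual bench code) of computing recall and precision over read pairs. It assumes a hypothetical whitespace-separated layout in which the first two columns of both the ground truth and the BELLA output are the read IDs of a pair; the real file formats may differ, so treat this only as an illustration.

# Sketch only: hypothetical format in which the first two whitespace-separated
# columns of each line are the two read IDs of a pair. The real ground truth
# and BELLA output layouts may differ.
def load_pairs(path):
    pairs = set()
    with open(path) as fh:
        for line in fh:
            cols = line.split()
            if len(cols) >= 2:
                pairs.add(frozenset(cols[:2]))  # orientation-independent pair
    return pairs

truth = load_pairs("ecsample_singlemapped_q10.txt")  # ground truth pairs
found = load_pairs("bella-output.out")               # pairs reported by BELLA
tp = len(truth & found)
print("recall:", tp / len(truth) if truth else 0.0)
print("precision:", tp / len(found) if found else 0.0)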
Best, Giulia
Giulia,
Thanks for your rapid reply. I reran BELLA using your supplied dataset and it worked without error! Buoyed by that success, I tried it on my own data again. Unfortunately, I encountered the same 'Average length of successful alignment -nan bps' error (the output of the run is pasted below).
The dataset I'm using has 29,125 reads of Saccharomyces cerevisiae, generated using Oxford Nanopore's MinION system. I can supply you with this data if it would help.
K-mer counting: BELLA
Output filename: bc02.out
K-mer length: 17
X-drop: 7
Depth: 25X
Compute alignment: true
Seeding: two-kmer
Running with up to 4 threads
Reading FASTQ file /scratch/ps163/Project_erisapfel/BC02_1726/BC02_1726.fastq
Initial parsing, error estimation, and k-mer loading took: 36.34s
Cardinality estimate is 1.57995e+08
Table size is: 1083648980 bits, 129.181 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 118.994s
Second pass of k-mer counting took: 46.3084s
Entries within reliable range: 32968935
Error rate estimate is 7.22259e+10
Reliable lower bound: 2
Reliable upper bound: 24
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: 8.34654e+21
Running with up to 4 threads
Reading FASTQ file /scratch/ps163/Project_erisapfel/BC02_1726/BC02_1726.fastq
Fastq(s) parsing fastq took: 76.1973s
Total number of reads: 29125
Old number of nonzeros before merging: 223502632
New number of nonzeros after merging: 223151408
Old number of nonzeros before merging: 223502632
New number of nonzeros after merging: 223151408
Sparse matrix construction took: 157.177s
Available RAM is assumed to be: 8000 MB
FLOPS is 1315217713
nnz(output): 19250908 | free memory: 8.38861e+09 | required memory: 554426150
Stages: 1 | max nnz per stage: 291271111
Columns [0 - 29125] overlap time: 257.976s
Creating or appending to output file with 0 MB
Columns [0 - 29125] alignment time: 4325.86s | alignment rate: 7.00481e+06 bases/s | average read length: 30719.4 | read pairs aligned this stage: 19250908
Average length of successful alignment -nan bps
Average length of failed alignment 2201.56 bps
Outputted 0 lines in 0.018904s
Total running time: 5137.8s
Hi Pieter,
It looks like something went wrong during the error estimation from the FASTQ file, which in turn affected the alignment threshold. The threshold ended up so high that no alignment could pass it.
If you could share your data with me at gguidi@berkeley.edu I’ll look into this problem asap.
Best, Giulia
Giulia,
Great - I'm getting the file into a shareable format and location presently and will get that information to you asap.
But I also want to add that this is not time critical and it shouldn't be a stress - especially on the weekend!
Hi Pieter,
I got your file and I'm looking at it. To estimate the error rate from the dataset, BELLA uses the Phred scores, computing them as ASCII(symbol) - 33. The quality scores usually look like this (sample):
-..,,'/.,(..-.''$..)*.-..-./-.%-"--&*-,.).,..(!,"(.)."$,.(,$..-.--.$+-.$,)-,.-.-*.,(-(&.)...*,.&+,,'---.+%*/*+.*".&*.'..
while in your dataset they appear like:
&&-1,++,-7<4:8;=5::.(*#+3,1&.%%((*'..-33/6(+%$#()--/14192-,,,,#,).,-*0.,.&,*017;<>7<9<;:9:0/,*(()40)..++.24;:<7498-43
These cause a wrong error rate estimate, which in turn affects the alignment threshold. The temporary solution for running your ONT dataset with BELLA is to define the error rate yourself using the -e flag (like -e 0.15). I ran it with e = 0.15 and depth = 30 (just for testing purposes) and it looks fine:
Running with up to 2 threads
Cardinality estimate is 1.57995e+08
Table size is: 1083648980 bits, 129.181 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 139.92s
Second pass of k-mer counting took: 83.0561s
Running with up to 2 threads
Entries within reliable range: 22426677
Error rate estimate is 0.15
Reliable lower bound: 2
Reliable upper bound: 7
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: 0.356
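For reference, here is a minimal sketch (Python, not BELLA's actual code) of the ASCII(symbol) - 33 decoding described above: each quality symbol maps to Q = ASCII(symbol) - 33 and to a per-base error probability of 10^(-Q/10), and averaging those probabilities over a read gives one simple error-rate estimate. The helper name and the averaging step are illustrative assumptions, not BELLA's implementation.

# Sketch only: standard Phred+33 decoding, not BELLA's implementation.
def phred_error_probs(quality_string):
    # Q = ASCII(symbol) - 33; per-base error probability p = 10^(-Q/10)
    return [10 ** (-(ord(c) - 33) / 10) for c in quality_string]

sample = "-..,,'/.,(..-.''$..)*.-."  # fragment of a typical quality line
probs = phred_error_probs(sample)
print(sum(probs) / len(probs))       # simple mean per-base error probability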
ONT datasets do not use the regular Phred score encoding (https://www.biostars.org/p/226743/). We will investigate this issue in order to support error estimation from raw ONT datasets.
Thanks for pointing this out!
Best, Giulia
Giulia,
Thanks for the suggestion - I reran BELLA on the sample and it completed without error! What is interesting is that I double-checked the Phred score specs on ONT's website: they don't have a formal technical note, but a post from a technician (Forrest Brennan) says they use standard Sanger Phred+33, so I'm not sure what the issue is. But ONT always makes things unintentionally interesting, so I shouldn't be surprised.
Anyways - thanks for the help!
Pieter
I will look more into this and will get back to you if I find an explanation/solution.
Best, Giulia
ONT does use the Phred (ASCII - 33) score encoding, as PacBio does.
Pieter and I figured out that some sequences in the FASTQ file trigger a wrong error rate estimation from the raw data; when the dataset is subsampled, the error estimation is correct.
Pieter shed some light on this: Those reads are produced by long-repeat inversions which the ONT base-callers (both Albacore and Guppy 2+) cannot accurately resolve. The base-caller just gives up and labels everything with a low Phred score until the inversion is passed. Potentially, the next generation of ONT base-callers may be able to handle it.
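As a rough illustration (a sketch under the standard Phred+33 interpretation, not BELLA's estimator), a long run of near-minimum quality symbols like the ones emitted across such an inversion can dominate a read's mean per-base error probability:

# Sketch only: shows how a stretch of near-minimum Phred+33 symbols ('#' = Q 2,
# i.e. roughly a 63% error probability per base) drags a read's mean error up.
def mean_error(quality_string):
    return sum(10 ** (-(ord(c) - 33) / 10) for c in quality_string) / len(quality_string)

normal_read = "-" * 1000                # Q 12 everywhere, ~6% error per base
inversion_read = "-" * 800 + "#" * 200  # 20% of the read stuck at Q 2
print(mean_error(normal_read), mean_error(inversion_read))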
References to structural variations that ONT currently appears unable to resolve:
ONT just released the Guppy 3.0 base-caller, which might resolve this issue.
For ONT datasets with this kind of issue, I would suggest that users run BELLA with the error rate explicitly defined via the -e flag.
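For example, combining -e with the input/output/depth flags used elsewhere in this thread (paths and values are placeholders; adjust them for your own data), the invocation would look something like:
./bella -i test.txt -o bella_output -d 30 -e 0.15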
Hello,
I'm attempting to use BELLA but have encountered a problem in the alignment step. The installation is on CentOS 3.10, using gcc 6.3.0, with a virtualenv pip install of simplesam. When I try the test files using
./bella -i test.txt -o bella_output -d 30
the produced output file 'bella_output.out' is empty. Looking over the printout I find this: 'Average length of successful alignment -nan bp'. Everything else seems fine.