SaraEl-Metwally / LightAssembler

Lightweight resources assembly algorithm
GNU General Public License v3.0

Error: maximum supported read length for this version = 1024 #5

Open tseemann opened 8 years ago

tseemann commented 8 years ago

I can't seem to assemble my data, 5 Mbp bacterial genome, PE reads. I've tried various k and g etc.

LightAssembler -k 31 -G 5000000 -t 72 /data/R1.fq.gz /data/R2.fq.gz --verbose
--- Parameters extrapolation.

--- h(0):m(0):s(8) elapsed time.
--- start with gap size g = 8
--- average read length = 137
--- average sequencing coverage = 131

--- Uniform kmers sampling.

--- h(0):m(0):s(0) elapsed time.
--- total number of kmers in BloomA = 0
--- BloomA false positive rate = 0
--- probability of an incorrect kmer appears in the sample : 0.151046

--- Trusted/untrusted kmers filtering.

--- h(0):m(0):s(0) elapsed time.
--- total number of kmers in BloomB = 0
--- BloomB false positive rate = 0
--- LightAssembler can not assemble your dataset !!!
--- maximum supported read length for this version = 1024
--- try different values for k [kmer size] & g [gap size] or different dataset
tseemann commented 8 years ago

I think the bug is that you do not support read files with path in them?

So ecoli.fastq.gz works, but not /path/to/the/reads/ecoli.fastq.gz ?

SaraEl-Metwally commented 7 years ago

screenshot from 2016-09-30 04_32_23

SaraEl-Metwally commented 7 years ago

As you can see, LightAssembler supports the path to read files. Thanks!

tseemann commented 7 years ago

The path suggestion was just one idea I had.

Can you suggest any other reasons why we are unable to get any results with your software?

SaraEl-Metwally commented 7 years ago

Can you give me the exact command line that you are using for your dataset?

michaelbarton commented 7 years ago

I believe @jfroula was having the same problem running the software here at the JGI. Jeff, perhaps you could outline the problem you were having, if you have your code samples at hand?

michaelbarton commented 7 years ago

In my experience, this appears to be related to the -G flag when it is not set to an accurate value. I've found that using 10x the anticipated value makes this error go away, assuming we're describing the same cause: LightAssembler appears to generate the same error message regardless of the cause.

SaraEl-Metwally commented 7 years ago

Sorry for the late reply, @michaelbarton. The value of the -G flag (the genome size) should be relatively accurate because it plays a key role in determining the size of the Bloom filter and therefore its false positive rate, which affects the trusted/untrusted kmers filtering step of LightAssembler (i.e. the LightAssembler results). I tried different genome size values for GAGE Staphylococcus_aureus (genome size: 2903081 bp) to see the effect of the genome size value on the assembly results.

(genome size: 1803081 bp) screenshot from 2017-02-16 05-08-16

(genome size: 1103081 bp) screenshot from 2017-02-16 10-12-41
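
To illustrate why an underestimated genome size can matter for a Bloom filter, here is a minimal sketch using the textbook Bloom filter formulas. It is not LightAssembler's actual sizing code: the 1% target false positive rate, the function names, and the assumption that the number of distinct kmers is roughly equal to the genome size are all hypothetical choices made for illustration.

```python
import math

def bloom_bits(n_items, target_fpr):
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))

def bloom_hashes(m_bits, n_items):
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))

def bloom_fpr(m_bits, k_hashes, n_items):
    """Approximate false positive rate after inserting n_items."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Genome size quoted in the thread for GAGE Staphylococcus_aureus.
TRUE_GENOME_SIZE = 2903081
# Hypothetical target; the value LightAssembler uses is not stated in the thread.
TARGET_FPR = 0.01

def realized_fpr(assumed_size, true_size=TRUE_GENOME_SIZE, target_fpr=TARGET_FPR):
    """Size the filter for assumed_size, then fill it with true_size items.

    Assumption: the number of distinct kmers is about the genome size.
    """
    m = bloom_bits(assumed_size, target_fpr)
    k = bloom_hashes(m, assumed_size)
    return bloom_fpr(m, k, true_size)

for assumed in (2903081, 1803081, 1103081):
    print(f"-G {assumed}: realized false positive rate ~ {realized_fpr(assumed):.3f}")
```

Under these assumptions, a filter sized for a smaller genome than the real one ends up with a noticeably higher false positive rate, which is consistent with the advice above to keep -G relatively accurate.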

LightAssembler generates a general message if it fails to assemble the given dataset, listing possible causes of the failure such as read length, gap size, or kmer size. I will also mention in this message that the genome size value should be relatively accurate. I sent an email to @jfroula to learn about his issues with LightAssembler so I can fix them.

Thank you so much.

michaelbarton commented 7 years ago

Thanks for following up. I believe it may not be possible to have an accurate estimate of the genome size ahead of time, for example when assembling a novel genome for the first time. It is possible to approximate the size from the observation rate of unique kmers when sampling from the reads; however, this could be error-prone if LightAssembler is particularly sensitive to this value.
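
As a rough illustration of the kmer-based approximation mentioned above, the sketch below counts kmers in simulated reads and takes the number of distinct kmers seen at least a few times as the genome size estimate. This is one simple heuristic, not something LightAssembler provides. The function names, the `min_depth` threshold, and the simulated 20 kbp genome are hypothetical, and the sketch ignores reverse complements and repeats.

```python
import random
from collections import Counter

def count_kmers(reads, k):
    """Count every kmer of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_genome_size(counts, min_depth=3):
    """Estimate genome size as the number of distinct kmers seen >= min_depth times.

    Kmers seen fewer times are assumed to come mostly from sequencing errors.
    """
    return sum(1 for c in counts.values() if c >= min_depth)

def simulate_reads(genome, read_len, coverage, error_rate, rng):
    """Sample reads uniformly from the genome with random substitution errors."""
    n_reads = coverage * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
reads = simulate_reads(genome, read_len=100, coverage=30, error_rate=0.01, rng=rng)

counts = count_kmers(reads, k=21)
estimate = estimate_genome_size(counts)
naive = estimate_genome_size(counts, min_depth=1)
print(f"true size: {len(genome)}, filtered estimate: {estimate}, unfiltered: {naive}")
```

Without the depth filter, error kmers inflate the count considerably, which is one reason such estimates can be error-prone on real data.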