Tulip footprint for a large plant genome

dcopetti commented 5 years ago

Hello, we have ~30x ONT coverage (N50 9.3 kb, 165 Gb) of a ~5 Gb plant genome and we would like to assemble it with TULIP. We wonder whether it will run smoothly on such a genome. To run it on a cluster, we also need to estimate the resources needed: do you think 1 TB of memory will be enough? Will 60-100 cores be enough to run the assembly in a decent time? (Is it possible to get an estimated assembly time?) Also, how much storage space do you think we will need? Would 4 TB be enough to write intermediate files? Lastly, is I/O speed of any importance at any step? Can you tell us how to guesstimate how many resources to allocate in our case? Thanks, Dario
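As a quick sanity check on the figures in this post, mean coverage is simply total sequenced bases divided by genome size. A minimal sketch (the function name `coverage` is only illustrative):

```python
def coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Mean sequencing depth = total sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


if __name__ == "__main__":
    # 165 Gb of ONT reads over a ~5 Gb genome, as stated in the post.
    depth = coverage(165, 5)
    print(f"approximate coverage: {depth:.0f}x")
```

With the numbers given, this works out to about 33x, consistent with the "~30x" stated above.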

hjjansen commented 5 years ago

Hi Dario, It shouldn't need too many resources, as it is single-threaded and the graphs are not too complicated. But I'm not sure this version of TULIP will be able to assemble this genome. We haven't tested larger genomes with this version, and it might get stuck on more complicated parts of the genome.

Development of TULIP continued offline, and it was rewritten by the original author @christiaanhenkel in the Julia language. The rewrite is much faster and has better memory management and graph support. We at Future Genomics Technologies are now getting a version ready for release. It is not yet available, but we've used it internally on a number of genome assemblies.

We used the publicly available NA12878 nanopore dataset, which assembled to 2.7 Gbp with an NG50 of ~11 Mbp. A MUMmer plot showed nice agreement with hg38. As far as I've seen, max memory was ~15 GB and it completed in just under 4 hours. It still runs on a single thread.

We also used it to assemble the 34 Gbp tulip genome. There was only 6x coverage (working on that), but this version of TULIP assembled 24 Gbp (not a very high N50 due to lack of data). We also used it with ~11x coverage on a 50 Gbp fish genome, which assembled to 48 Gbp (again without an impressive N50), but we could easily find genes (very bloated) and also all Hox clusters. So the size of the genome doesn't seem to be a problem. Complete with seed selection and alignments, this took 420 CPU hours and maxed out at 70 GB of RAM.

The output is a contig FASTA file with uncorrected reads stitched together, so correction and polishing are still needed. It can also report all the reads that belong to a contig, so pileups or local assemblies can be made using other tools. If you are interested in using this new version, please get in touch. Thanks, Hans

dcopetti commented 5 years ago

Hi Hans, With such low computational demands, the new version of TULIP would definitely be interesting to test. What development stage is the software at now? Will a person with a moderate informatics background be able to install and run it on Unix? I am new to the Julia language, so that would be my main doubt as of now. Do you offer de novo genome assembly as a service? Thanks, Dario

christiaanhenkel commented 5 years ago

Hi Dario, The new TULIP should be available quite soon, I hope. I'm currently testing it on several datasets, and should write a short manual. Moderate informatics background should be just fine! Julia and its packages are easy to install, and we might be able to make a binary available at some point. Chris

hjjansen commented 5 years ago

Hi Dario, In addition to what Christiaan wrote, I can add that we do indeed offer bioinformatics services such as de novo assembly and annotation using the MAKER pipeline. Cheers, Hans

dcopetti commented 5 years ago

@christiaanhenkel: sounds good, we are looking forward to testing the new version on our data! @hjjansen: good to know

dcopetti commented 5 years ago

Hello, Any update regarding the possibility to run TULIP on our complex plant genome? Thanks, Dario

hjjansen commented 3 years ago

It took a long while, but a new version of TULIP is now available at futuregenomics.tech. It includes a script to extract seeds from long reads. TULIP has been completely rewritten and performs much better. On the 34 Gb tulip genome it took 1500 CPU hours and no more than 120 GB of RAM.
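For a rough guesstimate on another genome, the figures reported in this thread can be normalized per Gbp and scaled to the target size. The sketch below is only a back-of-the-envelope bracket and rests on several assumptions: resource use is taken to scale linearly with size (which ignores coverage, repeat content, and read length); the NA12878 entry uses the assembled size and treats "just under 4 hours" on a single thread as about 4 CPU hours, possibly excluding seed selection and alignment; and all names in the code are illustrative.

```python
# Figures as reported in this thread. Treating them as comparable data points
# is an assumption: the runs differ in coverage and in what was timed.
REPORTED_RUNS = {
    "NA12878 (human)": {"size_gbp": 2.7, "cpu_hours": 4.0, "ram_gb": 15.0},
    "tulip": {"size_gbp": 34.0, "cpu_hours": 1500.0, "ram_gb": 120.0},
}


def per_gbp(value: float, size_gbp: float) -> float:
    """Normalize a reported resource figure by genome (or assembly) size."""
    return value / size_gbp


def bracket(target_gbp: float, key: str) -> tuple[float, float]:
    """Scale each reported run linearly to the target size; return (low, high)."""
    estimates = [
        per_gbp(run[key], run["size_gbp"]) * target_gbp
        for run in REPORTED_RUNS.values()
    ]
    return min(estimates), max(estimates)


if __name__ == "__main__":
    target = 5.0  # Gbp, the plant genome asked about at the top of the thread
    for key, unit in (("cpu_hours", "CPU hours"), ("ram_gb", "GB RAM")):
        low, high = bracket(target, key)
        print(f"{key}: roughly {low:.0f} to {high:.0f} {unit} (naive linear scaling)")
```

The per-Gbp CPU cost of the two runs differs by more than an order of magnitude, so the spread between the low and high values says more than either number does on its own.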
