Closed shelkmike closed 2 years ago
Hi @shelkmike,
Thanks for testing BLEND out with different read sets. This is very much appreciated. I have several questions and comments:
Can you please provide me with the parameter settings that you used for BLEND and Minimap2?
I agree with your observation that you make regarding working with genomes with repetitive regions. I believe this can actually even be speculated from the results we show in our paper. For example, the HiFi human genome overlapping results do not provide the similar speedup that we observe when using E. coli reads. We have not included this observation in the paper as it is still an ongoing progress to be definite about this observation. We believe the main reason can be inferred from the following code snippet:
https://github.com/CMU-SAFARI/BLEND/blob/51c5e46684f78e5494a0642ab5fc7d04b223adc2/src/seed.c#L86-L96
The above code is not commented out in the original Minimap2 implementation. It filters out the k-mers that occur with a high frequency. Perhaps this is mainly useful for such repetitive genomes. However, since we do not know yet how to efficiently setup such a filter without hurting the accuracy (because BLEND assigns the same hash value for potentially similar seeds and this may increase the number of seeds that fall into the same hash value bucket), we disabled this feature. We are likely to re-enable this feature if we can confidently filter out some seeds based on the number of their occurrence without hurting the accuracy. Once re-enabled, we expect a speedup especially for repetitive genomes as we can hopefully deal with repetitive seeds efficiently.
Best,
Can Firtina
Thank you for the answer. The commands for the reads of the nematode were
blend -x ava-pb -t 22 $long_reads $long_reads 2>blend_logs.txt | gzip -1 > overlaps_of_long_reads.paf.gz
minimap2 -x ava-pb -t 22 $long_reads $long_reads 2>minimap2_logs.txt | gzip -1 > overlaps_of_long_reads.paf.gz
For the reads of arabidopsis the commands were the same, except for "ava-ont" instead of "ava-pb".
I think that BLEND has good potential for improvement. Good luck with it!
In the paper https://arxiv.org/abs/2112.08687 BLEND was tested on only one non-HiFi read dataset. That was a simulated read dataset for one of the smallest eukaryotic genomes — the genome of Saccharomyces cerevisiae.
To test how well BLEND performs on real (non-simulated) datasets of genomes which have more typical sizes, I used it to assemble genomes from these two sets of reads: 1) Caenorhabditis elegans, PacBio CLR reads used in the article https://www.sciencedirect.com/science/article/pii/S2589004220305770 . For polishing I also used Illumina reads from that article. The nematode genome size is approximately 100 Mbp. 2) Arabidopsis thaliana, Nanopore reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR5530736 . For polishing I also used Illumina reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR2173372 . The size of arabidopsis' genome is approximately 120 Mbp.
I searched for overlaps, then assembled the genomes with Miniasm using default parameters, then polished the assemblies using long reads with Racon, and then polished the assemblies using both long and short reads with HyPo. The assemblies were compared with references using QUAST.
The search for overlaps was performed with Blend 1.0 and, for comparison, with Minimap 2.22, using 22 threads of Intel Xeon X5670.
So, the assemblies of the nematode genome made with Minimap2 and with BLEND are similar. However, Blend required 20x more time to find overlaps and 2x more RAM.
For arabidopsis Minimap found overlaps in 30 minutes using 29G RAM. I terminated BLEND because it didn't finish in 24 hours. At the moment I terminated it, BLEND was using 300G RAM.
So, it seems that on non-HiFi datasets for genomes not as small as the genome of Saccharomyces cerevisiae BLEND is slower than Minimap2 and uses more RAM. This may be so because BLEND doesn't deal efficiently with repetitive seeds.