Closed JavierUrban closed 3 years ago
We've already checked that blastn is working properly. The long runtimes seem to be related to the strategy itself, so the issue is likely heading towards an alternative software or strategy, such as treating the concatenated reads as a metagenome.
You can try aligners designed for large sequence datasets, such as DIAMOND or LAST.
For more information on working with long reads:
Huson, D. H., Albrecht, B., Bağcı, C., Bessarab, I., Gorska, A., Jolic, D., & Williams, R. B. (2018). MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biology Direct, 13(1), 6.
Bağcı, C., Beier, S., Górska, A., & Huson, D. H. (2019). Introduction to the analysis of environmental sequences: metagenomics with MEGAN. In Evolutionary Genomics (pp. 591-604). Humana, New York, NY.
Try alignment-free methods:
https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-017-1319-7 (see Tables 1 & 2)
https://sourceforge.net/projects/rafts3/ (a fast BLAST alternative)
For a review on how to use metagenomic tools and construct metagenome-assembled genomes (MAGs):
Chen, L. X., Anantharaman, K., Shaiber, A., Eren, A. M., & Banfield, J. F. (2020). Accurate and complete genomes from metagenomes. Genome Research, 30(3), 315-333.
You can also take a look at this article: Benchmarking the MinION: evaluating long reads for microbial profiling. DOI: 10.1038/s41598-020-61989-x
In this blog post the author documents the steps he follows to remove microbial contamination (bacteria, viruses, fungi, protozoa, and archaea) from both short and long reads. He explains that he didn't find any tools specialized for PacBio long reads, so he tried several tools and concluded that minimap2 best met his needs.
https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/#more
Hope this could be useful.
It was concluded that BLAST is very slow because of the large number of reads in my data. Unfortunately, I can't make it any faster on my local computer. But my colleagues suggested several alternatives: (1) align the reads against reference genomes and extract the reads of interest, (2) treat the samples as metagenomes to identify the sequences of microorganisms, and (3) extract the mitochondrial sequences to assemble them and compare them with other species.
Given the computational resources I have now, I decided to focus on alternatives 1 and 3. With the first one I will probably lose coverage for de novo assembly, but I hope to complete it later with more sequences. I had not thought of the third, but it will help me answer some of my biological questions and, given the amount of data (sequences), it is more manageable for now.
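For alternatives 1 and 3, the "extract the reads of interest" step could look roughly like the sketch below. It assumes the reads were already aligned to a reference (for example a mitochondrial genome) with minimap2 and that the alignments were written in PAF format, where columns 1 to 4 are the read name, read length, and aligned start/end on the read, and column 12 is the mapping quality. The function name, the thresholds, and the sample rows are made up for illustration and would need tuning on real data.

```python
def reads_matching_reference(paf_lines, min_aligned_fraction=0.5, min_mapq=20):
    """Return the names of reads whose alignment covers enough of the read.

    Thresholds are illustrative assumptions, not recommended values.
    """
    keep = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated rows
        name = fields[0]
        read_len = int(fields[1])
        aln_start = int(fields[2])
        aln_end = int(fields[3])
        mapq = int(fields[11])
        if read_len == 0:
            continue
        aligned_fraction = (aln_end - aln_start) / read_len
        if aligned_fraction >= min_aligned_fraction and mapq >= min_mapq:
            keep.add(name)
    return keep


# Made-up PAF rows (12 mandatory columns each), only to show the behaviour.
example_paf = [
    "readA\t10000\t100\t9500\t+\tmito_ref\t16000\t200\t9600\t9000\t9400\t60",
    "readB\t8000\t0\t1500\t-\tmito_ref\t16000\t3000\t4500\t1200\t1500\t60",
    "readC\t5000\t0\t4800\t+\tmito_ref\t16000\t100\t4900\t4500\t4800\t3",
]

selected = reads_matching_reference(example_paf)
print(sorted(selected))
```

Here readB is dropped because only a small part of it aligns, and readC because of its low mapping quality. The resulting set of names can then be used to pull the matching records out of the original FASTA/FASTQ file before assembly.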
Do you need to create databases or is there a better option?
I created a database for copepods:
But the databases are very large and running BLAST is very slow.
To run BLAST use:
The option
-outfmt "6 <options>"
shows the results in a tab-separated table. Although it's hard to see, the percent identity column generally looks higher than 80%, so I think that maybe the data is not so contaminated. However, I think that running BLAST against many databases will take longer and be more difficult, so I would like to know if anyone knows an easier way to run BLAST? Or would it be a better option to start testing assemblers with different parameters? The first draft I have seems to be fragmented.
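Rather than judging the identity column by eye, the tabular output can be summarised directly. The sketch below assumes the default -outfmt 6 column order (qseqid, sseqid, pident, ...), where percent identity is the third column; if custom <options> were used, the column index would need adjusting. The function names and the sample rows are hypothetical.

```python
def best_identity_per_read(lines, pident_col=2):
    """Return {query_id: highest percent identity} from tab-separated BLAST rows."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query = fields[0]
        pident = float(fields[pident_col])
        if pident > best.get(query, -1.0):
            best[query] = pident
    return best


def fraction_at_or_above(best, threshold=80.0):
    """Fraction of reads whose best hit reaches the identity threshold."""
    if not best:
        return 0.0
    return sum(1 for value in best.values() if value >= threshold) / len(best)


# Made-up rows in the default 12-column layout, only to show the behaviour.
example_rows = [
    "read1\tsubjA\t92.50\t1000\t70\t5\t1\t1000\t200\t1199\t0.0\t1500",
    "read1\tsubjB\t85.00\t800\t100\t20\t1\t800\t50\t849\t1e-150\t900",
    "read2\tsubjA\t78.30\t600\t120\t10\t10\t609\t5\t604\t1e-90\t500",
    "read3\tsubjC\t81.00\t700\t120\t13\t1\t700\t1\t700\t1e-120\t650",
]

best = best_identity_per_read(example_rows)
share = fraction_at_or_above(best, threshold=80.0)
print(f"{share:.2f} of reads have a best hit at or above 80% identity")
```

On a real results file, the same functions can be fed an open file handle instead of the example list, since they only iterate over lines.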