JavierUrban / Genome-assembly-of-the-copepod-Leptodiaptomus

This repository contains a short description of the workflow for assembling and comparing genomes of the copepod Leptodiaptomus group sicilis, which is undergoing ecological speciation.
MIT License

Difficulties to use databases running BLAST #3

Closed JavierUrban closed 3 years ago

JavierUrban commented 3 years ago

Do you need to create databases or is there a better option?

I created a database for copepods:

makeblastdb -in ../blastpacope/genomas_copepods/ncbi_dataset/data/db_all_copepods.fna -dbtype nucl -parse_seqids -out my_refrence2.fa

But the database files are very large, and running BLAST is very slow.

To run BLAST, I use:

blastn -db my_refrence2.fa -query ../blastpacope/minion/minion_carmen.fa -out results_allgenomes_tab.out -outfmt "6 sframe qseqid sseqid evalue pident mismatch" 

The option -outfmt "6 <options>" writes the results as a tab-separated table. Although it is hard to see, the percent identity column is generally above 80%, so I think the sample may not be very contaminated. However, I think running BLAST against many databases will take longer and be more difficult, so I would like to know if anyone knows an easier way to run BLAST.

[Image: results_blast]

Or would it be better to start testing assemblers with different parameters? The first draft assembly I have seems to be fragmented.

[Image: busco_quast]
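
The percent identity in the tabular output can be summarised instead of read by eye. The following is a minimal sketch, assuming the column order from the -outfmt string above (so pident is column 5); summarize_pident is a hypothetical helper name, not part of BLAST. It counts hits, not reads, so a read with several hits is counted several times.

```shell
# Summarise the percent-identity column of a BLAST tabular (-outfmt 6) file.
# Assumption: columns are "sframe qseqid sseqid evalue pident mismatch",
# as in the blastn command above, so pident is column 5.
# summarize_pident is a hypothetical helper, not a BLAST tool.
summarize_pident() {
    file=$1
    threshold=${2:-80}   # identity cutoff in percent, default 80
    awk -F '\t' -v t="$threshold" '
        { n++; sum += $5; if ($5 >= t) hi++ }
        END {
            if (n == 0) { print "hits=0"; exit }
            printf "hits=%d mean_pident=%.2f at_or_above_%s=%d\n", n, sum / n, t, hi + 0
        }
    ' "$file"
}
```

For example, `summarize_pident results_allgenomes_tab.out 80` would print the number of hits, their mean identity, and how many reach 80%.
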
abelardoacm commented 3 years ago

We've already checked that blastn is working properly. The long runtimes seem to be related to the strategy itself, so the issue is likely heading towards an alternative software or strategy, such as treating the concatenated reads as a metagenome.

valeriafloral commented 3 years ago

You can try using aligners designed for big sequence data, such as DIAMOND or LAST.

For more information on working with long reads:

Huson, D. H., Albrecht, B., Bağcı, C., Bessarab, I., Gorska, A., Jolic, D., & Williams, R. B. (2018). MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biology Direct, 13(1), 6.

Bağcı, C., Beier, S., Górska, A., & Huson, D. H. (2019). Introduction to the analysis of environmental sequences: metagenomics with MEGAN. In Evolutionary Genomics (pp. 591-604). Humana, New York, NY.

abelardoacm commented 3 years ago

Try alignment-free methods:
https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-017-1319-7 (see Tables 1 & 2)
https://sourceforge.net/projects/rafts3/ (a fast BLAST alternative)

valeriafloral commented 3 years ago

For a review on how to use metagenomic tools and construct metagenome-assembled genomes (MAGs):
Chen, L. X., Anantharaman, K., Shaiber, A., Eren, A. M., & Banfield, J. F. (2020). Accurate and complete genomes from metagenomes. Genome Research, 30(3), 315-333.

solnavss commented 3 years ago

You can also take a look at this article: "Benchmarking the MinION: Evaluating long reads for microbial profiling", DOI: 10.1038/s41598-020-61989-x

valeriafloral commented 3 years ago

In this blog post, the author documents the steps he follows to remove microbial contamination (bacterial, viral, fungal, protozoan, and archaeal) from both short and long reads. He explains that he did not find any tools specialized for PacBio long reads, so he tried several tools and concluded that minimap2 best met his needs.

https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/#more

Hope this is useful.
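
As a rough sketch of how the minimap2 approach could be applied here: map the reads against a contaminant reference, write the alignments in PAF format, and list the reads that align with a reasonable mapping quality. The file names in the comment and the helper name paf_mapped_reads are hypothetical, and the default MAPQ cutoff of 20 is an assumption that should be tuned, not a recommendation from the blog post.

```shell
# Assumed mapping step (not run here; file names are placeholders):
#   minimap2 -x map-ont contaminants.fna reads.fa > reads_vs_contaminants.paf
#
# PAF is tab-separated: column 1 is the read name, column 12 the mapping quality.
# paf_mapped_reads is a hypothetical helper that prints each read name once
# if it has at least one alignment with MAPQ >= the cutoff (default 20).
paf_mapped_reads() {
    paf=$1
    min_mapq=${2:-20}
    awk -F '\t' -v q="$min_mapq" '$12 >= q { print $1 }' "$paf" | sort -u
}
```

For example, `paf_mapped_reads reads_vs_contaminants.paf 20 > contaminant_ids.txt` would give a list of candidate contaminant reads to inspect or remove.
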

JavierUrban commented 3 years ago

We concluded that BLAST is very slow because of the large number of reads in my data, and unfortunately I cannot run it any faster on my local computer. My colleagues suggested several alternatives: (1) align the reads to reference genomes and extract the reads of interest, (2) treat the samples as metagenomes to identify sequences from microorganisms, and (3) extract the mitochondrial sequences to assemble them and compare them with other species.

Given the computational resources I have now, I decided to focus on alternatives 1 and 3. With the first, I will probably lose coverage for de novo assembly, but I hope to make up for it later with more sequences. I had not thought of the third, but it will help me answer some of my biological questions, and the smaller amount of data (sequences) is easier for me to work with for now.
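
For alternatives 1 and 3, the "extract the reads of interest" step can be sketched with awk once a list of read IDs exists (for example, the reads that aligned to a reference or mitochondrial genome). This is only an illustration under that assumption: filter_fasta is a hypothetical helper name, and it matches IDs against the FASTA header up to the first whitespace.

```shell
# Keep (or drop) FASTA records whose ID appears in a list, one ID per line.
# filter_fasta is a hypothetical helper; usage:
#   filter_fasta ids.txt reads.fa keep   # only the listed reads
#   filter_fasta ids.txt reads.fa drop   # everything except the listed reads
filter_fasta() {
    ids=$1
    fasta=$2
    mode=${3:-keep}
    awk -v mode="$mode" '
        FILENAME == ARGV[1] { want[$1] = 1; next }   # first file: wanted IDs
        /^>/ {
            id = substr($1, 2)                        # header text up to first space
            emit = (mode == "keep") ? (id in want) : !(id in want)
        }
        emit { print }                                # header and its sequence lines
    ' "$ids" "$fasta"
}
```

The "keep" mode fits extracting mitochondrial or reference-matching reads for assembly, while "drop" fits removing reads flagged as contamination.
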