bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

Confirm that use of BLAST's `-max_target_seqs` is intentional #50

Open armish opened 5 years ago

armish commented 5 years ago

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue, but I would highly recommend adding a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy, as this simple automation is not smart enough to identify such issues.

Thank you! -- Arman (armish/blast-patrol)

ursky commented 5 years ago

Thanks Arman, that was an interesting read. I am definitely guilty of assuming `-max_target_seqs` returned the best hit. In this case, however, the code comes from a previously published tool, Blobology, which I included in my pipeline as is, for simplicity. I see you also warned them, so let's see what they decide to do. Can you suggest a one-liner that would yield the best hit without custom scripting?

In this particular case, the misuse is not that serious, as the goal is to quickly find an approximate taxonomy for each contig. This is done for visualization only and is not meant to be interpreted as the definitive taxonomy of the contig. That is done later in the pipeline by Taxator-tk.

armish commented 5 years ago

Thanks for the details, @ursky! Let me link those two issues for the sake of better traceability then: https://github.com/blaxterlab/blobology/issues/9
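
On the question of getting the best hit: this is not a true one-liner, but one possible approach is to run BLAST with a larger `-max_target_seqs` value and tabular output (`-outfmt 6`), then keep the top-scoring row per query afterwards. The sketch below assumes the default 12-column `-outfmt 6` layout (bitscore in the last column, e-value just before it); the function name `best_hits` and the sample rows are hypothetical and not taken from metaWRAP or Blobology.

```python
import csv
import io

# Column positions in the default BLAST tabular layout (-outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
QSEQID, SSEQID, EVALUE, BITSCORE = 0, 1, 10, 11


def best_hits(lines):
    """Return {query_id: row}, keeping the highest-bitscore row per query.

    Ties on bitscore are broken by the lower e-value.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        key = (float(row[BITSCORE]), -float(row[EVALUE]))
        query = row[QSEQID]
        if query not in best or key > best[query][0]:
            best[query] = (key, row)
    return {query: row for query, (_, row) in best.items()}


# Hypothetical sample: two hits for contig1, one for contig2.
SAMPLE = (
    "contig1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150\n"
    "contig1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-40\t180\n"
    "contig2\tsubjC\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t165\n"
)

if __name__ == "__main__":
    for query, row in best_hits(io.StringIO(SAMPLE)).items():
        print(query, row[SSEQID], row[BITSCORE])
```

Whether the true best hit is among the returned rows still depends on how many targets BLAST was asked to keep, so this only selects the best of what BLAST reported.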