asl / BandageNG

a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
GNU General Public License v3.0

difference between tblastn in gui and command line #138

Closed yjk-bertrand closed 1 year ago

yjk-bertrand commented 1 year ago

Thanks for all your work on BandageNG! Running a BLAST graph search with a protein query in the GUI recovers plenty of hits on my graph. Attempting the same on the command line ("BandageNG-9eb84c2-x86_64.AppImage querypaths graph.gfa protein.fasta prefix") returns an empty file. My understanding is that the protein alphabet should be recognized automatically, but is that the case? Attempting to force tblastn with the --blastp flag (Bandage querypaths graph.gfa protein.fasta prefix --blastp "tblastn -query protein.fasta -db all_nodes.fasta -outfmt 6 ") did not help and produced an error: Error: Too many positional arguments (1), the offending value: tblastn Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: tblastn I must be doing something wrong, but what? Yann

asl commented 1 year ago

Hello

Yes, the sequence type should be determined automatically from the content. Moreover, you can have both protein and nucleotide queries in the same file, and it should run two processes: blastn and tblastn. However, the type detection might not be perfect.
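To illustrate why content-based type detection can misfire, here is a minimal sketch of one common heuristic: count the fraction of nucleotide-like letters in each query. This is not BandageNG's actual detection code; the function names, the `ACGTUN` alphabet, and the 0.9 threshold are assumptions chosen for illustration.

```python
# Hypothetical heuristic, NOT BandageNG's implementation: classify a query
# by the fraction of letters that look like nucleotides.
def guess_sequence_type(seq: str, threshold: float = 0.9) -> str:
    """Return 'nucleotide' or 'protein' for a single sequence."""
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return "nucleotide" if nucleotide_like / len(letters) >= threshold else "protein"


def split_queries(records):
    """Group (name, sequence) pairs by guessed type, e.g. for blastn vs tblastn."""
    groups = {"nucleotide": [], "protein": []}
    for name, seq in records:
        groups[guess_sequence_type(seq)].append(name)
    return groups


records = [
    ("gene1", "ATGGCGTACGTTAGC"),
    ("prot1", "MKVLAAGIVGLLLAQ"),
    # G, A, T and C are also valid amino-acid codes, so a peptide built
    # only from them would be indistinguishable from DNA for this heuristic.
    ("ambiguous", "GATTACAGCCAT"),
]
print(split_queries(records))
```

A heuristic like this would send the `ambiguous` record to the nucleotide group regardless of what it really is, which is the kind of edge case where detection "might not be perfect".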

Would it be possible for you to attach your graph and input FASTA file?

yjk-bertrand commented 1 year ago

Hi Anton, After experimenting with other queries, I am starting to believe that the issue has something to do with different BLAST settings between the GUI and the command line. I have attached the test files. I would also appreciate some advice on the correct string format for the --blastp parameter. I am confused about the file name required for the query: I assumed it should be the same as the one passed to the main program, but the example string in the GUI seems to indicate that it should be 'queries.fasta'. Thanks for your help, Yann test_data.zip

yjk-bertrand commented 1 year ago

Hello, Sorry for insisting. Did you have a chance to look at the graph and the amino acid sequence? Thanks,

asl commented 1 year ago

Sorry, was AFK – will check it out

asl commented 1 year ago

OK, so there is indeed a difference in behavior, and it is related to GFA paths.

  1. By default, the GUI aligns to both node sequences and GFA paths, if present.
  2. Node hits then undergo a trivial chaining procedure to ensure that we can form paths. This procedure is very inefficient: essentially, we cannot join more than ~6 nodes (which is why we also align to the GFA paths).
  3. Symmetrically, we try to split path hits into node hits. However, these hits lack many of the usual features (IDY, E-value, etc.), as it is not possible to recompute those without a full BLAST re-run.
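To give a feel for why chaining node hits gets expensive quickly, here is a rough sketch of a bounded-depth path enumeration over nodes that have hits. This is only an illustration of the general idea, not the procedure BandageNG uses: `chain_hit_nodes`, the adjacency-dict graph, and the `max_nodes` cap are all hypothetical, and node orientation is ignored.

```python
# Illustrative only: enumerate simple paths made entirely of nodes with hits.
# The number of candidate paths can grow exponentially with path length,
# which is why a small cap on the number of nodes is needed.
def chain_hit_nodes(edges, hit_nodes, max_nodes=6):
    """Return every simple path (as a list of node names) through hit nodes."""
    paths = []

    def extend(path):
        paths.append(list(path))
        if len(path) == max_nodes:
            return
        for nxt in edges.get(path[-1], ()):
            if nxt in hit_nodes and nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()

    for start in sorted(hit_nodes):
        extend([start])
    return paths


# Toy graph: node name -> list of successor node names (assumed structure).
edges = {"1": ["2", "3"], "2": ["4"], "3": ["4"], "4": []}
hits = {"1", "2", "4"}
for p in chain_hit_nodes(edges, hits):
    print(" -> ".join(p))
```

In a branching graph, every extra node allowed in a chain multiplies the number of candidates, so aligning directly to precomputed GFA paths is a natural way to sidestep the search.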

On the CLI, we never include GFA paths in the BLAST db.
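The GUI/CLI difference comes down to which sequences end up in the BLAST database. As a rough sketch (not BandageNG code), the following builds FASTA records from a GFA 1 file either with or without path sequences. The function names and the `node_`/`path_` prefixes are hypothetical, and the sketch assumes that segments carry explicit sequences and that path steps join with zero overlap.

```python
# Sketch under stated assumptions: GFA 1 text, S lines with real sequences,
# P lines whose steps are concatenated with no overlap trimming.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gfa_to_fasta_records(gfa_text, include_paths=True):
    """Return (name, sequence) records for nodes and, optionally, GFA paths."""
    segments = {}
    paths = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "P" and len(fields) >= 3:
            paths.append((fields[1], fields[2].split(",")))

    records = [(f"node_{name}", seq) for name, seq in segments.items()]
    if include_paths:
        for name, steps in paths:
            parts = []
            for step in steps:
                seg, orient = step[:-1], step[-1]  # e.g. "2-" -> ("2", "-")
                seq = segments[seg]
                parts.append(seq if orient == "+" else reverse_complement(seq))
            records.append((f"path_{name}", "".join(parts)))
    return records


example_gfa = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGGA\nP\tp1\t1+,2-\t*\n"
for name, seq in gfa_to_fasta_records(example_gfa, include_paths=True):
    print(f">{name}\n{seq}")
```

With `include_paths=False` the records contain only node sequences, which mirrors the CLI behavior described above; with `include_paths=True` a query spanning several nodes can match a single path record.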

Also, it seems that the filtering of path hits was not implemented, so all paths that you're seeing in the GUI are unfiltered path hits.
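For reference, filtering BLAST hits usually means applying thresholds to fields of the tabular output. The sketch below parses standard `-outfmt 6` lines (default 12 columns) and keeps only hits that pass E-value, identity, and length cutoffs. The cutoff values and the `Hit`, `parse_outfmt6`, and `filter_hits` names are illustrative assumptions, not BandageNG's defaults or code.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity (column 3)
    length: int      # alignment length (column 4)
    evalue: float    # column 11
    bitscore: float  # column 12


def parse_outfmt6(text):
    """Parse default 12-column BLAST tabular output into Hit tuples."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits


def filter_hits(hits, max_evalue=1e-10, min_identity=50.0, min_length=30):
    """Keep hits that pass all three (hypothetical) thresholds."""
    return [
        h for h in hits
        if h.evalue <= max_evalue and h.identity >= min_identity and h.length >= min_length
    ]


example = (
    "prot1\tnode_1\t95.0\t120\t6\t0\t1\t120\t10\t369\t1e-60\t210\n"
    "prot1\tpath_p1\t31.2\t16\t11\t0\t5\t20\t100\t147\t2.5\t18.1\n"
)
for hit in filter_hits(parse_outfmt6(example)):
    print(hit.subject, hit.evalue)
```

If one kind of hit bypasses a step like `filter_hits`, weak matches such as the second line of the example would show up as results, which is consistent with seeing many hits in one interface and none in the other.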

[Screenshot 2023-09-22 at 15 07 13]

I am going to fix the confusion:

asl commented 1 year ago

Besides this, path hits were quite buggy. #141 introduced the necessary fixes. Now we correctly report zero hits, as we enabled proper filtering :)