Multiple fixes around path query hits

asl commented 1 year ago

Add possibility to include / exclude GFA path hits
Filter query path hits
Fix issues around path-derived hits
Report query start / end

Fixes #138 Fixes #139

yjk-bertrand commented 1 year ago

Thanks for all the fixes you have added to the CLI! I have tested the blast query with and without --gfapaths. Although I now get hits and no filtered path like you mentioned, I do believe that the filtering is too aggressive. I am using BandageNG to search for homologous proteins on a graph (assembly or blunted gaph ) in the spirit of https://github.com/ablab/orf-search. The test graph I shared with you contains at least two homologous paths to the query protein namely 1510+, 120+, 76-, 2176-, 2194-, 1992- and 1606+. The tblastn search correctly identified the homologous nodes (good Evalue and identity), despite that the graph sequences contain the matching exonic sequences and the non-matching introns. My initial idea was to use the unfiltered hit table while traversing the graph and explore every path that would contain at least some given the proper orientation of the hit and the place of the hit and then test them for putative homology. I would therefore ask you if it would be possible for the user either to modify the filtering parameters and change the stringency level (less stringent in my case) or simply to return the raw hit table from the GUI so the user would perform her own filtering. Thanks!

asl commented 1 year ago

I would therefore ask you if it would be possible for the user either to modify the filtering parameters and change the stringency level (less stringent in my case)

Well, the whole point of path search is to ensure that (almost) whole query is aligned to some path on the graph, if there are some gaps / partial hits, then one cannot in general deduce whether the alignment is proper or just happened by a coincidence. Chaining of seed matches is tricky. You might want to check GraphAligner / Orfograph papers for some insights, but in general – trivial strategies (including one that is implemented in Banadage) fail miserably :)

Still, there are some options that might relax the filtering (use --help to find them):

[Option Group: BLAST search]
  Options:
    --blastp TEXT               Parameters to be used by blastn and tblastn when conducting a BLAST search in Bandage-NG.
                                Format BLAST parameters exactly as they would be used for blastn/tblastn on the command line, and enclose them in quotes.
    --alfilter INT:INT in [1 - 1000000] [off]
                                Alignment length filter for BLAST hits. Hits with shorter alignments will be excluded
    --qcfilter FLOAT:FLOAT in [0 - 100] [off]
                                Query coverage filter for BLAST hits. Hits with less coverage will be excluded
    --ifilter FLOAT:FLOAT in [0 - 100] [off]
                                Identity filter for BLAST hits. Hits with less identity will be excluded
    --evfilter SCINOT:FLOAT in [1e-999 - 99] [off]
                                E-value filter for BLAST hits. Hits with larger e-values will be excluded
    --bsfilter FLOAT:FLOAT in [0 - 1e+06] [off]
                                Bit score filter for BLAST hits. Hits with lower bit scores will be excluded
[Option Group: BLAST query paths]
  These settings control how Bandage-NG searches for query paths after conducting a BLAST search
  Options:
    --pathnodes INT:INT in [1 - 50] [6]
                                The number of allowed nodes in a BLAST query path
    --minpatcov FLOAT:FLOAT in [0.15 - 1] [0.9]
                                Minimum fraction of a BLAST query which must be covered by a query path
    --minhitcov FLOAT:FLOAT in [0.15 - 1], or "off" [0.9]
                                Minimum fraction of a BLAST query which must be covered by BLAST hits in a query path
    --minmeanid FLOAT:FLOAT in [0 - 1], or "off" [0.5]
                                Minimum mean identity of BLAST hits in a query path
    --maxevprod SCINOT:FLOAT in [1e-999 - 99], or "off" [1e-10]
                                Maximum e-value product for all BLAST hits in a query path
    --minpatlen FLOAT:FLOAT in [0 - 10000], or "off" [0.95]
                                Minimum allowed relative path length as compared to the query
    --maxpatlen FLOAT:FLOAT in [0 - 10000], or "off" [1.05]
                                Maximum allowed relative path length as compared to the query
    --minlendis INT:INT in [-1000000 - 1000000], or "off" [off]
                                Minimum allowed length discrepancy (in bases) between a BLAST query and its path in the graph
    --maxlendis INT:INT in [-1000000 - 1000000], or "off" [off]
                                Maximum allowed length discrepancy (in bases) between a BLAST query and its path in the graph

Same options are available via Settings in GUI and / or BLAST filtering button

return the raw hit table from the GUI so the user would perform her own filtering.

These are just plain BLAST hits. Why can't you run your own BLAST search over node sequences and obtain the hit table in whatever format is desired?

yjk-bertrand commented 1 year ago

I am familiar with the GraphAligner / Orfograph papers (the former is a great piece of bioinformatics writing BTW). Ensuring that most of the query is aligned to a path without gap makes a lot of sense for a nucleotide/nucleotide search, much less so for a protein/nucleotide search were you try to match protein on DNA and therefore expect gaps in the path due to introns. The major difficulty I can see when chaining the seeds in case of gaps in the path is that, different purpose would require different chaining. Therefore it make sense not to have a generic chaining algorithm in Bandage. Point taken. In the most relaxed filtering, (--minpatcov 0.2 --minhitcov 0.2 --minpatlen 0 --gfapaths --minmeanid 0.1) paths are made of single nodes so no extension was performed, which is a bit strange since adjacent to the hit nodes, there is a node, fully covered by a query, that should have been included in the path. So there seems to be a small mystery there. I didn't run by own blast search because my understanding of how it was carried out in Bandage was that the search was not performed over individual nodes sequences but over paths. I see my mistake as I get now that the path is build a posteriori from the node hits. Then it is straightforward to do as you suggest. Thanks.

danydoerr commented 1 year ago

Running BLAST/tblastn standalone is probably the easiest solution to your issues, as Anton suggested. But from your description, you seem to be interested in running a split alignment in the sequence graph to account for splice variants (?). The tool of choice I'd suggest would be vg mpmap (https://github.com/vgteam/vg/wiki/Multipath-alignments-and-vg-mpmap). You also seem to be interested in protein graph alignment, so another tool I'd like to shamelessly advertise is PanPA, developed by one of my colleagues at the Marschall lab https://github.com/fawaz-dabbaghieh/PanPA.

asl commented 1 year ago

@yjk-bertrand Bandage's hit chaining is very primitive. Essentially it collects all the hits that are located near the start of the query (given the expected minimum hit coverage) and all hits near the end, then it calculates the range of allowed path lengths (again, given the settings), then performs exhaustive enumeration of all paths that fits within the length range.

Finally, results are filtered given hits and path coverages. In your particular case, it seems that adding --maxpatlen 2 produces much more results :)

yjk-bertrand commented 1 year ago

Indeed, it is a key parameter in this case. Thanks!

yjk-bertrand commented 1 year ago

@danydoerr Yes, I have protein sequence that I am aligning to split variant graph obtained by converting the assembly graph from spades into a Blunted graph. Because I am working on target captured data (hybseq technology) for protein families I usually obtain low depth assemblies with a lot of miss-assemblies in the flanking regions of reads. Therefore it is easier to spot and remove miss-assemblies by cutting dead ends in the graph. Because sequence similarity in the exonic regions are as low as 80% I decided to query the graph with protein sequences instead of DNA. Consequently, using GraphAligner or other DNA/DNA aligners is not an option. Following you advice, I tried PanPa, which is a very interesting tool (thanks for suggesting), yet I am not fully sure of its concrete application. I'd like to see a publication using it. Unfortunately, it doesn't perform well on my data: it returned a single path of nodes that are not connected on the graph. This result is really odd, I don't get it. We have been using vg on pangenomics projects. It feels like an overkill for such a tiny graph, besides my understanding is that it is design for mapping fastq reads on the graph not AA fasta. Please correct me if I am wrong. Anyhow, thanks for your suggestions, it made me discover another cool tool!

danydoerr commented 1 year ago

@yjk-bertrand Apologies for the delay. You are right--vg does not support protein-to-dna alignment, PanPA constructs+uses protein graphs. But both of these approaches allow you to do real graph alignment, and not just path-imposed alignment. I would have expected vg's multipath approach working well on exome data, because of its application in finding splice-variants in rnaseq data. If you have questions/issues on running PanPA, my colleague Fawaz Dabbaghieh will surely be happy to help you out.

asl / BandageNG

Multiple fixes around path query hits #141