davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
716 stars 189 forks source link

How to deal with BLASTP with multiple HSPs? #144

Closed ohdongha closed 5 years ago

ohdongha commented 6 years ago

Dear Authors, This is not to raise an issue, rather a question about the process.

I find BLAST often detect multiple High-scoring Segment Pairs (HSPs) from a comparison of a pair of sequences. Sometimes, parts of these multiple HSPs may overlap.

For example, BLAST returned two HSPs, the first at 310\~711 of the query sequence and the second at 695\~824 (overlapping with the first), with scores 181 and 116, respectively. In this case, how do you calculate the total bit score for these two sequences?

query subject %idn aln_l mis gap qs qe ss se ev sc
PHASIBEAM10B001168 AT3G44620.2 70.37 405 114 3 310 711 184 585 8.11E-45 181
PHASIBEAM10B001168 AT3G44620.2 80.916 131 23 2 695 824 638 767 2.83E-25 116

Thank you in advance.

Best regards, Dong-Ha

davidemms commented 6 years ago

Hi Dong-Ha

Thanks for your question. We considered this in developing the algorithm, looking at if we should sum non-overlapping hits. this didn't have much on an effect on the results and so we just take the single best hit between the sequences.

All the best David

ohdongha commented 6 years ago

Thanks for the clarification and sorry for the delayed response. I was traveling and thinking of replying after coming back, but the case was closed already.

Anyways, I have a few examples where this issue may affect the orthology inference, especially when we tried to dissect the clusters into a locus-level resolution.

I haven't looked at OrthoFinder results yet, but OrthoMCL often calls orthologs clearly derived from two or three different loci, and also functionally diverged with a different protein domain organization, into one ortho-group.

For example (excuse me for telling a bit long story here), OrthoMCL calls, NHX7 and NHX8, two plant ion transporter genes, as in one cluster (ortho-group). NHX7 and NHX8 have different functions and domain organizations, although one of the domains are highly conserved. We tried to further separate this cluster by looking at "best blast-hit pairs" (or RBHs) among the member of the cluster. While all NHX8s found NHX8s from other species as RBHs, one CrNHX8 (Cru|20890420) ortholog failed to be found as an RBH by other NHX8s. Instead, all other NHX8s found CrNHX7 (Cru|20887913) from all other species as its BH partners. 160525_6crucifersorthnet_example_sos1-nhx8[In this figure, the left hexagon represents NHX7s from six species, while the right hexagon NHX8s, including paralogs in some species. Solid gray arrows connect reciprocal best-hit pairs, while dashed ones represent unidirectional best-hits]

I looked at the blast result and realized that there were two HSPs in all blast results with the CrNHX8 (Cru|20890420) as the target. If I consider both HSPs, CrNHX8 would be found as the best-hit by other NHX8s. The above figure was the result when only the best HSP was considered.

Of course, this could be resolved by using all-vs-all blast with normalizations, with higher inflation (-I) value for MCL, if needed (haven't run OrthoFinder on these species yet). But if there is a way to combine and consider all HSPs from the beginning, I think it would make the life simpler and more precise.

...

Assuming that if one has to consider the entire HSPs, what would you advise as for the best approach? Overlapping HSPs would make the simple addition of bit scores complicated, especially so when comparing proteins consisted of different numbers of repetitive domains.

Or would there be other tools which guarantee to give a one-line statistics for the degree of sequence similarity?

davidemms commented 6 years ago

Hi Thanks for the extra info, it's a nice example to consider. Looking at how to deal with multi-domain proteins with possible rearrangements is something I would like to investigate more thoroughly at some point for OrthoFinder and this will be a good example. I would be interested to know what result you get running OrthoFinder on these species. It has a better way of creating the graph for MCL than the simple (R)BH method you show above. Unfortunately, I don't currently have a method to suggest for dealing with multiple potentially overlapping HSP.

All the best David

ohdongha commented 6 years ago

Thanks, David. I will try OrthoFinder in these six species and see what happens to these particular genes. All-vs-all networks will have more edges than (R)BH networks, may react differently to MCL, and may be more tolerant to some of the edges being distorted by the "multiple HSP" problem.

...

By the way, the above example was eventually separable using MCL, with higher edge weights given to reciprocal edges than uni-directional edges.

In the above graph, gray edges connect co-linear orthologs (i.e. showing synteny) while red edges show lack of co-linearity/synteny (i.e. transposed) among different species. I have been working on creating a network representation of co-linear and transposed genes including all duplicated paralogs, where edges are weighted for the presence or absence of co-linearity in gene orders among multiple genomes and clustered by MCL (preprint: https://doi.org/10.1101/236299).

The above example was luckily resolved by MCL, but MCL also appeared to sometimes further break good clusters derived from an isolated locus into even smaller ones. I have been working on optimizing edge weights and other MCL parameters. And I feel solving the issue with multiple HSPs in determining the best/true (R)BH could be a way to improve our approach. I also found some useful insights in designing filters for (R)BH edges to include in our networks, from your OrthoFinder paper. I should have tried your tool earlier, but for some reason, I was able to make it running on our server only very recently.

I digressed. I will be back with the OrthoFinder run on these species.

Cheers, Dong-Ha

ohdongha commented 6 years ago

Dear David, I think MMseqs2 (https://github.com/soedinglab/mmseqs2) (DOI: 10.1038/nbt.3988) could be what I've been looking for. I just tested it and at least it did produce a single HSP for each query-subject (target) pair, with a single bit score / e-value. Moreover, it was super-fast. One pairwise comparison between two genomes took about ~30 seconds.

I will now compare the result from MMseq2 with previous ones using BLASTP. Meanwhile, it would be great if OrthoFinder have an option to use MMseqs2 instead of BLASTP. It offers BLAST m8 table-type output option. I will be back again with MMseqs2-BLASTP comparison, i.e. whether MMseqs2 solved the issue I described above.

Cheers, Dong-Ha

davidemms commented 6 years ago

Hi Dong-Ha

Just to let you know, MMseqs2 is now available in OrthoFinder.

All the best David

ohdongha commented 6 years ago

Thanks, David. I have been using MMseqs2 with the OrthNet (the (R)BH network with co-linearity information I showed above; now online on DNA Res) and it has been working quite well. OrthNet does generate a similar result with OrthoFinder for closely-related genomes, with differences among some of the clusters perhaps due to the incongruence between co-linearity and sequence similarity, which I am currently looking into. This is good news because now I can use MMseqs2 for both tools for comparison. MMseqs2 generate all-to-all results much faster than BLASTP, so I guess the running time can be further shortened for OrthoFinder as well.

I will try running OrthoFinder with MMseqs2 and get back to you if there is an issue.

davidemms commented 5 years ago

Closing this as I think there isn't any software issue here although the details for the example gene family are interesting to consider.

Thanks David