Multiple transcripts to a single database protein in tsv output file

000generic commented 7 years ago

Hi!

I'm working with crb-blast for the first time and was able to use it without any obvious trouble (exit status 0) - running a query squid transcriptome against target single genomes for 10 species. However, in going through the tsv file output, I am finding multiple query transcripts for a single target protein. For instance for squid-fly I ran:

crb-blast --query /workspace/eedsinger/017_khmer-diginorm-all-squid-sequences/step-09_chimera-removal/blastdb-ready_doryteuthis-pealeii-transcriptome_clc-trinity-idba-soap.tr --target $fasta --threads 8 --evalue 1e-3 --output annotation_squid-transcriptomeX$fasta_id2.tsv --verbose

and got:

Squidee121271 Fly7227:000138 42.89 1231 0.0 943.0 4200..682 270..1421 4267.0 1615.0 Squidee71531 Fly7227:000138 81.67 120 4.0e-65 215.0 363..4 794..913 365.0 1615.0 Squidee117771 Fly7227:000138 36.83 915 2.0e-180 568.0 294..2837 58..907 2837.0 1615.0 Squidee137188 Fly7227:000138 79.82 114 3.0e-67 221.0 342..1 787..900 343.0 1615.0 Squidee166626 Fly7227:000138 33.74 1559 0.0 780.0 2631..6812 24..1512 7370.0 1615.0 Squidee183020 Fly7227:000138 34.80 1135 0.0 639.0 4168..1082 27..1113 6828.0 1615.0 Squidee395461 Fly7227:000138 48.93 609 0.0 622.0 2080..299 513..1081 2081.0 1615.0

I checked the fly protein fasta that was used as a target protein set and there is only one entry for the recurring id:

[eedsinger@cluster5 sequences]$ grep -i 'Fly__7227:000138' drosophila-melanogaster-genome.fa

Fly__7227:000138 FBgn0001311

If crb-blast is performing a RBH analysis, my understanding is that there should only be 1:1 results of query transcripts to target proteins between squid and fly - or do I need to parse the tsv file to identify RBHs...?

Thank-you, Eric

charlesfeigin commented 6 years ago

Hi, a resolution to this query would be greatly appreciated. I have found the same issue when running crb blast. In the most egregious case (where one transcript was found 49 times), I saw that the e-values were uniformly 0.0, however I can see that this is not the case in Erics example.

An explanation of the program's behavior would help me to filter my results to be equivalent to a true reciprocal best blast, which has 1:1 paired outputs.

charlesfeigin commented 6 years ago

It kind of looks like crb-blast will output multiple results when two or more sequences return identical e values. This would explain why in my own case, multiple targets with e values of 0.0 were output.

Eyeballing my results and Erics, this is also happening when e values are asymptotically small.

My best guess is that when the e values are this small, they exceed the rounding error of standard data structures used in the programming language the program is written in (Ruby?), and that behind the scenes they are being rounded to 0.0. Ergo, the multiple hits are being called as having identical e values and spit out even though their true values are technically different.

This should be a minor fix in your code.

fciamponi commented 4 years ago

Hi! I'm currtently using crb-blast for annotating predicted yeast genes against the S288C reference and facing the a similar issue. Is there any guidelines for adressing this issue with crb-blast? Or should I perform a manual best reciprocal selection based on e-value and target similarity?

Asutu commented 4 years ago

For what I can understand, I believe this is actually a feature of the CRB-BLAST method, and not a bug, although the authors could jump in and help clarify this issue.

The authors claim this method is an implementation of Aubry et al. 2014, so looking at the supplementary information in this paper where the method was originally described can be useful. They clearly state:

However, as reciprocal best-BLAST methods produce a one-to-one orthology map they are poorly suited for the analysis of transcriptome data and information would likely be lost due to the increased redundancy arising from multiple assembled transcript variants generated during assembly. We therefore developed a novel supervised machine learning method which learns sequence similarity parameters for likely transcript variants from a pairwise reciprocal best-BLAST analysis and then uses this information to classify non-reciprocating BLAST hits as either homologous or non-homologous transcripts.

Further below; the method starts by finding the reciprocal best hits and accept these as "valid" orthologues. Then, the method computes a function from lengths and e-values that is used to recover other putative homologues that initially were non-reciprocating hits:

The function describing this curve is then used to classify all non-reciprocating transcripts of any given length; those above the curve are accepted as putative homologues and those below are rejected.

blahah commented 4 years ago

As far as I can tell, @Asutu is correct: this not a bug but a feature. CRBB is an improvement over RBB precisely because it relieves the restriction of a single best hit which might miss many equally or almost equally good hits. We filter the hits in a context-dependent and statistically rigorous way, which gives you the freedom to select a single best hit as RBB does, however spurious it might be, or to delve into other information about the hits you get. Multiple candidates is the goal, and what you do with them depends on the biological context.

blahah commented 4 years ago

I should note that those floating point rounding issues highlight the underlying problem with RBB - statistically the difference is negligible and should be discarded. Take a look at the blast algorithm to see the many ways in which the score might be different in a biologically irrelevant way. Our paper has a figure which explains it I think:

10 1371%2Fjournal pgen 1004365 g002(1)

Identification of homologues and quantification of gene expression after de novo assembly, for full details see Text S1. (A) Correlation in quantification derived from reciprocal best BLAST (RBB) hits in the de novo assembly and reference summed over all transcript isoforms per reference gene locus. (B) The Spearman correlation in transcript abundances between the reference guided estimation and estimates generated using different transcript orthology assignment methods on the same de novo assembled transcriptome. “RBB only” means that only the reciprocal best BLAST transcripts were selected. E-value cut-offs (e.g. 1e-5) indicate the fixed value at which sequences were determined to be homologues. OrthoMCL indicates that OrthoMCL was used to cluster and identify orthologous transcript groups. Finally, the black bar indicates the effect of varying the percentile cut-off on the abundance estimate accuracy of the conditional orthology assignment method. (C) Conditional orthology assignment method begins by performing all versus all BLAST searches of the assembled transcripts against a reference proteome. (D) The reciprocating hits (indicated by blue lines) are selected for self-training. (E) The reciprocating hits are binned according to assembled transcript length and a quadratic model is fit to the e-value and length data. (F) Non-reciprocating hits which fall above the curve are accepted as putative homologues, non-reciprocating hits which fall below the curve are rejected. (G) Correlation in quantification derived from conditional assigned transcripts using species own reference genome. (H) Correlation in quantification derived from conditional assigned transcripts using intermediary reference genome. For full details, validation and explanation please see the supplementary methods (Text S1).

cboursnell / crb-blast

Multiple transcripts to a single database protein in tsv output file #11