flatironinstitute / deepblast

Neural Networks for Protein Sequence Alignment
BSD 3-Clause "New" or "Revised" License

Length bias in scores #88

Open konstin opened 3 years ago

konstin commented 3 years ago

I'm interested in using DeepBLAST for finding homologous pairs, so I need to rank alignments for a given query.

Based on #78, I built a script and a simple dataset for evaluating my code (I found out that the ValueError I initially got in #78 goes away when switching from CPU to GPU). For both db and query, the dataset has 5 random sequences each for 10 randomly picked CATH superfamilies from the 20% redundancy-reduced set of CATH. I found that even with the norm_score, scores are biased towards long sequences, so that the longest or second-longest sequence is generally considered the closest hit.

With scipy, I got a Pearson correlation coefficient between sequence length and score of 0.53 to 0.94, depending on the CATH subset. I've also plotted the length of each query against its mean norm_score over all db sequences, which looks similar across different sets:

[Image: len_vs_avg_score]
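
For anyone who wants to reproduce the correlation check above on their own score matrix, here is a minimal, dependency-free sketch. It is not the linked script: the function names, the `score_matrix` layout (one row per query, one column per db sequence) and the choice to correlate query length with the per-query mean score are all assumptions made for illustration. `scipy.stats.pearsonr` should give the same coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def length_bias(query_lengths, score_matrix):
    """Correlate each query's length with its mean score over all db sequences.

    score_matrix[i][j] is assumed (hypothetical layout) to hold the
    normalized score of query i against db sequence j.
    """
    mean_scores = [sum(row) / len(row) for row in score_matrix]
    return pearson_r(query_lengths, mean_scores)
```

A coefficient close to 1 would indicate that longer queries systematically receive higher average scores, which is the bias described above.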

I used this script, this cath-db.fasta, and this cath-query.fasta for the plot, together with the DeepBLAST version I got by merging #78 and #87.

Do you have any thoughts on how to obtain a score that is independent of the sequence length?

mortonjt commented 3 years ago

Hi @konstin, sorry for the delayed response (currently in the midst of revisions again...). This is an awesome finding! Thank you for posting this.

Regarding scores that are independent of sequence length, this is something that we are still investigating (particularly for the next iteration with the paired HMMs).

If you need something at this very moment, I think the Karlin-Altschul statistics can be applied here. Namely, the number of matches under the null model is expected to follow a Poisson distribution, so I would expect that P-values / E-values can be generated in a similar fashion (note that this is all speculation; none of it is tested yet...).
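
To make the suggestion above concrete, here is a rough sketch of the standard Karlin-Altschul formulas: the expected number of chance hits is E = K · m · n · exp(-λS), and under the Poisson assumption the probability of at least one such hit is P = 1 - exp(-E). Nothing here is part of DeepBLAST. The function names are hypothetical, and `lam` and `k` are placeholders that would have to be estimated for DeepBLAST scores, for example by fitting scores of unrelated pairs. Whether DeepBLAST scores actually satisfy the assumptions behind these statistics is untested.

```python
from math import exp, expm1


def karlin_altschul_evalue(score, query_len, db_len, lam, k):
    """Expected number of chance hits with at least this score.

    E = K * m * n * exp(-lambda * S). `lam` and `k` are assumed to have
    been estimated elsewhere; no values are known for DeepBLAST.
    """
    return k * query_len * db_len * exp(-lam * score)


def pvalue_from_evalue(evalue):
    """P(at least one chance hit) when the hit count is Poisson with mean E."""
    return -expm1(-evalue)  # 1 - exp(-E), numerically stable for small E
```

Because E scales with both sequence lengths, ranking hits by E-value rather than by raw score penalizes long sequences, which is the kind of length correction asked about in this issue.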