Length bias in scores

I'm interested in using DeepBLAST for finding homologous pairs, so I need to rank alignments for a given query.

Based on #78, I built a script and a small dataset for evaluating my code (the ValueError I initially got in #78 goes away when switching from CPU to GPU). For both db and query, the dataset contains 5 random sequences from each of 10 randomly picked CATH superfamilies, taken from the 20% redundancy-reduced set of CATH. I found that even with the norm_score, scores are biased towards long sequences, so the longest or second-longest sequence is generally ranked as the closest hit.
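
A minimal sketch of how the "longest or second-longest sequence wins" observation could be quantified. It assumes the scores are already collected in a matrix with one row per query and one column per db sequence; the function name `top_hit_length_rank` and the toy numbers are hypothetical and are not part of DeepBLAST or of the linked script.

```python
import numpy as np

def top_hit_length_rank(scores, db_lengths):
    """For each query, return the length rank of its top-scoring db sequence.

    scores:     array of shape (n_query, n_db), e.g. norm_score values (assumed layout)
    db_lengths: array of shape (n_db,), length of each db sequence
    Rank 0 means the top hit is the longest db sequence, 1 the second longest, etc.
    """
    scores = np.asarray(scores, dtype=float)
    db_lengths = np.asarray(db_lengths)
    # Order db sequences from longest to shortest, then invert the permutation
    # so that length_rank[j] is the length rank of db sequence j.
    order = np.argsort(-db_lengths, kind="stable")
    length_rank = np.empty_like(order)
    length_rank[order] = np.arange(len(order))
    # Index of the best-scoring db sequence for every query.
    top_hit = scores.argmax(axis=1)
    return length_rank[top_hit]

# Toy data (made up for illustration only).
toy_db_lengths = [80, 300, 150]
toy_scores = [[0.2, 0.9, 0.5],
              [0.1, 0.6, 0.7]]
ranks = top_hit_length_rank(toy_scores, toy_db_lengths)
# Fraction of queries whose top hit is the longest or second-longest db sequence.
frac_long = float(np.mean(ranks <= 1))
```

If `frac_long` stays close to 1 regardless of superfamily membership, the ranking is driven mostly by length rather than by homology.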

With scipy, I got a Pearson correlation coefficient between sequence length and norm_score of 0.53 to 0.94, depending on the CATH subset. I've also plotted each query's length against its mean norm_score over all db sequences, which looks similar across different sets:

[Figure: len_vs_avg_score]
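
For reference, a rough sketch of how a length-versus-score correlation of this kind can be computed with `scipy.stats.pearsonr`. The score matrix layout (one row per query, one column per db sequence), the helper name `length_score_correlation`, and the toy values are assumptions for illustration, not the contents of the linked script.

```python
import numpy as np
from scipy.stats import pearsonr

def length_score_correlation(scores, query_lengths):
    """Pearson correlation between query length and mean score over the db.

    scores:        array of shape (n_query, n_db), e.g. norm_score values (assumed layout)
    query_lengths: array of shape (n_query,), length of each query sequence
    Returns (r, p_value).
    """
    scores = np.asarray(scores, dtype=float)
    query_lengths = np.asarray(query_lengths, dtype=float)
    # Average each query's score over all db sequences.
    mean_score = scores.mean(axis=1)
    r, p_value = pearsonr(query_lengths, mean_score)
    return r, p_value

# Toy data (made up): scores grow exactly linearly with query length,
# which is the extreme case of the bias described above.
toy_query_lengths = np.array([50.0, 100.0, 150.0, 200.0])
toy_scores = np.outer(toy_query_lengths, [0.01, 0.02, 0.03])
r, p_value = length_score_correlation(toy_scores, toy_query_lengths)
```

A value of `r` near 0 would indicate a score that is roughly independent of length; the values reported above are far from that.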

I used this script, this cath-db.fasta, and this cath-query.fasta for the plot, with the DeepBLAST version I got by merging #78 and #87.

Do you have any thoughts on how to obtain a score that is independent of the sequence length?

flatironinstitute / deepblast

Length bias in scores #88