Improve the similarity search

bebatut commented 7 years ago

Issue

If an extended sequence has a longer alignment to a protein sequence in the db but with some mismatches, the e-value for the predicted sequence (without extension) may be better than the e-value for the extended sequence.

Easy solution

Checking the e-value first
If the e-value is worse for the extended sequence, checking the alignment length
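
The two steps above can be sketched as follows. This is only one possible reading of the rule, not the code in the repository: the `Hit` tuple and the `prefer_extended` function are hypothetical names, and the tie-breaking behavior (the extended sequence wins only on a strictly longer alignment) is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Best alignment of one sequence (hypothetical container)."""
    evalue: float
    length: int  # alignment length


def prefer_extended(predicted: Hit, extended: Hit) -> bool:
    """Return True if the extended sequence should be kept for the CDS."""
    # Step 1: a strictly better (lower) e-value is enough.
    if extended.evalue < predicted.evalue:
        return True
    # Step 2: the e-value is not better, so fall back to the alignment length.
    return extended.length > predicted.length


# The extended sequence has a worse e-value but a longer alignment: it is kept.
print(prefer_extended(Hit(1e-30, 100), Hit(1e-25, 140)))
```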

Validation

Conservation of the 20 best results for each sequence
Plotting a histogram of the identity percentage of the best results for each sequence
Plotting a histogram of the alignment length of the best results for each sequence
Plotting a histogram of the e-value of the best results for each sequence
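
As a rough illustration of the histograms listed above, the binning can be done with the standard library alone before choosing a plotting tool. The `histogram` helper and the sample identity values are made up for the example; they are not taken from the project.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower bound of its bin."""
    counts = Counter()
    for value in values:
        counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Made-up identity percentages of the best hits.
identities = [98.5, 91.2, 45.0, 47.9, 99.0]
for lower, count in histogram(identities, 10).items():
    print(f"{lower:>3}-{lower + 10:<3} {'#' * count}")
```

The same helper would work for the alignment lengths. For e-values, binning `-log10(evalue)` instead of the raw value is probably more readable, since e-values span many orders of magnitude.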

More complex solution

Define a metric combining the e-value, the alignment length, and the identity percentage to decide which sequence is the most appropriate one for the CDS (an extended sequence or the predicted sequence).
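
One way such a metric could look is a weighted mean of the three quantities, each scaled to the range 0 to 1. Everything in this sketch is an assumption made for illustration: the function name, the equal default weights, the cap of 200 on `-log10(e-value)`, and the normalization of the length by the longest candidate alignment. None of it comes from the project.

```python
import math


def combined_metric(evalue, length, identity, max_length,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical score in [0, 1]; higher means a more appropriate hit."""
    w_e, w_l, w_i = weights
    # -log10(e-value), clamped to [0, 200] and scaled; an e-value of 0 maps to 1.
    if evalue <= 0:
        e_term = 1.0
    else:
        e_term = min(max(-math.log10(evalue), 0.0), 200.0) / 200.0
    l_term = length / max_length   # alignment length relative to the longest one
    i_term = identity / 100.0      # identity percentage as a fraction
    return (w_e * e_term + w_l * l_term + w_i * i_term) / (w_e + w_l + w_i)


# Predicted sequence: better e-value, shorter alignment.
predicted = combined_metric(1e-50, 100, 95.0, max_length=150)
# Extended sequence: slightly worse e-value, longer alignment, a few mismatches.
extended = combined_metric(1e-45, 150, 90.0, max_length=150)
print(extended > predicted)
```

With equal weights the longer alignment outweighs the small loss in e-value and identity here. The weights would have to be tuned against the validation plots.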

@keuv-grvl What do you think?

keuv-grvl commented 7 years ago

The e-value only estimates how likely an alignment is to arise by chance. We must set a threshold below which an alignment is considered non-random. There are ~55 million sequences in Uniref90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.release_note), so an e-value of 1/55000000 (~1.8e-8) or less should ensure a non-random alignment over the whole database.
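
The arithmetic behind that threshold, as a small sketch. The sequence count is the approximate figure quoted above, and `evalue_threshold` is a hypothetical helper name.

```python
def evalue_threshold(n_sequences: int) -> float:
    """Threshold of one over the number of sequences in the database."""
    return 1 / n_sequences


# Approximate number of sequences in Uniref90, as quoted in this comment.
threshold = evalue_threshold(55_000_000)
print(f"{threshold:.1e}")  # prints 1.8e-08
```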

Yet this information alone does not take into account the alignment length, the mismatches, or their locations. The score may be a better metric.

bebatut commented 7 years ago

We can automate the extraction of the threshold by checking the number of sequences in the Uniref90 database automatically. We can then adapt src/check_pyl_protein.py to check the e-value against the threshold first and the score value afterwards.

keuv-grvl commented 7 years ago

We can also arbitrarily set the e-value threshold to 1e-10, which is enough for Uniref90 and Uniref100 and easier to compute :smile: We can filter the alignment results according to the e-value, then select the best score.
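
A minimal sketch of that filter-then-select step. It assumes each alignment result has already been parsed into a dict with `evalue` and `score` keys. The `best_hit` name, the dict layout, and the sample hits are hypothetical and do not reflect the current content of src/check_pyl_protein.py.

```python
def best_hit(hits, evalue_threshold=1e-10):
    """Keep hits at or below the e-value threshold, then return the best score.

    Returns None when no hit passes the threshold.
    """
    significant = [hit for hit in hits if hit["evalue"] <= evalue_threshold]
    return max(significant, key=lambda hit: hit["score"], default=None)


# Made-up alignment results for one CDS (predicted and extended sequences).
hits = [
    {"id": "predicted", "evalue": 1e-50, "score": 180.0},
    {"id": "extended_1", "evalue": 1e-45, "score": 210.0},
    {"id": "extended_2", "evalue": 1e-3, "score": 400.0},  # filtered out
]
print(best_hit(hits)["id"])  # prints extended_1
```

The third hit shows why the order of the two steps matters: it has the highest score but does not pass the e-value filter, so it is never considered.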

bebatut / PylProtPredictor