bebatut / PylProtPredictor

Prediction of PYL proteins
http://bebatut.fr/PylProtPredictor/
Apache License 2.0

Improve the similarity search #5

Closed bebatut closed 6 years ago

bebatut commented 7 years ago

Issue

If an extended sequence has a longer alignment to a protein sequence in the db but with some mismatches, the e-value for the predicted sequence (without extension) may be better than the e-value for the extended sequence.

Easy solution

Validation

More complex solution

@keuv-grvl What do you think?

keuv-grvl commented 7 years ago

The e-value only tells us how likely it is that an alignment is random. We must set a threshold below which we consider an alignment to be non-random. There are ~55 million sequences in Uniref90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.release_note), so an e-value of 1/55000000 (~1.8e-8) or less should ensure a non-random alignment over the whole database.

Yet the e-value alone does not take into account alignment length, mismatches, or mismatch locations. The score may be a better metric.

bebatut commented 7 years ago

We can automate the extraction of the threshold by checking the number of sequences in the Uniref90 database. Then we can adapt src/check_pyl_protein.py to first check the e-value against the threshold and then the score value.
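
A minimal sketch of how the threshold could be derived from the database itself, assuming the database is available as a plain or gzipped FASTA file. The function names below are hypothetical and are not taken from src/check_pyl_protein.py.

```python
import gzip


def count_fasta_sequences(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def evalue_threshold(n_sequences):
    """Return 1 / n_sequences, the threshold suggested earlier in this thread."""
    if n_sequences <= 0:
        raise ValueError("the database must contain at least one sequence")
    return 1.0 / n_sequences


def threshold_from_fasta(path):
    """Derive the e-value threshold from a FASTA file (assumed layout).

    Files ending in '.gz' are read through gzip; anything else is read as text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return evalue_threshold(count_fasta_sequences(handle))
```

Counting header lines streams the file once without loading it into memory, which matters for a database of this size.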

keuv-grvl commented 7 years ago

We can also arbitrarily set the e-value threshold to 1e-10, which is enough for Uniref90 and Uniref100 and easier to compute :smile: We can filter the alignment results according to the e-value, then select the best score.
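
The filter-then-select idea could look roughly like this. It is only a sketch: the `Hit` record, its field names, and both function names are assumptions for illustration, not the project's actual data model.

```python
from collections import namedtuple

# Hypothetical alignment record; field names are assumptions.
Hit = namedtuple("Hit", ["subject_id", "evalue", "bitscore"])

# Fixed threshold proposed in this thread.
EVALUE_THRESHOLD = 1e-10


def best_hit(hits, evalue_threshold=EVALUE_THRESHOLD):
    """Discard hits above the e-value threshold, then return the hit
    with the highest score, or None if nothing passes the filter."""
    significant = [h for h in hits if h.evalue <= evalue_threshold]
    if not significant:
        return None
    return max(significant, key=lambda h: h.bitscore)


def prefer_extended(predicted_hits, extended_hits,
                    evalue_threshold=EVALUE_THRESHOLD):
    """Return True when the extended sequence has the higher-scoring
    significant hit, False otherwise (including when it has none)."""
    predicted = best_hit(predicted_hits, evalue_threshold)
    extended = best_hit(extended_hits, evalue_threshold)
    if extended is None:
        return False
    if predicted is None:
        return True
    return extended.bitscore > predicted.bitscore
```

With this ordering, the e-value only acts as a gate, so an extended sequence with a longer alignment and a slightly worse e-value can still win on score as long as it stays under the threshold.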