Closed bebatut closed 6 years ago
The e-value only measures how likely an alignment is to arise by chance (strictly, it is the expected number of chance hits with an equal or better score). We must set a threshold below which we consider an alignment to be non-random. There are ~55 million sequences in Uniref90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.release_note), so an e-value of 1/55000000 (~1.8e-8) or less should ensure a non-random alignment over the whole database.
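The arithmetic behind that cutoff can be sketched in a few lines. The function name `evalue_threshold` is hypothetical and not part of the repository; the sequence count is the approximate figure quoted above.

```python
def evalue_threshold(n_sequences: int) -> float:
    """Return 1 / n_sequences, the cutoff proposed in this issue.

    Hypothetical helper for illustration only.
    """
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    return 1.0 / n_sequences


# ~55 million sequences, the approximate Uniref90 size quoted above
threshold = evalue_threshold(55_000_000)
print(f"{threshold:.1e}")  # 1.8e-08
```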
Yet this information alone does not take into account the alignment length, the number of mismatches, or their location. The score may be a better metric.
We can automate the extraction of the threshold by checking the number of sequences in the Uniref90 database automatically.
And adapt `src/check_pyl_protein.py` to check the e-value against the threshold first, and the score value afterwards.
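One possible way to obtain the sequence count automatically is to count the header lines of the database FASTA file rather than parse the release note. This is only a sketch: `count_fasta_sequences` and `threshold_from_fasta` are hypothetical names, not functions from `src/check_pyl_protein.py`, and the two-record input below is toy data.

```python
import io


def count_fasta_sequences(lines) -> int:
    """Count records in FASTA text by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def threshold_from_fasta(lines) -> float:
    """Derive the 1/N e-value cutoff from an iterable of FASTA lines."""
    n_sequences = count_fasta_sequences(lines)
    if n_sequences == 0:
        raise ValueError("no sequences found")
    return 1.0 / n_sequences


# Toy input standing in for the real database file
toy_fasta = io.StringIO(">seq1\nMKV\n>seq2\nMLA\n")
print(threshold_from_fasta(toy_fasta))  # 0.5
```

For the real database, the same functions could be fed an open file handle, for instance one returned by `gzip.open(path, "rt")` if the FASTA file is compressed.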
We could also simply set the e-value threshold to 1e-10, which is strict enough for both Uniref90 and Uniref100 and easier to compute :smile: We can then filter the alignment results by e-value and select the hit with the best score.
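A minimal sketch of that filter-then-select step, assuming each alignment result has already been parsed into a record with an e-value and a score. The `Hit` type, its field names, and `best_hit` are hypothetical, and the values in the example are made up for illustration.

```python
from typing import Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical record for one alignment result."""
    subject: str
    evalue: float
    bitscore: float


def best_hit(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Optional[Hit]:
    """Keep hits with e-value <= max_evalue, then return the best-scoring one."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    if not kept:
        return None  # nothing passed the e-value filter
    return max(kept, key=lambda h: h.bitscore)


# Made-up values, only to show the behaviour
hits = [
    Hit("A", 1e-50, 200.0),
    Hit("B", 1e-12, 250.0),
    Hit("C", 1e-5, 300.0),  # best score, but fails the e-value filter
]
print(best_hit(hits).subject)  # B
```

Filtering before ranking means a high-scoring hit with a weak e-value (hit `C` here) can never be selected.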
Issue
If an extended sequence has a longer alignment to a protein sequence in the database but with some mismatches, the e-value for the predicted sequence (without extension) may be better than the e-value for the extended sequence.
Easy solution
Validation
More complex solution
@keuv-grvl What do you think?