flatironinstitute / deepblast

Neural Networks for Protein Sequence Alignment
BSD 3-Clause "New" or "Revised" License

Questions about evaluation metrics for sequence alignment #153

Closed hughplay closed 3 months ago

hughplay commented 7 months ago

Hi,

Your work on sequence alignment is excellent and inspiring.

Recently, I tested DeepBLAST on MALIDUP and MALISAM and found that the results are indeed great. However, I am confused about how the F1 score in Table 2 is computed. I tried to reproduce the score with my own evaluation pipeline, and also by computing the F1 from the tp, fp, and fn returned by the function alignment_score, but both results are far from the value given in the table. I think there must be some mistake in my evaluation code.

The code for evaluating one sample based on alignment_score is like this:

# Import paths below are assumed; adjust to match your deepblast installation.
from deepblast.utils import load_model
from deepblast.score import alignment_score

EPS = 1e-8

model = load_model("deepblast-v3.ckpt", "prot_t5_xl_uniref50").cuda()

primary_sequence = ...   # query protein sequence
target_sequence = ...    # target protein sequence
true_alignment = ...     # reference alignment for this pair
pred_alignment = model.align(primary_sequence, target_sequence)

# The first three entries returned by alignment_score are tp, fp, fn.
scores = alignment_score(true_alignment, pred_alignment)
tp, fp, fn = scores[0], scores[1], scores[2]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall + EPS)

Could you please provide guidance on the correct method for calculating the F1 score?

Thank you!

mortonjt commented 7 months ago

Hi, it looks like you are using the same methods that I used for evaluation. The original notebooks can be found on the Zenodo archive linked in the paper: https://doi.org/10.5281/zenodo.7731163

hughplay commented 7 months ago

Thank you very much! I have reproduced the results.

The reason I got wrong scores is that I first converted the alignment into another format, and my conversion function was not well tested, so I obtained wrong alignment states for computing the scores. 😭

hughplay commented 7 months ago

I'm back again.

I find that the alignment score seems to be weird in some cases. From my observation, it happens when the alignment states start with "21:", for example (MALIDUP, d1knca):

manual
SSITRSSVLDQEQLWGTLLASAAATRNPQVLADIGAEATDH-LSAAARHAALGAAAIMGMNNVFYRGRGFLE
:::::::::::::::::::::::::::::::::::::::::1::::::::::::::::::::::::::::::
MNIIANPGIPKANFELWSFAVSAINGCSHCLVAHEHTLRTVGVDREAIFEALKAAAIVSGVAQALATIEALS

deepblast
S-SITRSSVLDQEQLWGTLLASAAATRNPQVLADIGAEATDH-LSAAARHAALGAAAIM-GMNNVFYRGRGFLE
21::::::::::::::::::::::::::::::::::::::::1::::::::::::::::1:::::::::::2::
-MNIIANPGIPKANFELWSFAVSAINGCSHCLVAHEHTLRTVGVDREAIFEALKAAAIVSGVAQALATIEA-LS

true_edges:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 21), (22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29), (30, 30), (31, 31), (32, 32), (33, 33), (34, 34), (35, 35), (36, 36), (37, 37), (38, 38), (39, 39), (40, 40), (41, 40), (42, 41), (43, 42), (44, 43), (45, 44), (46, 45), (47, 46), (48, 47), (49, 48), (50, 49), (51, 50), (52, 51), (53, 52), (54, 53), (55, 54), (56, 55), (57, 56), (58, 57), (59, 58), (60, 59), (61, 60), (62, 61), (63, 62), (64, 63), (65, 64), (66, 65), (67, 66), (68, 67), (69, 68), (70, 69), (71, 70)]
pred_edges:
[(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (11, 10), (12, 11), (13, 12), (14, 13), (15, 14), (16, 15), (17, 16), (18, 17), (19, 18), (20, 19), (21, 20), (22, 21), (23, 22), (24, 23), (25, 24), (26, 25), (27, 26), (28, 27), (29, 28), (30, 29), (31, 30), (32, 31), (33, 32), (34, 33), (35, 34), (36, 35), (37, 36), (38, 37), (39, 38), (40, 39), (41, 40), (42, 40), (43, 41), (44, 42), (45, 43), (46, 44), (47, 45), (48, 46), (49, 47), (50, 48), (51, 49), (52, 50), (53, 51), (54, 52), (55, 53), (56, 54), (57, 55), (58, 56), (59, 56), (60, 57), (61, 58), (62, 59), (63, 60), (64, 61), (65, 62), (66, 63), (67, 64), (68, 65), (69, 66), (70, 67), (70, 68), (71, 69), (72, 70)]

DeepBLAST predicts pretty well in this case, but the F1 score is 0. I am confused about the evaluation method. What are the edges? Why do we need to compute the edges first? And why is the F1 score 0 in this case?

mortonjt commented 3 months ago

Hi, the edges are the match coordinates between the two sequences.
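
For illustration, here is a minimal sketch of how match coordinates can be read off a pair of gapped sequences. This is only conceptual: the edge lists printed above also contain entries at gap columns (repeated indices), so the actual convention used in the evaluation notebooks may differ from this simplification.

def gapped_to_match_edges(aligned_a, aligned_b):
    # Walk the alignment column by column and record the (i, j)
    # residue-index pairs for columns where both sequences have a residue.
    edges = []
    i = j = 0
    for a, b in zip(aligned_a, aligned_b):
        if a != '-' and b != '-':
            edges.append((i, j))
        if a != '-':
            i += 1
        if b != '-':
            j += 1
    return edges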

Regarding the F1 score: if there is an off-by-one error, the F1 score can be zero even if the structural similarity is preserved. This is why the F1 score isn't a great metric (TM-score is more robust).
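
As a hedged sketch of that failure mode (plain set-based scoring, which may differ in detail from what alignment_score does internally): shifting every predicted pair by one residue empties the intersection with the reference pairs, so tp = 0 and the F1 drops to 0 even though the alignment is essentially right.

def edge_f1(true_edges, pred_edges):
    # Exact-match F1 over alignment edges treated as sets of (i, j) pairs.
    t, p = set(true_edges), set(pred_edges)
    tp = len(t & p)
    fp = len(p - t)
    fn = len(t - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

reference = [(i, i) for i in range(10)]
shifted = [(i, i + 1) for i in range(10)]  # every pair off by one residue
print(edge_f1(reference, reference))  # 1.0
print(edge_f1(reference, shifted))    # 0.0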

Regarding the edge alignments, indeed there are some weird edge cases. This is partially due to the quirks surrounding indels -- the current gap-position-specific scoring setup isn't ideal. And we don't have a concept of affine gap scoring (it turns out to be highly non-trivial to set up for differentiable dynamic programming). See the DEDAL paper for a discussion of this.
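
For reference, the affine gap cost alluded to above is the textbook formulation sketched below; the constants are illustrative only, and nothing like this is implemented in deepblast.

def affine_gap_cost(gap_length, gap_open=11.0, gap_extend=1.0):
    # Opening a gap is penalized once; each additional residue in the
    # same gap costs the (cheaper) extension penalty.
    if gap_length <= 0:
        return 0.0
    return gap_open + gap_extend * (gap_length - 1)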

Despite these setbacks, these edge cases don't seem to strongly affect the TM-score, since the superposition is still roughly the same.