Updated calc val accuracy function

brendanofallon / jovian

Detecting variants in NGS short read data by sequence-to-sequence modeling

Other

7 stars 1 forks source link

Updated calc val accuracy function #18

Closed abolia closed 2 years ago

abolia commented 2 years ago

This PR updates the "calc_val_accuracy" task in train.py. The new option now calculates mean variant count which is obtained after a Smith-Watterman alignment between predicted and target sequences and then count the number of variants. It also fixes the bug in shuffling loader where it was dropping reference read.

brendanofallon commented 2 years ago

That's a great idea Jacob! I forgot we had easy access to the ref string (it's in every sample, duh!).

abolia commented 2 years ago

This looks really good. I did notice one additional metric that might not be too hard to add or incorporate, something about FP rate. You could grab the ref string and compare it to tgtstr. If they match but the predstr shows a variant, you can add that to a sum of FPs. Then you could calc something like the FP rate for ref matching target strings. This doesn't seem critical but FPs might at least be nice to see next to mean var count.

Yeah, thats a great idea. Let me work on it.

abolia commented 2 years ago

Okay, I have added the code to calculate all three: TP, FP and FN based on ref, tgt and predicted sequence. I used the "eval_prediction" that was earlier in main.py to calculate these counts and moved it to train.py rather than duplicating the code. Let me know if you have any other idea to calculate these. I am outputting total sums rather than mean, but I can do 'mean' if you think that's better.

durtschi commented 2 years ago

Wow!, I really like this. It puts the results in an easy to relate context! I just noticed that the TP, FP, FNs might give useful PPA or PPV values. What do you think? We wouldn't have to do it here. We can also just do the calcs in a wandb plot (TP/(TP + FN), TP/(TP + FP)).

abolia commented 2 years ago

Wow!, I really like this. It puts the results in an easy to relate context! I just noticed that the TP, FP, FNs might give useful PPA or PPV values. What do you think? We wouldn't have to do it here. We can also just do the calcs in a wandb plot (TP/(TP + FN), TP/(TP + FP)).

I was thinking the same, it would be better to represent PPA/PPV instead. I changed the code to show PPA/PPV instead. The logger will have info on individual counts as well, incase one's interested.