AIforGreatGood / biotransfer

Machine learning-driven antibody design

Question about threshold value of fitness function on CovidDesignDataset.py #6

Closed deepbiolab closed 2 months ago

deepbiolab commented 1 year ago

Dear @linnlii, thanks for your team's work. This project is really helpful to me, and I am trying to understand your code together with the data you published. Your paper says: "the estimated binding affinity aff(x) of the sequence x is better than the threshold σ. The threshold was set to the averaged assayed value of Ab-14 in the training data." Based on this, I found that the value is set in CovidDesignDataset.py, in a variable called

seed_sequence_value = {
        "14H": 1.3445717628028468,
        "14L": 1.3445717628028468, #mu=1.0022633, std=0.17974249 (predicted)
        "91H": 1.88361976098417,
        "95L": 2.3378946100102533,
    }

However, using dataset 1 from this link (https://github.com/mit-ll/AlphaSeq_Antibody_Dataset), I cannot reproduce the mean value 1.3445717628028468. Was this value calculated from the data corresponding to Ab-14 (both heavy and light chains), and was a special method used for the calculation?

linnlii commented 1 year ago

Thanks for your question. One thing to note about the raw data is that the experimental values for the heavy chain of the candidate were identified as outliers, while the experimental values for the light chain of the candidate are more consistent with what we know from the phage display. Since the measurements for both the heavy and light chains correspond to the same scFv, we use the values for 14L in the training data. This specific problem is due to the uncertainty inherent in high-throughput experimentation. That being said, the mean sequence value for both 14L and 14H should be ~1.684, which is consistent with what we observed in the validation data. The values in the data file are outdated and will be updated soon.
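As a rough illustration of the calculation described above, here is a minimal sketch that averages only the light-chain measurements of a seed and reuses that mean as the threshold for both chains. The record layout, the `seed_threshold` helper, and all numeric values are hypothetical placeholders, not the repository's code or numbers from the AlphaSeq dataset.

```python
from statistics import mean

# Hypothetical records: (seed_id, chain_library, measured_value or None).
# The values are made-up placeholders for illustration only.
records = [
    ("14", "H", 0.9),   # heavy-chain measurements (treated as outliers in the reply above)
    ("14", "H", None),  # missing measurement
    ("14", "L", 1.6),
    ("14", "L", 1.8),
    ("91", "H", 1.9),
]


def seed_threshold(records, seed_id, chain_library):
    """Average the non-missing measurements for one seed and one chain library."""
    values = [
        value
        for seed, library, value in records
        if seed == seed_id and library == chain_library and value is not None
    ]
    if not values:
        raise ValueError(f"no measurements for seed {seed_id}{chain_library}")
    return mean(values)


# Both chains belong to the same scFv, so the light-chain mean is used for both keys.
sigma = seed_threshold(records, "14", "L")
seed_sequence_value = {"14H": sigma, "14L": sigma}
print(seed_sequence_value)
```

Sharing one value across both keys mirrors how the original dictionary lists the same number for "14H" and "14L"; how the real pipeline filters outliers or missing values is not specified in this thread.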