ai4protein / Pro-FSFP

Pro-FSFP: Few-Shot Protein Fitness Prediction
GNU General Public License v3.0

Confused about the MSA in this codebase #9

Open Amshoreline opened 3 hours ago

Amshoreline commented 3 hours ago

In the paper, the authors explain that during the "Search for relevant experimental datasets" phase, they selected two proteins from ProteinGym that are most similar to the target protein. For the third protein, they used MSA to find "mutation data" and applied GEMME to generate pseudo-labeled target data.

However, I couldn't find any MSA processing in the provided codebase. Specifically, in line 352 of this file, args.meta_tasks is set to 3 and args.augment is set to True. As far as I can tell, this means the third protein is also selected from ProteinGym, and its labels are then replaced with pseudo-labels generated by GEMME.
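To make my reading concrete, here is a minimal sketch of what I understand the code to be doing. This is not the repository's code: `build_meta_tasks`, `AuxTask`, and the three input dictionaries are hypothetical names I made up for illustration, and only `meta_tasks` and `augment` mirror the arguments mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AuxTask:
    name: str            # name of the auxiliary dataset (hypothetical field)
    labels: List[float]  # fitness labels used for meta-training
    label_source: str    # "experimental" or "pseudo (GEMME)"


def build_meta_tasks(
    similarity: Dict[str, float],
    experimental_labels: Dict[str, List[float]],
    gemme_scores: Dict[str, List[float]],
    meta_tasks: int = 3,
    augment: bool = True,
) -> List[AuxTask]:
    """Assumed behavior: pick the `meta_tasks` datasets most similar to the
    target; if `augment` is set, swap the labels of the last one for GEMME
    pseudo-labels. No MSA search happens anywhere in this sketch."""
    # Rank candidate datasets by similarity to the target, most similar first.
    ranked = sorted(similarity, key=similarity.get, reverse=True)[:meta_tasks]
    tasks = [
        AuxTask(name, list(experimental_labels[name]), "experimental")
        for name in ranked
    ]
    if augment and tasks:
        # Replace only the labels of the last selected task with pseudo-labels.
        last = tasks[-1]
        tasks[-1] = AuxTask(last.name, list(gemme_scores[last.name]), "pseudo (GEMME)")
    return tasks
```

If this reading is right, the third task still comes from the database rather than from an MSA search, and only its labels differ from the "no MSA" variant.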

When I reran the code following the instructions in "run.sh" (with a training set size of 40), the experiment I actually ran appears to correspond to "LTR + LoRA + MTL (no MSA)", which is described as "a variant of FSFP that does not depend on MSA to build auxiliary tasks. It replaces the third task of FSFP with another labeled dataset retrieved from the database." The expected Spearman score for this variant should be around 0.47, but the result I obtained was 0.506, which is close to the score reported for LTR + LoRA + MTL (FSFP).

Could you please explain this discrepancy? Thanks~ [image]

Amshoreline commented 3 hours ago

image