huangtinglin opened this issue 2 years ago
@huangtinglin, in testing some code I've been writing recently using esm1_t6, I've noticed that different batch sizes can sometimes give slightly different results. I'm extracting embeddings (CPU-bound), and I observe the behavior whether I run the extract.py
script from ESM or my own code that wraps the model. My best guess right now is that it's related to which algorithms PyTorch selects based on the workload (see here). What are some summary statistics on your differences besides the sum (mean, median, min, max)? If they're small, I'd suspect your observations come from floating point error caused by different algorithms being used for the different workloads. The differences I typically observe are around 1e-5 or smaller. MSA1b is much larger than esm1_t6, though, so I wouldn't be surprised if slightly larger differences occurred with MSA1b, since it performs more operations to compute the representations.
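To make the comparison concrete, here is a minimal sketch of the summary statistics suggested above. It uses NumPy, and the names `reps_batched` / `reps_single` are placeholders for the two embedding tensors, not code from this thread; the toy data simulates float32-scale reordering noise rather than real model output.

```python
import numpy as np

def diff_stats(reps_batched, reps_single):
    """Summary statistics of the absolute elementwise differences
    between embeddings computed with and without batching."""
    d = np.abs(np.asarray(reps_batched) - np.asarray(reps_single))
    return {
        "mean": float(d.mean()),
        "median": float(np.median(d)),
        "min": float(d.min()),
        "max": float(d.max()),
    }

# Toy example: perturb one tensor by noise around 1e-5 to mimic
# the magnitude of differences attributable to algorithm choice.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 128)).astype(np.float32)
b = a + rng.uniform(-1e-5, 1e-5, a.shape).astype(np.float32)
stats = diff_stats(a, b)
# If "max" is around 1e-5 or smaller, the mismatch is consistent
# with floating point error rather than a logic bug.
```

If the max difference is much larger than ~1e-5, floating point non-determinism is unlikely to be the whole story.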
Thanks, @brucejwittmann. Actually, I found that the discrepancy is due to a scaling factor that depends on the number of rows in the MSA. I have opened a new issue about this, which can be found at https://github.com/facebookresearch/esm/issues/491.
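To illustrate why a row-count-dependent scaling factor breaks batch invariance: as I understand the linked issue, the MSA transformer's tied row attention divides attention logits by sqrt(num_rows * head_dim), and when batching pads an MSA to the depth of the deepest MSA in the batch, the padded row count enters that scale. The sketch below is a simplified illustration under that assumption, not the library's actual code:

```python
import numpy as np

def tied_row_attention_scale(num_rows, head_dim):
    # Tied row attention scales logits by 1/sqrt(num_rows * head_dim)
    # instead of the usual 1/sqrt(head_dim).
    return 1.0 / np.sqrt(num_rows * head_dim)

head_dim = 64
# An MSA with 4 sequences, run on its own:
scale_alone = tied_row_attention_scale(4, head_dim)
# The same MSA padded to 16 rows to match the deepest MSA in a batch:
scale_padded = tied_row_attention_scale(16, head_dim)
# The scales differ by a factor of 2 here, so the attention outputs
# (and hence the representations) cannot match exactly.
```

Because the padded row count changes the scale itself, the resulting differences are systematic rather than mere float noise.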
Bug description I am running the pretrained MSA transformer (esm_msa1b_t12_100M_UR50S) on MSAs of different depths (numbers of sequences) and lengths to generate representations. Following the example in the README, I apply batch_converter to the MSAs to obtain a padded token tensor. But the representations the transformer produces for the batch do not match the results when the same MSAs are fed into the model one at a time.
Reproduction steps Here is a simple example.
Expected behavior A given MSA's representation should be identical whether it is fed into the MSA transformer on its own or batched together with other MSAs.
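A check expressing this expectation could look like the following sketch (the names and tolerances are hypothetical, not from the issue). An exact-match expectation (atol=0) would flag even benign float reordering, so in practice a small tolerance on the order of the differences discussed above is typical:

```python
import numpy as np

def check_batch_invariance(rep_alone, rep_batched, atol=0.0):
    """Return True if the representation of one MSA matches (within
    atol) whether it was computed alone or inside a batch."""
    return np.allclose(rep_alone, rep_batched, rtol=0.0, atol=atol)

# Toy stand-ins for the two representation tensors:
x = np.ones((3, 5), dtype=np.float32)
same = check_batch_invariance(x, x)          # identical inputs pass
diff = check_batch_invariance(x, x + 1e-3)   # a real mismatch fails
```

With a row-count-dependent scaling factor in play, this check fails for padded batches regardless of tolerance choice, which distinguishes it from ordinary floating point noise.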
Logs
Additional context