Closed yzhang-github-pub closed 1 year ago
I have seen similar behavior (many repeated tokens) in other settings.
Interesting case study, thanks for sharing! Let's move this to discussions since this may be just a weird artefact of forcing that first token, or an issue with the forced sampling script. In either case not really anything actionable I can see from our side.
To express proteins designed by inverse folding, the first residue must be 'M' (methionine), which is encoded by the start codon. So I want to fix the first residue. I tested the modified version of 'sample_sequences.py' by @martinpacesa from #236 on a short protein sequence of 86 residues (it starts with 'M', which is the only 'M' in the sequence). Five sequences were sampled using the original script, and the counts of 'M' per sequence are [0, 1, 0, 1, 1]. Using the modified version, the counts are [15, 1, 37, 31, 1]. Temperature was set to 0.5.
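For anyone who wants to reproduce the per-sequence 'M' counts, here is a minimal sketch of how they could be tallied. It is not part of 'sample_sequences.py'; the helper names (`count_residue`, `longest_run`, `summarize`) are hypothetical, and the sequences in the usage note are made-up placeholders rather than the 86-residue protein or any real sampler output.

```python
from itertools import groupby


def count_residue(seq, residue="M"):
    """Count occurrences of a one-letter residue code in a sequence."""
    return seq.upper().count(residue.upper())


def longest_run(seq):
    """Length of the longest stretch of one repeated residue (0 if empty)."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def summarize(seqs, residue="M"):
    """Per-sequence summary: forced first residue, residue count, longest repeat."""
    return [
        {
            "starts_with_residue": s.upper().startswith(residue.upper()),
            "count": count_residue(s, residue),
            "longest_run": longest_run(s),
        }
        for s in seqs
    ]
```

Calling `summarize(["MKTAYIAKQR", "MMMMKTMMMA"])` on these two placeholder strings reports one 'M' for the first and seven for the second, with a longest run of four. Running the same summary over the outputs of the original and the modified script would show whether the extra 'M' residues come in long repeats, which would match the repeated-token behavior mentioned in this thread.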