facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
3.16k stars 627 forks source link

Question about aa position counts #240

Closed Didymos056 closed 2 years ago

Didymos056 commented 2 years ago

NOTE: if this is not a bug report, please use the GitHub Discussions for support questions (How do I do X?), feature requests, ideas, showcasing new applications, etc.

Bug description I often see something like "N665K, F104X" in the console output when running code. Positioning found that when you count the mutation sites, the position corresponding to "-" will be skipped. But when the calculation grammatically changes, it will directly use seq[pos] to make assertions. Is this the cause of my problem? Whether mut_probs corresponding to pos should skip "-".

Reproduction steps

def get_mutations(seq1, seq2):
    mutations = []
    from Bio import pairwise2
    alignment = pairwise2.align.globalms(
        seq1, seq2, 5, -5, -3, -.1, one_alignment_only=True,
    )[0]
    pos = 0
    for ch1, ch2 in zip(alignment[0], alignment[1]):
        if ch1 != ch2 and ch1 != '-' and ch2 != '-':
            mutations.append('{}{}{}'.format(ch1, pos + 1, ch2))
        elif ch1 == '-' and ch2 != '-':
            mutations.append('{}ins{}'.format(pos + 1, ch2))
        elif ch1 != '-' and ch2 == '-':
            mutations.append('{}{}del'.format(ch1, pos + 1))
        # if ch1 != '-':
        pos += 1
    return mutations
def grammaticality_change(word_pos_prob, seq, mutations, args, vocabulary, model,
                          verbose=False,):
    if len(mutations) == 0:
        return 0

    mut_probs = []
    for mutation in mutations:
        if 'del' in mutation or 'ins' in mutation:
            continue
        aa_orig = mutation[0]
        aa_pos = int(mutation[1:-1]) - 1
        aa_mut = mutation[-1]
        if (seq[aa_pos] != aa_orig):
            print(mutation)
        assert(seq[aa_pos] == aa_orig)
        mut_probs.append(word_pos_prob[(aa_mut, aa_pos)])
    if len(mut_probs)==0:
        return []
    else:
        # print(mut_probs)
        return np.mean(np.log10(mut_probs))

Expected behavior Give a clear and concise description of what you expected to happen.

Logs Please paste the command line output: `*****.py:80: RuntimeWarning: invalid value encountered in log10 return np.mean(np.log10(mut_probs)) R455L

R455L

I104T

A225V

T98I

N683K`

Output goes here

Additional context Add any other context about the problem here. (like proxy settings, network setup, overall goals, etc.)

tomsercu commented 2 years ago

Sorry I don't understand the issue. The console output seems to correspond to your print statement in the code. What did you expect to happen vs what actually did happen?

Didymos056 commented 2 years ago

Sorry, my English is poor. What I want to express is that this mutation is printed because of this judgment when calculating the grammar, and this phenomenon should not be expected, right?

        if (seq[aa_pos] != aa_orig):
            print(mutation)

I think the main reason for this is simply skipping the count when the sequence contains "-", resulting in getting an error pos。

    for ch1, ch2 in zip(alignment[0], alignment[1]):
        if ch1 != ch2 and ch1 != '-' and ch2 != '-':
            mutations.append('{}{}{}'.format(ch1, pos + 1, ch2))
        elif ch1 == '-' and ch2 != '-':
            mutations.append('{}ins{}'.format(pos + 1, ch2))
        elif ch1 != '-' and ch2 == '-':
            mutations.append('{}{}del'.format(ch1, pos + 1))
        if ch1 != '-':
            pos += 1
    return mutations
tomsercu commented 2 years ago

Sorry it's still unclear to me -- this seems to be an issue with the input data? In that case, this may be the wrong place to ask this question..