J-SNACKKB / FLIP

A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
Academic Free License v3.0

Wrong deletion masking for AAV task? #18

Open dlnp2 opened 2 years ago

dlnp2 commented 2 years ago

@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.

I found that some deletion masks may not have been properly applied to the wild-type sequence: as the image below shows, there are 29 sequences with different mutation_mask values but the same full_aa_sequence as the wild type. Is this the intended result?

[Image: Screenshot 2022-08-23 17 23 25]

Below is the code for replication:

import pandas as pd
from Bio import SeqIO

# Wild-type sequence: first record in the P03135 FASTA file
wt_seq = str(next(SeqIO.parse("P03135.fasta", "fasta")).seq)
variant_effects = pd.read_csv("full_data.csv")
# Rows whose full sequence is identical to the wild type
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]
wild_types

alex-hh commented 1 year ago

I believe these may be sequences containing stop codons, which are sometimes represented with '*'; this is also implied by these sequences having the value 'stop' in the category column. There are a few extra variants containing stop codons that end up with different sequences from those above because they also contain other mutations. If that's right, then I think (i) all such variants should be excluded from all splits, since models do not encode the stop codon and so cannot predict the fitness of these sequences, and (ii) the README file https://github.com/J-SNACKKB/FLIP/tree/main/splits/aav should be corrected to say that "*" in the mutation mask and mutated region means a stop codon, not a deletion.

To identify all such rows:

import pandas as pd

variant_effects = pd.read_csv("full_data.csv")
stop_variants = variant_effects[variant_effects["category"] == "stop"]

This is equivalent to selecting all variants in which the mutation mask contains "*":

stop_variants = variant_effects[variant_effects["mutation_mask"].str.contains("*", regex=False, na=False)]
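
One way to check that equivalence on the real file is to count the rows picked out by only one of the two criteria. The sketch below is a minimal illustration, not part of the FLIP codebase: `compare_stop_selectors` is a hypothetical helper, and the three-row DataFrame (including the "designed" category value) is made-up data that only mimics the `category` and `mutation_mask` columns.

```python
import pandas as pd


def compare_stop_selectors(df: pd.DataFrame) -> dict:
    """Count rows selected by each stop-codon criterion (hypothetical helper)."""
    by_category = df["category"] == "stop"
    # regex=False matches "*" literally; na=False treats missing masks as non-matching
    by_mask = df["mutation_mask"].str.contains("*", regex=False, na=False)
    return {
        "both": int((by_category & by_mask).sum()),
        "category_only": int((by_category & ~by_mask).sum()),
        "mask_only": int((~by_category & by_mask).sum()),
    }


# Made-up rows for illustration only; real masks come from full_data.csv.
toy = pd.DataFrame(
    {
        "category": ["stop", "designed", "stop"],
        "mutation_mask": ["__*__", "__A__", "_*_K_"],
    }
)
counts = compare_stop_selectors(toy)
print(counts)
```

If the two selections really are equivalent on full_data.csv, `category_only` and `mask_only` should both come out as 0 when the helper is run on `variant_effects`.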

Some of these sequences contain stop codons which are effectively 'insertions' and some contain stop codons which are 'substitutions'. The two cases aren't distinguished by mutation_mask.
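
Suggestion (i) above could be applied before the splits are regenerated. The following is a sketch under stated assumptions rather than the repository's actual split code: `drop_stop_variants` is a hypothetical helper, the toy rows and the `score` column are made up, and a row is treated as a stop variant if either criterion flags it. Because mutation_mask does not distinguish the insertion and substitution cases, both are dropped together.

```python
import pandas as pd


def drop_stop_variants(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without stop-codon variants (hypothetical helper)."""
    is_stop = (df["category"] == "stop") | df["mutation_mask"].str.contains(
        "*", regex=False, na=False
    )
    return df.loc[~is_stop].reset_index(drop=True)


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "category": ["stop", "designed", "stop", "designed"],
        "mutation_mask": ["__*__", "__A__", "_*_K_", "_____"],
        "score": [-3.1, 0.4, -2.8, 0.0],
    }
)
kept = drop_stop_variants(toy)
print(len(toy) - len(kept), "stop variants removed")
```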