BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

Type of Family Classification Dataset #266

Closed mahdip72 closed 8 months ago

mahdip72 commented 8 months ago

@timkucera I have a question. The paper says that family classification is a multi-class task, but some proteins have multiple Pfam annotations. For example, I have the following labels for one protein:

['PF02324', 'PF19127']

Which annotation should I use? Or is it actually a multi-label task?

timkucera commented 8 months ago

You’re correct that some proteins have multiple labels. In the paper we treat it as a multi-class task and only consider the first label (you can achieve this with a transform or pre_transform). In the future we will likely change this task to include only proteins where the family assignment is clear. You can already do that yourself with a pre_filter. Examples of both are below.

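from proteinshake.tasks import ProteinFamilyTask

# Option 1: keep every protein, but use only its first Pfam annotation as the label.
def first_family_transform(data, protein_dict):
    return data, protein_dict["protein"]["Pfam"][0]

task = ProteinFamilyTask().to_point().np(pre_transform=first_family_transform)

# Option 2: keep only proteins that have exactly one Pfam annotation.
def single_family_filter(data, protein_dict):
    return len(protein_dict["protein"]["Pfam"]) == 1

task = ProteinFamilyTask().to_point().np(pre_filter=single_family_filter)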
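
To see what each choice does to the labels, here is a minimal sketch on mock records. It assumes only the nested layout used in the snippets above (protein_dict["protein"]["Pfam"] is a list of Pfam IDs) and does not call proteinshake. The "ID" key, the records "B" and "C", and the helper names are hypothetical and chosen only for illustration. The multi_hot helper shows what a multi-label target could look like; it is not something the paper or the library is stated to do.

```python
# Mock records mimicking protein_dict["protein"]["Pfam"] from the thread.
# Record "A" uses the labels from the question; "B", "C" and the "ID" key are hypothetical.
proteins = [
    {"protein": {"ID": "A", "Pfam": ["PF02324", "PF19127"]}},
    {"protein": {"ID": "B", "Pfam": ["PF00069"]}},
    {"protein": {"ID": "C", "Pfam": ["PF19127"]}},
]


def first_family(protein_dict):
    """Multi-class label: keep only the first Pfam annotation."""
    return protein_dict["protein"]["Pfam"][0]


def has_single_family(protein_dict):
    """True when the family assignment is unambiguous (exactly one annotation)."""
    return len(protein_dict["protein"]["Pfam"]) == 1


def multi_hot(protein_dict, vocabulary):
    """Multi-label target: 1 for every family the protein is annotated with."""
    families = set(protein_dict["protein"]["Pfam"])
    return [1 if family in families else 0 for family in vocabulary]


# Strategy 1: one label per protein, all proteins kept.
first_labels = [first_family(p) for p in proteins]

# Strategy 2: drop proteins with more than one annotation.
unambiguous_ids = [p["protein"]["ID"] for p in proteins if has_single_family(p)]

# Alternative (not used in the paper): encode all annotations as a multi-hot vector.
vocabulary = sorted({family for p in proteins for family in p["protein"]["Pfam"]})
targets = [multi_hot(p, vocabulary) for p in proteins]
```

With the first strategy, protein "A" is labeled PF02324 and its second annotation is ignored; with the second strategy, "A" is removed from the dataset entirely. Which trade-off is acceptable depends on how many multi-family proteins the split contains.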