Closed mahdip72 closed 8 months ago
You’re correct that some proteins have multiple labels. In the paper we treat family classification as a multi-class task and only consider the first label; you can reproduce this with a transform or pre_transform. In the future we will likely change this task to include only proteins whose family assignment is unambiguous. You can already do that yourself with a pre_filter. Examples below.
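from proteinshake.tasks import ProteinFamilyTask

# Option 1: keep every protein, but use only the first Pfam label.
def first_family_transform(data, protein_dict):
    return data, protein_dict["protein"]["Pfam"][0]

task = ProteinFamilyTask().to_point().np(pre_transform=first_family_transform)

# Option 2: keep only proteins with exactly one Pfam label.
def single_family_filter(data, protein_dict):
    return len(protein_dict["protein"]["Pfam"]) == 1

task = ProteinFamilyTask().to_point().np(pre_filter=single_family_filter)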
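To see how the two strategies differ without downloading the dataset, here is a minimal sketch on toy records. It assumes only the `protein_dict["protein"]["Pfam"]` layout used above; the `ID` field, the helper names, and the records themselves are made up for illustration and are not part of the library.

```python
# Toy records mimicking the protein_dict layout; "ID" is a hypothetical field.
proteins = [
    {"protein": {"ID": "A", "Pfam": ["PF02324", "PF19127"]}},  # two annotations
    {"protein": {"ID": "B", "Pfam": ["PF00069"]}},             # one annotation
]

def first_family(protein_dict):
    """Multi-class label: take the first Pfam annotation only."""
    return protein_dict["protein"]["Pfam"][0]

def has_single_family(protein_dict):
    """True when the family assignment is unambiguous."""
    return len(protein_dict["protein"]["Pfam"]) == 1

# Strategy 1: every protein is kept and gets its first label.
first_labels = [first_family(p) for p in proteins]

# Strategy 2: proteins with more than one annotation are dropped.
single_only = [p["protein"]["ID"] for p in proteins if has_single_family(p)]

print(first_labels)
print(single_only)
```

The first strategy keeps the dataset size unchanged but discards the extra annotations; the second shrinks the dataset and leaves only proteins with a single label.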
@timkucera I have a question: the paper says that family classification is a multi-class task, but some proteins have multiple Pfam annotations. For example, I have the following labels for one protein:
['PF02324', 'PF19127'] Could you explain which annotation I should consider? Or is it actually a multi-label task?