Open alexcg1 opened 2 years ago
Max is right: evaluation has been the biggest blocker up to now.
I believe a good baseline strategy would be to compare the search results against CATH families. CATH is a large, expert-curated classification dataset for proteins. The idea is that similar proteins should belong to the same CATH family, so we could check whether the model's embeddings reflect this and even define a numerical metric from it.
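One way such a numerical metric might look is precision@k: for each query, the fraction of the top k search results that share the query's CATH family. This is only a sketch of the idea, not what the app does today. The function names, the `results` and `cath_family` dictionaries, and the protein IDs and family labels in the toy data are all hypothetical placeholders.

```python
from typing import Dict, List


def cath_precision_at_k(
    query_id: str,
    ranked_ids: List[str],
    cath_family: Dict[str, str],
    k: int = 10,
) -> float:
    """Fraction of the top-k results that share the query's CATH family."""
    query_family = cath_family[query_id]
    # Drop the query itself in case the index returns it as its own neighbour.
    top_k = [r for r in ranked_ids if r != query_id][:k]
    if not top_k:
        return 0.0
    hits = sum(1 for r in top_k if cath_family.get(r) == query_family)
    return hits / len(top_k)


def mean_precision_at_k(
    results: Dict[str, List[str]],
    cath_family: Dict[str, str],
    k: int = 10,
) -> float:
    """Average precision@k over all queries."""
    scores = [
        cath_precision_at_k(q, ranked, cath_family, k)
        for q, ranked in results.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: IDs and family labels are made up for illustration only.
cath_family = {"P1": "famA", "P2": "famA", "P3": "famB", "P4": "famA"}
results = {"P1": ["P2", "P3", "P4"], "P3": ["P1", "P2"]}

print(mean_precision_at_k(results, cath_family, k=2))  # 0.25
```

A score near 1.0 would mean the nearest neighbours in embedding space mostly fall in the query's own family. The same function could be run on a second retrieval method to compare the two.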
@fissoreg I like this idea. The ProtTrans paper authors wrote that Evolutionary Information was the previous best technology before NLP overtook it. A comparison between the two could be insightful.
There could be merit in running a sequence alignment between the query protein and the top X results to give an additional metric. I imagine the alignment scores could be interesting to academics and simple to implement.
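To make the alignment idea concrete, here is a minimal sketch that scores the query against each of the top X hits with a plain Needleman-Wunsch global alignment. It assumes a simplified scoring scheme (match +1, mismatch -1, linear gap -1). A real evaluation would more likely use a substitution matrix such as BLOSUM62 and an established alignment library. The function names and example sequences are hypothetical.

```python
from typing import List


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def alignment_scores(query_seq: str, hit_seqs: List[str], top_x: int = 5) -> List[int]:
    """Score the query against each of the first top_x search results."""
    return [nw_score(query_seq, hit) for hit in hit_seqs[:top_x]]


# Toy sequences, made up for illustration only.
print(alignment_scores("MKV", ["MKV", "MKL", "AAA"], top_x=2))  # [3, 1]
```

Averaging these scores over the top X results gives one number per query, which could be reported next to the search ranking.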
Does someone want to reach out to the paper's authors? Since you two wrote the app, I'm happy for you to take point. Otherwise, I can assign someone from my team.