georgeamccarthy / protein_search

The neural search engine for proteins.

GNU Affero General Public License v3.0

15 stars, 6 forks

Results quality? #61

Open alexcg1 opened 2 years ago

alexcg1 commented 2 years ago

Describe the issue:

I was chatting with Max at Jina, and he mentioned that it's difficult to ascertain the quality of results for the protein search. Do you know anyone I could reach out to who might be able to evaluate that? I'm happy to get a hosted instance up and running for that purpose, and if it returns quality results we can keep it up as a tech demo.

fissoreg commented 2 years ago

Max is right, the evaluation has been the biggest blocker up to now.

I believe a good baseline strategy would be to compare the search results against CATH, a large expert-curated classification of proteins. The idea is that similar proteins should belong to the same CATH family, so a query's nearest neighbours in the model's embedding space should mostly share its family. That gives us something we can check in the embeddings and turn into a numerical metric.
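One way such a metric could look is a mean precision@k over family labels: for each protein, take its k nearest neighbours by cosine similarity and count how many share its family. This is only a minimal sketch. The function names, the toy vectors, and the labels `F1`/`F2` are made up for illustration; real labels would have to come from the CATH data, and real vectors from the search engine's encoder.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def family_precision_at_k(
    embeddings: Dict[str, Sequence[float]],
    families: Dict[str, str],
    k: int,
) -> float:
    """Mean fraction of each protein's top-k neighbours sharing its family."""
    per_query = []
    for query_id, query_vec in embeddings.items():
        # Rank every other protein by similarity to the query.
        ranked = sorted(
            (
                (cosine(query_vec, vec), pid)
                for pid, vec in embeddings.items()
                if pid != query_id
            ),
            reverse=True,
        )
        top = ranked[:k]
        if not top:
            continue
        hits = sum(1 for _, pid in top if families[pid] == families[query_id])
        per_query.append(hits / len(top))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical toy data: two families with two proteins each.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
toy_families = {"a": "F1", "b": "F1", "c": "F2", "d": "F2"}

p_at_1 = family_precision_at_k(toy_embeddings, toy_families, k=1)
p_at_3 = family_precision_at_k(toy_embeddings, toy_families, k=3)
```

A score close to the fraction expected by chance would suggest the embeddings carry little family signal, while a score near 1.0 would suggest the search returns same-family proteins first.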

georgeamccarthy commented 2 years ago

@fissoreg I like this idea. The ProtTrans paper authors wrote that evolutionary information was the previous best approach before NLP overtook it. A comparison between the two could be insightful.

There could be merit in running a sequence alignment between the query protein and the top X results to give an additional metric. I imagine this could be interesting to academics and simple to implement.
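As a rough illustration of the alignment idea, here is a plain Needleman-Wunsch global alignment that returns a score and a percent identity for a query against each hit. It is a sketch under simplifying assumptions: the match/mismatch/gap values are toy numbers rather than a protein substitution matrix such as BLOSUM62, the function names are hypothetical, and the sequences are invented. A real evaluation would more likely use an established alignment tool.

```python
from typing import List, Tuple


def global_alignment(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> Tuple[int, float]:
    """Needleman-Wunsch: return (alignment score, fraction of identical columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back one optimal path to count identical columns.
    i, j = n, m
    identical = 0
    columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    identity = identical / columns if columns else 0.0
    return score[n][m], identity


def score_hits(query: str, hits: List[str]) -> List[Tuple[str, int, float]]:
    """Align the query against each returned sequence, keeping the search order."""
    return [(hit, *global_alignment(query, hit)) for hit in hits]


# Hypothetical sequences standing in for a query and its top results.
results = score_hits("ACDE", ["ACDE", "ACE"])
```

Reporting the identity next to each search result would let a reader see whether highly ranked hits are also close at the sequence level, or whether the embeddings are finding similarity that alignment alone misses.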

alexcg1 commented 2 years ago

Does someone want to reach out to the paper's authors? Since you two wrote the app, I'm happy for you to take point. Otherwise, I can assign someone from my team.