OpenBioML / protein-lm-scaling


Download and preprocess Uniref50 #1

Closed pascalnotin closed 1 year ago

talkhanz commented 1 year ago

Hey @pascalnotin. This looks fun and I can work on it, but do let me know what preprocessing steps are required.

zanussbaum commented 1 year ago

I actually have it processed here: https://huggingface.co/datasets/zpn/uniref50 but someone might want to double check that I did it correctly. I believe I filtered out O and U sequences?
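
For reference, a minimal sketch of that kind of filter with the HF `datasets` library; the `sequence` column name here is an assumption, so the actual schema on the hub should be checked first:

```python
# Minimal sketch of filtering out sequences that contain the rare amino
# acids O (pyrrolysine) or U (selenocysteine).  The "sequence" column
# name is an assumption -- check the dataset's actual schema on the hub.
from datasets import load_dataset

ds = load_dataset("zpn/uniref50", split="train", streaming=True)
filtered = ds.filter(lambda ex: not (set(ex["sequence"]) & {"O", "U"}))

# Peek at a few surviving records.
for example in filtered.take(3):
    print(example["sequence"][:60])
```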

zanussbaum commented 1 year ago

Uniref90 is also here: https://huggingface.co/datasets/zpn/uniref90

pascalnotin commented 1 year ago

Thanks @zanussbaum ! Could you please open a PR to merge in your data processing script for U50/U90?

Muedi commented 1 year ago

I guess we need to introduce whitespace and special tokens here too? Perhaps we could add this to your script when it's uploaded, to have it in one go.

pascalnotin commented 1 year ago

@Muedi - to replace the indeterminate AAs (e.g., X, B, Z), or something else?

Muedi commented 1 year ago

I meant for the tokenizer. Good point about the undetermined AAs, though: what is usually done with them?

jamaliki commented 1 year ago

We shouldn't add whitespace; that would double the size of the databases. Our tokenizer should easily handle plain AA sequences.

Muedi commented 1 year ago

Ah, I thought the ESM and ProtBert tokenizers require whitespace, so I was not sure if it just works better that way :) Can this be changed in the tokenizer config?
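
For comparison, a small sketch of the two conventions, using the public Rostlab/prot_bert and facebook/esm2_t6_8M_UR50D checkpoints (the spacing behaviour is worth double-checking against the model cards):

```python
# Sketch of the two spacing conventions (public HF checkpoints; worth
# verifying against the model cards before relying on this).
from transformers import AutoTokenizer

seq = "MKTAYIAKQR"

# ProtBert-style tokenizers expect one space between residues.
protbert_tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
protbert_ids = protbert_tok(" ".join(seq))["input_ids"]

# The ESM tokenizer accepts the raw, unspaced sequence directly.
esm_tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm_ids = esm_tok(seq)["input_ids"]

print(protbert_ids)
print(esm_ids)
```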

zanussbaum commented 1 year ago

@pascalnotin I don't think I have access to the code anymore, and I unfortunately never committed it :(

Muedi commented 1 year ago

@zanussbaum Do you still have access to, or remember, the steps you followed for preprocessing? On the HF repo the commits say you filtered O and U. What does the "fix data" step do, for example? And does UniRef have releases/versions like, e.g., ENCODE does? If so, we'll need to record which one was used, I guess.

jamaliki commented 1 year ago

> Ah, I thought the ESM and ProtBert tokenizers require whitespace, so I was not sure if it just works better that way :)
>
> Can this be changed in the tokenizer config?

The ESM tokenizer supports plain AA sequences because it uses a trie under the hood, though it isn't super fast. The other option is that I could write a very simple Rust tokenizer for our preprocessing so that it is fast enough. This wouldn't be difficult at all :)
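
To make the idea concrete, here is a toy Python sketch of the per-character mapping such a tokenizer would implement; the vocabulary and special-token ids are illustrative assumptions, and a Rust version would just do the same lookup faster:

```python
# Toy sketch of a per-character amino-acid tokenizer.  The vocabulary
# and special-token ids are illustrative assumptions, not the project's
# actual vocab; a Rust version would implement the same lookup, only faster.
AA_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY", start=2)}
PAD_ID, UNK_ID = 0, 1

def encode(sequence: str) -> list[int]:
    """Map each residue to an integer id, falling back to UNK (e.g. for X, B, Z)."""
    return [AA_VOCAB.get(aa, UNK_ID) for aa in sequence]

print(encode("MKTAYIAKQR"))  # -> list of per-residue ids
```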

Muedi commented 1 year ago

Sounds very cool! Perhaps we should discuss this Thursday all together :)

pascalnotin commented 1 year ago

Based on the discussion yesterday, the initial preprocessing (e.g., handling of special tokens) will be lightweight. Perhaps we can close this issue and instead create two new ones revolving around:

  1. Cross-validation scheme as per https://github.com/orgs/OpenBioML/projects/8?pane=issue&itemId=35090107
  2. Data sharding (a rough sketch of the idea is below)
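
Not a decision, just a rough sketch of what the sharding step could look like with the `datasets` API (dataset name, shard count, and output paths are placeholders):

```python
# Rough sketch of sharding the processed dataset into fixed-size pieces
# with the HF `datasets` API.  Dataset name, shard count, and output
# paths are placeholders, not project decisions.
from datasets import load_dataset

ds = load_dataset("zpn/uniref50", split="train")
num_shards = 64

for idx in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=idx, contiguous=True)
    shard.to_parquet(f"uniref50_shard_{idx:03d}.parquet")
```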