Closed pascalnotin closed 1 year ago
I actually have it processed here: https://huggingface.co/datasets/zpn/uniref50 but someone might want to double check that I did it correctly. I believe I filtered out O and U sequences?
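For reference, the O/U filtering step can be done in plain Python, along these lines (a minimal sketch; the original script is not available, so the exact logic here is an assumption):

```python
# Hypothetical sketch of the O/U filtering step: drop any sequence
# containing the rare amino acids O (pyrrolysine) or U (selenocysteine).
def keep_sequence(seq: str) -> bool:
    return not any(aa in seq for aa in ("O", "U"))

sequences = ["MKTAYIAK", "MKOSEQ", "MUSEQ", "ACDEFGHIK"]
filtered = [s for s in sequences if keep_sequence(s)]
print(filtered)  # ['MKTAYIAK', 'ACDEFGHIK']
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` to run over the full UniRef dump.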
Uniref90 is also here: https://huggingface.co/datasets/zpn/uniref90
Thanks @zanussbaum ! Could you please open a PR to merge in your data processing script for U50/U90?
I guess we need to introduce whitespace and special tokens here too? Perhaps we could add this to your script once it's uploaded, so it's all handled in one go.
@Muedi - to replace the indeterminate AAs (e.g. X, B, Z), or something else?
I meant for the tokenizer. But good that you raise the issue of indeterminate AAs - what is usually done with them?

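One common option (not necessarily what we should do - ESM's vocabulary, for instance, keeps X/B/U/Z/O as distinct tokens) is to collapse the ambiguity codes onto the generic unknown residue X. A minimal sketch, assuming we normalize at preprocessing time:

```python
# Hypothetical normalization: map the ambiguity codes B (Asx), Z (Glx)
# and J (Xle) onto the generic unknown residue X. Whether this is the
# right choice depends on the tokenizer/model we settle on.
AMBIGUOUS = str.maketrans({"B": "X", "Z": "X", "J": "X"})

def normalize(seq: str) -> str:
    return seq.upper().translate(AMBIGUOUS)

print(normalize("mktBayZ"))  # MKTXAYX
```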
We shouldn't add whitespace, this will double the size of the databases. Our tokenizer should easily handle normal AA sequences.
Ah, I thought the ESM and ProtBert-tokenizers require the whitespaces, so I was not sure if it just works better that way :) Can this be changed in the config for the tokenizers?
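If a tokenizer does expect ProtBert-style space-separated residues, the spaces can also be inserted on the fly at tokenization time rather than stored in the dataset, so the database size doesn't double. A trivial sketch:

```python
# Insert spaces between residues on the fly (ProtBert-style input),
# instead of storing the spaced-out sequences on disk.
def space_out(seq: str) -> str:
    return " ".join(seq)

print(space_out("MKTAYIAK"))  # M K T A Y I A K
```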
@pascalnotin i don't think i have access to the code anymore and i unfortunately never committed it :(
@zanussbaum Do you still have access to, or remember, the steps you followed for preprocessing? On HF, the commits say you filtered O and U. What does the "fix Data" step do, for example? And is UniRef versioned, like e.g. ENCODE? If so, I guess we'll need to know which version was used.
The ESM tokenizer supports normal AA sequences because it uses a trie under the hood, though it isn't super fast. The other option is that I could write a very simple Rust tokenizer for our preprocessing so that it's fast enough. That wouldn't be difficult at all :)
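Since a standard AA vocabulary is single-character, no trie is even needed - a plain character-level lookup already tokenizes unspaced sequences. A minimal sketch (the vocabulary ordering and special-token ids here are made up, not ESM's actual ids):

```python
# Minimal character-level AA tokenizer sketch. Vocabulary ordering and
# special-token ids are hypothetical, not those of any real model.
VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWYXBUZO")}
CLS_ID, EOS_ID, UNK_ID = 25, 26, 27

def encode(seq: str) -> list:
    return [CLS_ID] + [VOCAB.get(aa, UNK_ID) for aa in seq] + [EOS_ID]

print(encode("ACD"))  # [25, 0, 1, 2, 26]
```

A Rust version of the same loop would mainly buy throughput, not different logic.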
Sounds very cool! Perhaps we should discuss this Thursday all together :)
Based on the discussion yesterday, the initial preprocessing (e.g., handling of special tokens) will be lightweight. Perhaps we can close this issue and instead create two new ones revolving around:
Hey @pascalnotin. This looks fun and I can work on it, but do let me know what preprocessing steps are required.