New task: Compile tokenizers

OpenBioML / chemnlp

ChemNLP project

MIT License


New task: Compile tokenizers #9

Open kjappelbaum opened 1 year ago

kjappelbaum commented 1 year ago

We need a mechanism that describes which columns of a dataset a tokenizer applies to (e.g., I think we could use the identifiers in meta.yaml for this).

Then, collect tokenizer implementations for SMILES, SELFIES, InChI (?), and IUPAC names (?), and describe in some way (registry pattern, decorator, ...) which data types each one applies to.
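A minimal sketch of the registry/decorator idea, to make the proposal concrete. Everything here is an assumption for illustration: the names `TOKENIZER_REGISTRY`, `register_tokenizer`, and `tokenizers_for_columns` are hypothetical, the type labels ("SMILES", "IUPAC") stand in for whatever meta.yaml ends up declaring, and the two tokenizers are placeholders rather than real chemistry-aware implementations.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]

# Hypothetical registry: data-type label -> tokenizer function.
# The labels are assumed to match what a column identifier in meta.yaml declares.
TOKENIZER_REGISTRY: Dict[str, Tokenizer] = {}


def register_tokenizer(*data_types: str) -> Callable[[Tokenizer], Tokenizer]:
    """Decorator that registers a tokenizer for one or more data types."""

    def decorator(func: Tokenizer) -> Tokenizer:
        for data_type in data_types:
            if data_type in TOKENIZER_REGISTRY:
                raise ValueError(f"Tokenizer already registered for {data_type!r}")
            TOKENIZER_REGISTRY[data_type] = func
        return func

    return decorator


@register_tokenizer("SMILES")
def char_tokenizer(text: str) -> List[str]:
    # Placeholder only: splits into single characters.
    return list(text)


@register_tokenizer("IUPAC")
def whitespace_tokenizer(text: str) -> List[str]:
    # Placeholder only: splits on whitespace.
    return text.split()


def tokenizers_for_columns(columns: Dict[str, str]) -> Dict[str, Tokenizer]:
    """Map column name -> tokenizer, skipping columns whose type has no tokenizer."""
    return {
        name: TOKENIZER_REGISTRY[data_type]
        for name, data_type in columns.items()
        if data_type in TOKENIZER_REGISTRY
    }
```

With this shape, a dataset loader would only need the column-to-type mapping: `tokenizers_for_columns({"smiles_col": "SMILES", "label": "float"})` returns a tokenizer for `smiles_col` and leaves `label` untouched.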

mauryaland commented 1 year ago

Hi there!

Here is a great paper implementing BPE tokenization for SMILES representations. Here is the GitHub repo. I will share any other good resources I find.
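For readers unfamiliar with BPE, here is a minimal sketch of the general technique applied to SMILES strings. It is not the implementation from the paper or repo mentioned above: it starts from single characters (an assumption; a chemistry-aware variant would more likely start from atom-level tokens), and the function names `learn_bpe`, `apply_bpe`, and `merge_pair` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(tokens: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of SMILES strings."""
    sequences = [list(smiles) for smiles in corpus]  # start from characters
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))  # count adjacent pairs
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def apply_bpe(smiles: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Tokenize one SMILES string by replaying the learned merges in order."""
    tokens = list(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```

For example, learning a single merge from `["CCO", "CCN", "CCC"]` picks the pair `("C", "C")`, after which `apply_bpe("CCO", merges)` yields `["CC", "O"]`.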

kjappelbaum commented 1 year ago

As discussed in the last weekly meeting, we should also consider some of the things they did in the Galactica paper:

[Image: screenshot, 2023-03-28 07:44:04]

They also represent chemical reactions using LaTeX.

[Image: screenshot, 2023-03-28 07:44:52]

I think those changes are reasonable, and we would need to process our text datasets (ChemRxiv, BioRxiv, etc.) accordingly. @MicPie, any chance we can get a copy, submodule, or similar of the BioRxiv code into this repo as well? It would help other contributors.
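As a rough sketch of what such preprocessing could look like: one convention described in the Galactica paper is wrapping SMILES strings in `[START_SMILES]` and `[END_SMILES]` marker tokens. Whether that is the exact change meant here is an assumption, as is the idea that SMILES spans in the text have already been located by some upstream step; `wrap_smiles` and `wrap_spans` are hypothetical helper names.

```python
from typing import List, Tuple

# Marker tokens as described in the Galactica paper.
START, END = "[START_SMILES]", "[END_SMILES]"


def wrap_smiles(smiles: str) -> str:
    """Wrap a single SMILES string in the marker tokens."""
    return f"{START}{smiles}{END}"


def wrap_spans(text: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap the given (start, end) character spans of `text` as SMILES.

    Assumes the spans do not overlap and were found by an upstream step
    (hypothetical; detecting SMILES in free text is not handled here).
    """
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])            # text before the span
        out.append(wrap_smiles(text[start:end]))  # the wrapped SMILES
        cursor = end
    out.append(text[cursor:])                     # trailing text
    return "".join(out)
```

For instance, wrapping the span `(9, 12)` in `"Ethanol (CCO) is a solvent."` gives `"Ethanol ([START_SMILES]CCO[END_SMILES]) is a solvent."`.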