Open kjappelbaum opened 1 year ago
As discussed in the last weekly meeting, we should also consider some of the things they did in the Galactica paper:
They also represent the chemical reaction using LaTeX.
I think those changes are reasonable, and we would need to process our text datasets (ChemRxiv, BioRxiv, etc.) accordingly. @MicPie, any chance we can get a copy/submodule/... of the BioRxiv code into this repo as well? It would help other contributors.
We need a mechanism that describes which columns of a dataset a tokenizer applies to (e.g., I think that we could use the `identifier` in `meta.yaml` for this).
Then, collect implementations for SMILES, SELFIES, InChI (?), IUPAC Name (?) tokenizers and describe in some way (registry pattern, decorator, ...) which data types each one applies to.
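
To make the registry/decorator idea more concrete, here is a rough sketch of how it could look. Everything in it is an assumption for discussion: the names `register_tokenizer`, `TOKENIZER_REGISTRY`, and `tokenize_row` are hypothetical, the `column_types` mapping stands in for whatever we end up reading from the `identifier` entries in `meta.yaml`, and the SMILES regex is just a common atom-level pattern, not necessarily what we want in the end.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]

# Hypothetical registry: data type name -> tokenizer function.
TOKENIZER_REGISTRY: Dict[str, Tokenizer] = {}


def register_tokenizer(*data_types: str) -> Callable[[Tokenizer], Tokenizer]:
    """Decorator that registers a tokenizer for one or more data types."""

    def decorator(fn: Tokenizer) -> Tokenizer:
        for data_type in data_types:
            if data_type in TOKENIZER_REGISTRY:
                raise ValueError(f"tokenizer for {data_type!r} already registered")
            TOKENIZER_REGISTRY[data_type] = fn
        return fn

    return decorator


# Assumed atom-level SMILES pattern: bracket atoms first, then two-letter
# halogens, single atoms, bonds, branches, and ring-closure digits.
_SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)


@register_tokenizer("SMILES")
def tokenize_smiles(text: str) -> List[str]:
    return _SMILES_PATTERN.findall(text)


@register_tokenizer("SELFIES")
def tokenize_selfies(text: str) -> List[str]:
    # SELFIES strings are sequences of bracketed symbols such as [C][=O].
    return re.findall(r"\[[^\]]*\]", text)


def tokenize_row(row: Dict[str, str], column_types: Dict[str, str]) -> Dict[str, object]:
    """Tokenize every column whose data type has a registered tokenizer.

    `column_types` maps column name -> data type; in practice this would be
    derived from meta.yaml (assumption). Other columns pass through unchanged.
    """
    result: Dict[str, object] = {}
    for column, value in row.items():
        tokenizer = TOKENIZER_REGISTRY.get(column_types.get(column, ""))
        result[column] = tokenizer(value) if tokenizer else value
    return result


row = {"SMILES": "C(=O)Cl", "SELFIES": "[C][C][O]", "name": "example"}
tokens = tokenize_row(row, {"SMILES": "SMILES", "SELFIES": "SELFIES"})
```

With this layout, adding an InChI or IUPAC name tokenizer would only require a new decorated function, and the dataset side only needs to declare the data type of each column.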