Kohulan / DECIMER-Image_Transformer

DECIMER Image Transformer is a deep-learning-based tool designed for automated recognition of chemical structure images. Leveraging transformer architectures, the model converts chemical images into SMILES strings, enabling the digitization of chemical data from scanned documents, literature, and patents.
MIT License

How to train this DECIMER model #32

NaphJohn closed this issue 1 year ago

NaphJohn commented 1 year ago

Hi Kohulan, while testing this model I ran into a problem with two images (test5 and test6). I used the trained model but got wrong predictions:

C(C1=[X2]=[X3]=NC2=C1N=C(N2)[R1])R(Cl)[R14]

and

CC(CC1=CN=C(C(=C1)N([R2])S(=O)(=O)C2=C(C(=C(C=C2Br)Br)[R20])[R12])[R1])C[X]

Do you have any advice on how to get high accuracy on these pictures? We want to input them and get the right SMILES. In my opinion, the predictions are wrong because no such samples were in the training data, so can I use the already-trained DECIMER V2 model for this?

OBrink commented 1 year ago

Hey @whkaikai,

I see that you have opened a similar issue on the img2mol repository (you addressed me there, but I am not working with the Bayer ML team). I have no deep insight into their training data generation pipeline as it's not openly available as far as I know.

I think the problem is the same in both cases: as you have already pointed out, these types of structures are not included in the training data. To be fair, this is a rather difficult case. What SMILES output would you expect for these examples? I assume that A and B are unspecified ring systems. Is there even a way to express this type of unspecific structure in a SMILES string? The DECIMER training data is generated using our open-source tool RanDepict. Feel free to open an issue there or to contribute to the repository to make it capable of depicting this type of structure. I cannot promise that I will integrate this type of structure into the DECIMER training data quickly myself.

If you can artificially generate enough pairs of these types of images and the corresponding SMILES representations, you should be able to fine-tune the existing DECIMER Image Transformer model on it. Just adapt the training script according to your needs. With Img2Mol, you are probably going to run into the problem that the CDDD decoder is not capable of producing SMILES with R-group variables.
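As a rough illustration of the data preparation step only (this is not part of the DECIMER training script), generated image/SMILES pairs could be shuffled reproducibly and split into training and test sets before fine-tuning. The function name `split_pairs`, the file names and the labels below are all hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (image path, target SMILES string)


def split_pairs(
    pairs: List[Pair], test_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle the pairs reproducibly and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names and labels, for illustration only.
example_pairs = [
    (f"markush_{i:04d}.png", f"[R{i}]C1=CC=CC=C1") for i in range(1, 11)
]
train_pairs, test_pairs = split_pairs(example_pairs, test_fraction=0.2)
print(len(train_pairs), "training pairs,", len(test_pairs), "test pairs")
```

Fixing the seed keeps the test set stable between runs, so results from different fine-tuning attempts stay comparable.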

bdx114514 commented 1 year ago

Dear Otto Brinkhaus,

Hi, I am whkaikai's friend. I want to fine-tune the existing DECIMER model to convert Markush structures to SMILES. RDKit cannot parse these SMILES, so I call them 'new SMILES'. I have prepared some 'Markush structure / new SMILES' pairs as training and test sets. For example:

Markush structure: 217140122-9d7cb871-746e-4a8d-a2c9-5e4527ae95c8

Its new SMILES: [R1]C1=[X1]C2=C(N1)N=[X3][X2]=C2A({[R4]m})([R5])([R6])

How many pairs do you think I should prepare for training and testing?
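Since RDKit cannot validate such strings, one option is a simple consistency check over the labels before training. The sketch below is an assumption about how that could be done and is not DECIMER's own tokenizer: a regular expression splits a 'new SMILES' string into tokens, rejects characters it does not cover, and lists the placeholder tokens such as `[R1]` or `[X2]`. The names `tokenize` and `placeholder_tokens` are hypothetical.

```python
import re

# Bracketed groups first, then two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]|Br|Cl|[A-Za-z]|\d|[=#\-+()/\\.@:{}%]")
# Placeholders of the form [R], [R1], [X], [X12], ...
PLACEHOLDER_PATTERN = re.compile(r"\[[RX]\d*\]")


def tokenize(smiles: str) -> list:
    """Split a string into tokens; fail if any character is not covered."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in: {smiles!r}")
    return tokens


def placeholder_tokens(smiles: str) -> list:
    """Return the sorted, de-duplicated R-group and X placeholders."""
    return sorted(
        {t for t in tokenize(smiles) if PLACEHOLDER_PATTERN.fullmatch(t)}
    )


# The example string quoted in this comment.
example = "[R1]C1=[X1]C2=C(N1)N=[X3][X2]=C2A({[R4]m})([R5])([R6])"
print(placeholder_tokens(example))
```

Running this over every label would show which placeholder tokens occur in the dataset and catch strings with stray characters early.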

OBrink commented 1 year ago

Dear @bdx114514,

Just out of curiosity: how are you generating these types of depictions? Do you have a way to generate them artificially? If yes, how? We are always looking for different depiction styles and ways of presenting chemical structures as potentially interesting features for future DECIMER training datasets. This type of depiction is relatively unknown to me, and I would not know how to create it.

To put things into context: the latest DECIMER Image Transformer model has been trained on >450 million pairs of images and SMILES strings. We have successfully fine-tuned a model with ~100 million additional pairs, but it is likely that you can also get good results with much smaller training datasets. We have never systematically tested how many data points are necessary to apply transfer learning here. I'd say, take the data that you have generated and try it! I would love to hear how it goes! :)

Have a nice day! Otto