Kohulan / DECIMER-Image_Transformer

DECIMER Image Transformer is a deep-learning-based tool designed for automated recognition of chemical structure images. Leveraging transformer architectures, the model converts chemical images into SMILES strings, enabling the digitization of chemical data from scanned documents, literature, and patents.
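A typical usage pattern for the packaged model looks roughly like this (a minimal sketch; see the repository README for the authoritative, current API):

```python
from DECIMER import predict_SMILES

# Path to a scanned or rendered image of a chemical structure (illustrative).
image_path = "path/to/structure_image.png"

# The model returns the predicted SMILES string for the depicted molecule.
smiles = predict_SMILES(image_path)
print(smiles)
```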

training from scratch #29

Closed: tulay closed this issue 2 years ago

tulay commented 2 years ago

Hi,

Thank you for sharing the pre-trained model; it works great. I am trying to train your model with my own data and have several questions. I am assuming the code base corresponds to the article, but feel free to correct me.

I first had to modify the distribution strategy so I could train on GPUs only (no TPU); a rough sketch of that change is below. I generated molecule images using the CDK as outlined in the paper and created TFRecords from them.
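For reference, the change was roughly along these lines (a minimal sketch with illustrative names, not the exact code from the training script):

```python
import tensorflow as tf

# Fall back from the TPU strategy used in the original script to whatever
# GPUs are visible on this machine.
if tf.config.list_physical_devices("GPU"):
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.get_strategy()  # default (CPU) strategy

print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Build the optimizer and the encoder/decoder models here,
    # exactly as in the original training code.
    ...
```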

I initially used SELFIES as captions and trained a model, then packaged it using the code on GitHub, but for some reason it was insensitive to the input image and produced the same output for all inputs. I poked around the deployed model, and the tokenizer seemed to be a SMILES tokenizer rather than a SELFIES one (a sketch of the check I did is below).
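The check I did was roughly this (paths and attribute names are illustrative, and I am assuming the packaged tokenizer is a pickled Keras Tokenizer; please correct me if that is wrong):

```python
import pickle

# Load the tokenizer shipped with the packaged model and inspect its vocabulary.
with open("tokenizer.pkl", "rb") as f:  # illustrative path
    tokenizer = pickle.load(f)

# For a Keras Tokenizer, word_index maps tokens to integer ids.
print(list(tokenizer.word_index)[:20])

# A SELFIES vocabulary is dominated by bracketed tokens such as "[C]" or
# "[Branch1]", whereas a SMILES tokenizer mostly contains atoms and single
# characters like "c", "1", "(" and ")".
```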

I switched to SMILES captions to try that, and the issue was obviously still the same, so I suspected something with the image encoder. I looked at the automl code included in the repo; it seemed to have some issues, so I switched to the recent automl EfficientNetV2-B3. Training now shows the loss going down and accuracy on the training data increasing (I haven't removed the commented-out validation tracking yet, but I will), and the model is no longer insensitive to the input images.

Another issue I noticed: preprocessing for EfficientNet should map the images to [-1, 1], but the preprocessing used in the decode_image method was doing something else, eventually calling keras_applications.imagenet_utils with torch mode (a sketch contrasting the two conventions is below).

I also noticed that the code actually trains the EfficientNet architecture from scratch rather than using a pre-trained model purely as a feature extractor, and that the current code takes an earlier layer as the image feature, not the layer pointed out in the article.
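For clarity, this is the contrast I mean (a minimal sketch; the function names are mine, not the ones used in decode_image):

```python
import tensorflow as tf
from tensorflow.keras.applications import imagenet_utils


def rescale_to_minus1_1(image):
    """EfficientNet(V2)-style preprocessing: scale uint8 pixels to [-1, 1]."""
    return tf.cast(image, tf.float32) / 127.5 - 1.0


def torch_mode_preprocess(image):
    """What imagenet_utils does in 'torch' mode: scale to [0, 1], then
    normalise each channel with the ImageNet mean and standard deviation."""
    return imagenet_utils.preprocess_input(tf.cast(image, tf.float32), mode="torch")
```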

TL;DR: Is the code in the repo used for training the deployed model? And was the image encoder (EfficientNet) actually trained from scratch with the molecule images, as opposed to being used only as a feature extractor from a pre-trained model?

Kohulan commented 2 years ago

Hello @tulay,

The article explains DECIMER V1.0; the current main branch of the repository holds the code for DECIMER V2.0. If you want to work with DECIMER V1.0, please switch to that branch.

> is the code in the repo used for training the deployed model?

> was the image encoder (EfficientNet) actually trained from scratch with the molecule images, as opposed to being used only as a feature extractor from a pre-trained model?

So if you want to train DECIMER V1, the code is here: https://github.com/Kohulan/DECIMER-Image_Transformer/tree/DECIMER_V1.0

OBrink commented 2 years ago

@tulay, we are currently working on a publication about the recent version of DECIMER, but we try to keep the repository as up to date as possible even when there is no publication out yet. Sorry for the confusion!

In addition to what Kohulan has already stated, I would like to add a remark about the generation of training data. I would strongly recommend not using only the CDK for training data generation. We have seen a big boost in performance after putting extra effort into generating diverse structure depictions. For this purpose we developed the openly available package RanDepict (Repository, Publication). It uses the CDK, RDKit, Indigo and PIKAChU to generate diverse structure depictions by pseudo-randomly scrambling all available parameters whenever a structure is depicted. This way, the training data better represents the diversity of structure depictions found in the literature, and the model does not overfit to the CDK's depiction style. There is even a script available for the direct generation of TFRecord files. A short sketch of basic RanDepict usage follows.
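To give an idea of how little code this needs, usage looks roughly like this (please check the current RanDepict README for the exact API, as it may have changed since this was written):

```python
from RanDepict import RandomDepictor

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine, as an example

# Every call picks a depiction toolkit and pseudo-random style parameters,
# so repeated calls on the same SMILES yield visually different depictions.
with RandomDepictor() as depictor:
    image = depictor(smiles)  # numpy array containing the depicted structure
```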

I hope that this helps! Have a nice day! Otto

tulay commented 2 years ago

Thanks @Kohulan for the clarification.

@OBrink I agree with your comment regarding the benefit of mixed-style training data. My main goal in training the model was to handle R variables, which the model downloaded with the current main branch doesn't handle. I am happy to test a new model when you release one. Any idea when that would happen?

I wasn't aware of RanDepict, thank you for pointing it out. It looks like that code can generate a good variety of images. I will post my observations regarding R variables on that project.

Kohulan commented 2 years ago

@tulay We are already working on a model that detects R variables, and it will be available soon. We cannot give a definite timeline, but keep an eye on the DECIMER repository. RanDepict can generate images with R groups embedded.

tulay commented 2 years ago

Thank you!