Kohulan / Smiles-TO-iUpac-Translator

Transformer based SMILES to IUPAC Translator
MIT License

Ammonia #11

Closed (ptrab closed this issue 1 year ago)

ptrab commented 2 years ago

Hi,

Today, when I tried to generate the SMILES string for 'ammonia', I got '[NH2+]' back, which is certainly wrong.

>>> STOUT.translate_reverse('ammonia') '[NH2+]'

When I tried to convert 'Ammonia', I got back a mess of weird strings.

>>> STOUT.translate_reverse('Ammonia') '[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr'

I also tried the systematic name. >>> STOUT.translate_reverse('azane') 'N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.'

>>> STOUT.translate_reverse('Azane') '[15NH3]'

I'm not sure if this is intended and I guess the error is on my side, but could you please have a look? :)

In the other direction, it works well:

>>> STOUT.translate_forward('N') 'azane'

>>> STOUT.translate_forward('[NH2+]') 'azanium'

Thank you, Philipp
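The runaway outputs reported above ('[Pr].[Pr].[Pr]...' and 'N.N.N...') share one pattern: a single dot-separated fragment repeated many times. A minimal sketch of a sanity check that flags such results before they are used further might look like this. The function name `looks_degenerate` and the threshold `max_repeats` are hypothetical and not part of STOUT; the check is a heuristic only, and a legitimate SMILES string with many identical counter-ions could in principle trip it.

```python
from collections import Counter


def looks_degenerate(smiles: str, max_repeats: int = 8) -> bool:
    """Heuristically flag a SMILES string that is empty or dominated by one repeated fragment.

    Hypothetical helper, not part of STOUT. `max_repeats` is an arbitrary
    threshold chosen for illustration.
    """
    # Disconnected components in SMILES are separated by dots.
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        # An empty result is treated as unusable as well.
        return True
    # Count how often the most frequent fragment occurs.
    _, count = Counter(fragments).most_common(1)[0]
    return count > max_repeats


if __name__ == "__main__":
    print(looks_degenerate("[NH2+]"))
    print(looks_degenerate("N." * 50))
```

A check like this does not make the translation correct (it would not catch '[NH2+]' for 'ammonia'), it only filters out the obviously looping outputs.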

Kohulan commented 2 years ago

Hi @ptrab,

Thanks a lot for the bug report.

STOUT was primarily trained on PubChem IUPAC names, so it sometimes makes mistakes like these. Also, STOUT does not handle capitalized input, because it was trained only on lowercase names. That is why you got such odd results.

We are looking into improving STOUT further by training on more examples. As we stated in our paper, we highly recommend using rule-based methods to translate SMILES to IUPAC names. You could also try OPSIN, which can translate IUPAC names to SMILES.

Kind regards, Kohulan
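Given that capitalized input is the trigger here, one workaround on the caller's side is to lowercase the name before passing it to the model. The sketch below assumes nothing about STOUT beyond "a function that takes a name and returns a string"; the helper names `normalize_iupac_name` and `translate_reverse_lowercased` are hypothetical. Note that blanket lowercasing also affects capital locants and descriptors (for example the "N" in "N,N-dimethyl..." or "(R)"), and whether that is acceptable for the model is an open assumption.

```python
from typing import Callable


def normalize_iupac_name(name: str) -> str:
    """Strip surrounding whitespace and lowercase the whole name.

    Hypothetical helper. Lowercasing everything is an assumption based on
    the statement that the model was trained on lowercase names only.
    """
    return name.strip().lower()


def translate_reverse_lowercased(name: str, translate: Callable[[str], str]) -> str:
    """Call a name-to-SMILES function with a normalized name.

    `translate` is any callable mapping a name to a SMILES string,
    for example STOUT.translate_reverse from the report above.
    """
    return translate(normalize_iupac_name(name))


if __name__ == "__main__":
    # Stand-in translator for demonstration; it simply echoes its input.
    print(translate_reverse_lowercased("  Ammonia ", lambda s: s))
```

With STOUT imported as in the report, the call would be `translate_reverse_lowercased('Ammonia', STOUT.translate_reverse)`. This only avoids the capitalization problem; it does not fix the wrong result for lowercase 'ammonia'.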

ptrab commented 2 years ago

To better understand the underlying machinery: would it help to train the model on names with randomly mixed upper- and lower-case characters, so that it no longer depends on lower-case input and becomes more robust?

I think I have seen image-processing papers in which GANs were trained with artificial noise added to the input images, so that the model copes with "real world" images and not only with perfect synthetic ones.
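The augmentation idea can be sketched as follows: for every (name, SMILES) training pair, keep the original and add a few copies in which letters of the name are randomly upper-cased, while the SMILES target stays unchanged. This is only an illustration of the proposal, not how STOUT is trained; the function names, the probability `p_upper`, and the number of copies are all hypothetical.

```python
import random
from typing import List, Tuple


def random_case(name: str, rng: random.Random, p_upper: float = 0.3) -> str:
    """Upper-case each alphabetic character independently with probability p_upper."""
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < p_upper else ch
        for ch in name
    )


def augment_pairs(
    pairs: List[Tuple[str, str]],
    rng: random.Random,
    copies: int = 2,
) -> List[Tuple[str, str]]:
    """Return the original pairs plus `copies` case-perturbed variants of each name.

    Only the name side is perturbed. SMILES is case-sensitive
    (for example 'c' versus 'C'), so the target is left untouched.
    """
    augmented = []
    for name, smiles in pairs:
        augmented.append((name, smiles))
        for _ in range(copies):
            augmented.append((random_case(name, rng), smiles))
    return augmented


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for name, smiles in augment_pairs([("azane", "N"), ("azanium", "[NH4+]")], rng):
        print(name, smiles)
```

Whether this actually improves robustness would have to be measured; a cheaper alternative is to normalize the case of the input at inference time.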