HelloJocelynLu / t5chem

Transformer-based model for chemical reactions
MIT License

Data provided through paper website is different than data being tested on Colab #16

Closed tkella47 closed 1 year ago

tkella47 commented 1 year ago

The Forward Reaction Prediction data downloaded in the Google Colab has a different format than the Forward Reaction Prediction data provided at this link: https://yzhang.hpc.nyu.edu/T5Chem/data/USPTO_MIT.tar.bz2

This is for the .source files. On the Colab, a sample line looks like this: NO.O=C1CCCc2ccccc21.CO.Cl>>

However, after downloading and unzipping the archive from the website, the data looks like this: COC(=O)Cc1cn(C)c2cc(O)ccc12.Cc1nn(-c2ccc(C(F)(F)F)cc2)cc1C(C)CO>CCCCP(CCCC)CCCC.Cc1ccccc1

The question is whether the results differ between the Reactant.Reagent> format and the Reactant>Reagent> format.

HelloJocelynLu commented 1 year ago

Hi,

The structure of USPTO_MIT.tar.bz2 is:

data/
    USPTO_MIT/
        MIT_mixed/
        MIT_separated/   << I guess the line you copied comes from this folder?

Here, "mixed" means reagents and reactants are mixed; "separated" means reagents and reactants are separated by ">". As mentioned in the paper, we did BOTH trainings.

The answer to the question "The question is whether the results are different when obtained with Reactant.Reagent> format or the Reactant>Reagent> format" is: yes, they are different. See Table 3 (Results for Forward Reaction Prediction) in the paper. The separated version has slightly higher top-k accuracy, since you explicitly give more information to the model.

When you train your model, just make sure you keep your data format consistent between training and inference. Whether to use "separated" or "mixed" depends on your needs.
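For readers who want to compare the two input styles on the same reactions, here is a minimal sketch of turning a "separated" source line into a "mixed" one. It uses plain string handling only; the function name separated_to_mixed is hypothetical and is not part of t5chem, and the sketch assumes that ">" appears in a source line only as the reactant/reagent separator or as trailing padding.

```python
def separated_to_mixed(source_line: str) -> str:
    """Join reactants and reagents with '.' so a separated line becomes a mixed one.

    Hypothetical helper for illustration; not part of the t5chem package.
    """
    # Drop surrounding whitespace and any trailing '>' characters (e.g. a final '>>').
    core = source_line.strip().rstrip(">")
    # Split on the reactant/reagent separator and skip empty fields,
    # so an already-mixed line passes through unchanged.
    fields = [field for field in core.split(">") if field]
    return ".".join(fields)


if __name__ == "__main__":
    separated = (
        "COC(=O)Cc1cn(C)c2cc(O)ccc12.Cc1nn(-c2ccc(C(F)(F)F)cc2)cc1C(C)CO"
        ">CCCCP(CCCC)CCCC.Cc1ccccc1"
    )
    print(separated_to_mixed(separated))
```

Going in the other direction (mixed to separated) is not possible with string handling alone, because a mixed line no longer records which molecules were reagents.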

Jocelyn

tkella47 commented 1 year ago

Sorry, I was unclear. The data from USPTO_500_MIT (the sample data on the Google Colab) reports extremely high performance on reaction prediction. The paper mentions a data leak as the motivation behind this dataset. I saw the results for Mixed and Separated, but USPTO_500_MIT is in a different format than either of those.

Should the dataset USPTO_500_MIT be standardized to be more in line with USPTO_MIT?

USPTO_500_MIT: NO.O=C1CCCc2ccccc21.CO.Cl>>

USPTO_MIT (Mixed): NO.O=C1CCCc2ccccc21.CO.Cl

USPTO_MIT (Separated): COC(=O)Cc1cn(C)c2cc(O)ccc12.Cc1nn(-c2ccc(C(F)(F)F)cc2)cc1C(C)CO>CCCCP(CCCC)CCCC.Cc1ccccc1

HelloJocelynLu commented 1 year ago

Hi,

We do not have USPTO_500_MIT, so I assume you are referring to USPTO_500_MT, where "MT" means multi-task. The high top-k accuracy is due to the nature of USPTO_500_MT: only 500 reaction classes are involved in the dataset, so this dataset is easier. It has nothing to do with the data format or with data leakage.

USPTO_500_MT uses the mixed inputs (no separation between reactants and reagents), so it is using the same format as USPTO_mixed. Again, "mixed" means reagents and reactants are mixed; "separated" means reagents and reactants are separated by ">". (The trailing ">>" is trivial.)

As for "Should the dataset USPTO_500_MIT be standardized to be more in line with USPTO_MIT?": USPTO_500_MT uses the same input setting as MIT_mixed, so I don't really think one is more standardized than the other.
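To make the point about the trailing ">>" concrete, the following sketch labels a source line as "mixed" or "separated" after ignoring any trailing ">" characters. The helper name input_format is hypothetical (it is not a t5chem function), and the check assumes that an interior ">" is used only to separate reactants from reagents. A separated line with no reagents at all would be labeled "mixed" by this check.

```python
def input_format(source_line: str) -> str:
    """Label a source line as 'mixed' or 'separated'.

    Hypothetical helper for illustration; not part of the t5chem package.
    """
    # Trailing '>' characters carry no information here, so ignore them.
    core = source_line.strip().rstrip(">")
    # Any remaining '>' must be the reactant/reagent separator.
    return "separated" if ">" in core else "mixed"


if __name__ == "__main__":
    samples = {
        "USPTO_500_MT": "NO.O=C1CCCc2ccccc21.CO.Cl>>",
        "MIT_mixed": "NO.O=C1CCCc2ccccc21.CO.Cl",
        "MIT_separated": (
            "COC(=O)Cc1cn(C)c2cc(O)ccc12.Cc1nn(-c2ccc(C(F)(F)F)cc2)cc1C(C)CO"
            ">CCCCP(CCCC)CCCC.Cc1ccccc1"
        ),
    }
    for name, line in samples.items():
        print(f"{name}: {input_format(line)}")
```

Running a check like this over both the training and inference files is one simple way to confirm that the two use the same input format.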

tkella47 commented 1 year ago

Apologies for the typo!

Ah, thank you for the explanation. That cleared things up. I see now that the smaller dataset uses the mixed version.

Many thanks for your help.