ardigen / MAT

The official implementation of the Molecule Attention Transformer.

MIT License

Dataset Splits #1

Closed. lilleswing closed this issue 4 years ago.

lilleswing commented 4 years ago

Great work!

Can you release the splits used for the 6 folds of each dataset? In Figure 2 you mention that some splits were random and some were scaffold-based, but neither dataset section says which split was used for which dataset. The splits would be especially helpful for the Estrogen datasets, since collecting the data for a given protein target from ChEMBL is a tricky process and hard to do exactly the same way twice.

lilleswing commented 4 years ago

PS: For those looking for the MetStab dataset, I went through the original and cleaned it into a single CSV here.

metstab.txt

Mazzza commented 4 years ago

Thank you!

We used a random split for FreeSolv, ESOL, and MetStab. For all the other datasets we used a scaffold split (this information is included in Section 4.1, paragraph "Evaluation").

We will release our datasets and splits this weekend.
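For readers unfamiliar with the two split types named above, here is a minimal sketch of the difference. It is not the code used for the paper, and the released splits should be preferred for reproducing results. The function names, the toy molecules, and the `scaffold_fn` parameter are illustrative assumptions. In practice `scaffold_fn` would typically compute a Bemis-Murcko scaffold with a cheminformatics toolkit such as RDKit. Here the scaffolds are hand-assigned so the example runs with the standard library only.

```python
import random
from collections import defaultdict


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out a `test_fraction` share as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * len(items)))
    cut = len(indices) - n_test
    return sorted(indices[:cut]), sorted(indices[cut:])


def scaffold_split(items, scaffold_fn, test_fraction=0.2):
    """Group items by scaffold, then assign whole groups to train or test.

    No scaffold ever appears on both sides, so the test set contains
    only chemotypes the model has not seen during training.
    """
    groups = defaultdict(list)
    for i, item in enumerate(items):
        groups[scaffold_fn(item)].append(i)

    # Visit the largest scaffold groups first (ties broken by first index).
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = len(items) - int(round(test_fraction * len(items)))
    train, test = [], []
    for group in ordered:
        # A group goes to train only if it fits entirely; otherwise to test.
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Toy data: (SMILES, hand-assigned scaffold). Illustrative only.
toy = [
    ("Cc1ccccc1", "c1ccccc1"),
    ("Oc1ccccc1", "c1ccccc1"),
    ("Nc1ccccc1", "c1ccccc1"),
    ("Cc1ccncc1", "c1ccncc1"),
    ("Oc1ccncc1", "c1ccncc1"),
    ("OC1CCCCC1", "C1CCCCC1"),
]

s_train, s_test = scaffold_split(toy, scaffold_fn=lambda m: m[1], test_fraction=1 / 3)
r_train, r_test = random_split(toy, test_fraction=1 / 3, seed=0)

print("scaffold split:", s_train, s_test)
print("random split:  ", r_train, r_test)
```

With a scaffold split, the two pyridine derivatives land in the test set together and no pyridine appears in training. A random split gives no such guarantee, which is why scaffold splits are usually considered the harder evaluation setting.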

Mazzza commented 4 years ago

The datasets and splits are now in the data folder.