Replicating the data splits

aga-relation commented 2 years ago

Hi, I am trying to replicate the data splits quoted in the paper and am having issues.

There is no seed in the data processing script, so there is no way to replicate your train/valid/test splits. As an alternative, I tried to match the files mentioned in the processing script to the files in the data repository, but no file names match and the README doesn't explain the mapping either. Could you please specify which data files were used for training, for validation, for evaluating on random sequences, and for evaluating on naturally occurring sequences?

Thank you! :)

1edv commented 2 years ago

Hi,

Thank you for your question.

The files with matching names can be accessed from the CodeOcean capsule we shared with our publication: https://codeocean.com/capsule/8020974/tree/v1 . For instance, /data/Glu/_teX.h5 corresponds to the complex media training data, and /data/Glu/_vaX.h5 corresponds to the complex media validation data.

The full data (used to generate these splits) and the high-quality random test data can also be accessed in the data repository here:

https://zenodo.org/record/4436477/files/complex_media_training_data_Glu.txt?download=1 : This is the training data used for the complex media
https://zenodo.org/record/4436477/files/defined_media_training_data_SC_Ura.txt?download=1 : This is the training data used for the defined media
https://zenodo.org/record/4436477/files/Random_testdata_complex_media.txt?download=1 : This is the random test data used for the complex media
https://zenodo.org/record/4436477/files/Random_testdata_defined_media.csv?download=1 : This is the random test data used for the defined media
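For anyone who needs a reproducible split of the full training data rather than the exact files in the capsule, a seeded shuffle is one option. The sketch below is only an illustration and is not the split used in the paper: the function name `seeded_split`, the 10% validation and test fractions, and the seed value are all assumptions.

```python
import random


def seeded_split(records, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle records with a fixed seed and cut them into train/valid/test.

    Hypothetical helper: the fractions and the seed are illustrative
    assumptions, not values taken from the original processing script.
    """
    rng = random.Random(seed)      # private generator, so global state is untouched
    shuffled = list(records)       # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)

    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    # Stand-in for the rows of a training file; real rows would be read from disk.
    rows = list(range(100))
    train, valid, test = seeded_split(rows, seed=0)
    print(len(train), len(valid), len(test))
```

Calling the function again with the same seed and the same input order gives the same three subsets, which is the property the original script lacks.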

Good luck!

aga-relation commented 2 years ago

Great, thank you! Could you please clarify which column in Random_testdata_defined_media.csv corresponds to gene expression?

1edv commented 2 years ago

The meanEL column corresponds to expression.
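A minimal sketch of pulling that column out with Python's standard `csv` module is shown below. Only the column name `meanEL` comes from this thread; the helper name `read_expression`, the `sequence` column, and the sample values are hypothetical, and the sketch assumes the file is comma-separated with a header row.

```python
import csv
import io


def read_expression(handle, column="meanEL"):
    """Return the values of one column as floats from an open CSV handle.

    Hypothetical helper: it assumes a comma-separated file with a header row.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    return [float(row[column]) for row in reader]


if __name__ == "__main__":
    # Made-up sample standing in for the real file; with the downloaded file,
    # pass open("Random_testdata_defined_media.csv", newline="") instead.
    sample = io.StringIO("sequence,meanEL\nACGT,1.5\nTTGA,2.0\n")
    print(read_expression(sample))
```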
