Dataset to accompany manuscript

danielparton commented 9 years ago

This is the dataset: https://github.com/choderalab/ensembler-manuscripts/tree/master/dataset-for-publication The README should explain everything.

To make this available, the plan is to put the contents of that directory into a tar archive (6.2 GB) and upload to Dryad Digital Repository (http://datadryad.org/).
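The archiving step might look roughly like the sketch below. It assumes GNU tar and sha256sum are available. The helper name make_dataset_archive is hypothetical, and the checksum file is an extra that the plan above does not mention.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ensembler): pack a dataset directory into a
# gzip-compressed tar archive and write a SHA-256 checksum next to it.
make_dataset_archive() {
    local src_dir="$1"   # directory to archive
    local out_file="$2"  # archive path; should live outside src_dir

    [ -d "$src_dir" ] || { echo "not a directory: $src_dir" >&2; return 1; }

    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$out_file" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Record a checksum so that a download can be verified with `sha256sum -c`.
    sha256sum "$out_file" > "$out_file.sha256"
}
```

It would be called with the dataset directory and an archive path of your choosing, for instance `make_dataset_archive dataset-for-publication dataset.tar.gz` (the archive name here is only an illustration).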

Note that there will likely be a one-off $80 fee at the time of article acceptance. This is described in the Dryad FAQ: http://datadryad.org/pages/faq#deposit Is this ok?

And are we happy with this dataset? I think it should cover everything needed, but let me know if you can think of anything that should be added or modified.

jchodera commented 9 years ago

This is pretty awesome!

No problem regarding the $80 Dryad fee. Worth a try.

Some comments:

* The README.md title renders a bit oddly because there is so much text packed in it. Maybe just change the title to "Supplementary data" and give the citation info below?
* It would be great if the README.md had a step-by-step explanation of the contents of commands.sh. You have an explanation like this in the manuscript you can just lift and format.
* I hadn't realized GitHub renders csv files so nicely. We may not even need the txt versions of your csv files since they are already human-readable through GitHub, though I can't see any harm in leaving them in.
* Are the models in here too? The directories seem to only have models-data.csv and topology.pdb, e.g.: uscripts/tree/master/dataset-for-publication/models/ABL1_HUMAN_D0

jchodera commented 9 years ago

We should also make sure @sonyahanson, @pgrinaway, and @kyleabeauchamp take a look!

jchodera commented 9 years ago

If the ensembler/supporting-info/ directory is deprecated, you can git rm it.

danielparton commented 9 years ago

"If the ensembler/supporting-info/ directory is deprecated, you can git rm it." Done.

jchodera commented 9 years ago

For commands.sh, maybe we want to also include a little bit of code that creates a conda environment and uses the exact version of ensembler needed to generate the data in the paper?

I think that would be something like this:

conda create -c https://conda.binstar.org/omnia -p ~/anaconda/envs/ensembler python=2.7 ensembler=0.2 --yes

where the ensembler release version would replace the 0.2.
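As a rough sketch of how this could sit at the top of commands.sh, the version can be held in one variable so it is easy to swap in the real release. Everything here is an assumption apart from the channel URL and package pins taken from the command above: the variable names, the run wrapper, the dry-run default, and the use of -n instead of -p are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch only: 0.2 is the placeholder version from the thread, not the real release.
ENSEMBLER_VERSION="${ENSEMBLER_VERSION:-0.2}"
ENV_NAME="${ENV_NAME:-ensembler}"            # assumed environment name
CHANNEL="https://conda.binstar.org/omnia"    # channel URL as given in the thread

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Create a conda environment with the pinned Python and ensembler versions.
create_env() {
    run conda create -c "$CHANNEL" -n "$ENV_NAME" python=2.7 \
        "ensembler=$ENSEMBLER_VERSION" --yes
}
```

Calling `create_env` prints the command for inspection; running `DRY_RUN=0 create_env` would actually invoke conda.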

jchodera commented 9 years ago

You might also need conda activate ensembler after that.

danielparton commented 9 years ago

"* I hadn't realized GitHub renders csv files so nicely. We may not even need the txt versions of your csv files since they are already human-readable through GitHub, though I can't see any harm in leaving them in.

Are the models in here too? The directories seem to only have models-data.csv and topology.pdb, e.g.: uscripts/tree/master/dataset-for-publication/models/ABL1_HUMAN_D0"

The model XTC trajectories are about 60-80 MB for each target, totaling 6.1 GB. The max size for a GitHub repo is 1 GB, hence I did not add these to the repo.

So right now I'm thinking the main way to access the dataset would be to download a zip or tgz archive from Dryad. This is why I included the .txt table versions of the csv files in the dataset.

jchodera commented 9 years ago

"The model XTC trajectories are about 60-80 MB for each target, totaling 6.1 GB. The max size for a GitHub repo is 1 GB, hence I did not add these to the repo."

Got it. I hadn't realized that you had just omitted these---makes sense!

Note 1GB is the maximum recommended size. I think it just becomes crazy to work with after that. GitHub also doesn't like >50MB files---that might be the harder limit.
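One way to check a working tree against that per-file figure before committing is sketched below. The helper name list_large_files is hypothetical, the 50M default simply mirrors the number mentioned above, and the M suffix relies on GNU or BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print files under DIR that are larger than LIMIT.
# LIMIT uses find's -size syntax (e.g. 50M); the default mirrors the 50 MB figure.
list_large_files() {
    local dir="${1:-.}"
    local limit="${2:-50M}"
    # Ignore Git's own object store; only working-tree files are of interest.
    find "$dir" -type f ! -path '*/.git/*' -size "+$limit"
}
```

Running `list_large_files dataset-for-publication` in a checkout would list any files that exceed the default threshold.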

"So right now I'm thinking the main way to access the dataset would be to download a zip or tgz archive from Dryad. This is why I included the .txt table versions of the csv files in the dataset."

Sounds good.

danielparton commented 9 years ago

Ok, I've added an explanation of commands.sh in the README.

jchodera commented 9 years ago

Thanks! Let me make a few edits to the README.

jchodera commented 9 years ago

Actually, I'm still trying to check out the repo. It seems to have exploded in size...

jchodera commented 9 years ago

OK, I've made my edits in a PR: https://github.com/choderalab/ensembler-manuscripts/pull/39

I was mostly worried the existing text, although good, was backwards. The command was listed and then its purpose was stated afterwards. Instead, I moved the explanation to precede the command and added section subheadings for each step. Feel free to edit as appropriate!

jchodera commented 9 years ago

Here's a preview of the edited README: https://github.com/jchodera/ms-ensembler/blob/update-dataset-README/dataset-for-publication/README.md

jchodera commented 9 years ago

We may still want to add this line to the README.md and commands.sh:

conda create -c https://conda.binstar.org/omnia -p ~/anaconda/envs/ensembler python=2.7 ensembler=0.2 --yes

modified to match whatever release version you cut of ensembler to correspond with the paper.

jchodera commented 9 years ago

It will also be good to post a link to the dataset and bioRxiv manuscript on our choderalab data page when this is up on Dryad!
