choderalab / ensembler-manuscripts

Manuscript for Ensembler v1

Dataset to accompany manuscript #38

Open danielparton opened 9 years ago

danielparton commented 9 years ago

This is the dataset: https://github.com/choderalab/ensembler-manuscripts/tree/master/dataset-for-publication The README should explain everything.

To make this available, the plan is to put the contents of that directory into a tar archive (6.2 GB) and upload to Dryad Digital Repository (http://datadryad.org/).
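As a rough sketch of the archiving step described above, something like the following could be used. The helper name `make_archive` and the archive file name are assumptions for illustration, not part of the dataset.

```shell
#!/bin/sh
# Hypothetical helper: pack a directory into a gzipped tar archive.
# $1 = directory to archive, $2 = output archive path
make_archive() {
    src_dir=$1
    out=$2
    # -C switches to the parent directory so the archive stores
    # paths relative to it (e.g. dataset-for-publication/README.md)
    tar -czf "$out" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Example call (the archive name is an assumption):
# make_archive dataset-for-publication ensembler-dataset.tar.gz
# tar -tzf ensembler-dataset.tar.gz | head
```

Listing the archive with `tar -tzf` before uploading is a cheap way to confirm that the README and tables ended up inside it.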

Note that there will likely be a one-off $80 fee at the time of article acceptance, as described in the Dryad FAQ: http://datadryad.org/pages/faq#deposit Is this OK?

And are we happy with this dataset? I think it should cover everything needed, but let me know if you can think of anything that should be added or modified.

jchodera commented 9 years ago

This is pretty awesome!

No problem regarding the $80 Dryad fee. Worth a try.

Some comments:

jchodera commented 9 years ago

We should also make sure @sonyahanson, @pgrinaway, and @kyleabeauchamp take a look!

jchodera commented 9 years ago

If the ensembler/supporting-info/ directory is deprecated, you can git rm it.

danielparton commented 9 years ago

"If the ensembler/supporting-info/ directory is deprecated, you can git rm it." Done.

jchodera commented 9 years ago

For commands.sh, maybe we want to also include a little bit of code that creates a conda environment and uses the exact version of ensembler needed to generate the data in the paper?

I think that would be something like this:

conda create -c https://conda.binstar.org/omnia -p ~/anaconda/envs/ensembler python=2.7 ensembler=0.2 --yes

where the ensembler release version would replace the 0.2.
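One way this could look in commands.sh is sketched below, with the version kept in a single variable so that it is easy to update when the release is cut. The variable and function names are hypothetical, and 0.2 is only the placeholder version from the command above.

```shell
#!/bin/sh
# Placeholder version from the thread; replace with the actual release.
ENSEMBLER_VERSION=${ENSEMBLER_VERSION:-0.2}
# Environment location, following the path used in the command above.
ENV_PREFIX=${ENV_PREFIX:-$HOME/anaconda/envs/ensembler}

# Hypothetical helper: print the conda command instead of running it,
# so the pinned version can be reviewed first.
print_conda_create() {
    echo "conda create -c https://conda.binstar.org/omnia -p $ENV_PREFIX python=2.7 ensembler=$ENSEMBLER_VERSION --yes"
}

# Example: show the command for the configured version
# print_conda_create
```

Once the version looks right, the printed command can be pasted into a shell or piped to `sh`.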

jchodera commented 9 years ago

You might also need conda activate ensembler after that.

danielparton commented 9 years ago

"I hadn't realized GitHub renders csv files so nicely. We may not even need the txt versions of your csv files since they are already human-readable through GitHub, though I can't see any harm in leaving them in."

The model XTC trajectories are about 60-80 MB for each target, totaling 6.1 GB. The max size for a GitHub repo is 1 GB, hence I did not add these to the repo.

So right now I'm thinking the main way to access the dataset would be to download a zip or tgz archive from Dryad. This is why I included the .txt table versions of the csv files in the dataset.

jchodera commented 9 years ago

"The model XTC trajectories are about 60-80 MB for each target, totaling 6.1 GB. The max size for a GitHub repo is 1 GB, hence I did not add these to the repo."

Got it. I hadn't realized that you had just omitted these---makes sense!

Note 1GB is the maximum recommended size. I think it just becomes crazy to work with after that. GitHub also doesn't like >50MB files---that might be the harder limit.
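To check which files would run into a per-file limit before committing, a small sketch like the one below may help. The helper name is hypothetical, and the 50 MB default simply follows the figure mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list files above a size threshold.
# $1 = directory to scan (default: current directory)
# $2 = threshold in MiB (default: 50)
list_large_files() {
    dir=${1:-.}
    limit_mib=${2:-50}
    # find's plain -size unit is 512-byte blocks: 1 MiB = 2048 blocks
    blocks=$((limit_mib * 2048))
    # Skip the .git directory, then print regular files over the limit
    find "$dir" -name .git -prune -o -type f -size +"$blocks" -print
}

# Example: list files over 50 MiB in the dataset directory
# list_large_files dataset-for-publication 50
```

Using 512-byte blocks instead of a unit suffix keeps the `find` call portable across GNU and BSD implementations.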

"So right now I'm thinking the main way to access the dataset would be to download a zip or tgz archive from Dryad. This is why I included the .txt table versions of the csv files in the dataset."

Sounds good.

danielparton commented 9 years ago

OK, I've added an explanation of commands.sh in the README.

jchodera commented 9 years ago

Thanks! Let me make a few edits to the README.

jchodera commented 9 years ago

Actually, I'm still trying to check out the repo. It seems to have exploded in size...

jchodera commented 9 years ago

OK, I've made my edits in a PR: https://github.com/choderalab/ensembler-manuscripts/pull/39

I was mostly worried that the existing text, although good, was backwards: each command was listed first and its purpose stated afterwards. Instead, I moved the explanation to precede the command and added section subheadings for each step. Feel free to edit as appropriate!

jchodera commented 9 years ago

Here's a preview of the edited README: https://github.com/jchodera/ms-ensembler/blob/update-dataset-README/dataset-for-publication/README.md

jchodera commented 9 years ago

We may still want to add this line to the README.md and commands.sh:

conda create -c https://conda.binstar.org/omnia -p ~/anaconda/envs/ensembler python=2.7 ensembler=0.2 --yes

modified to match whatever release version you cut of ensembler to correspond with the paper.

jchodera commented 9 years ago

It will also be good to post a link to the dataset and bioRxiv manuscript on our choderalab data page when this is up on Dryad!