ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

A git clone that is more than ~5MB probably has some data files that could best be hosted, e.g., in S3? #84

Closed · hughperkins closed 7 years ago

hughperkins commented 7 years ago

A git clone that is more than ~5MB probably has some data files that could best be hosted, e.g., in S3? Or perhaps in Google Drive/Dropbox.

andrewheusser commented 7 years ago

Thanks, yes, we intended to move the data in examples/sample_data to a separate repo, but ran into some hiccups, so we decided to leave it in for now and move it in the next minor release. The plan is to set up a separate repo for the 3 sample datasets, and then create a function to load the data.
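
A minimal sketch of what such a loader might look like, assuming the datasets end up hosted at stable URLs (the `DATA_URLS` mapping, the URLs, and the cache location below are all hypothetical placeholders, not the actual implementation):

```python
import os
import pickle
import urllib.request

# Hypothetical mapping from dataset name to hosted location; the real
# URLs would point at wherever the sample data is ultimately hosted.
DATA_URLS = {
    'weights': 'https://example.com/hypertools-data/weights.pkl',
    'spiral': 'https://example.com/hypertools-data/spiral.pkl',
    'mushrooms': 'https://example.com/hypertools-data/mushrooms.pkl',
}

def load(dataset, cache_dir=os.path.expanduser('~/.hypertools_data')):
    """Download (and cache) a sample dataset, then unpickle it."""
    if dataset not in DATA_URLS:
        raise ValueError('unknown dataset: %s' % dataset)
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, dataset + '.pkl')
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(DATA_URLS[dataset], local_path)
    with open(local_path, 'rb') as f:
        return pickle.load(f)
```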

hughperkins commented 7 years ago

> The plan is to set up a separate repo for the 3 sample datasets

Ok. Using GitHub for binary data might not be the most efficient approach, but it will work. I think using Google Drive is fairly standard; e.g., see https://github.com/zhangxiangxiao/Crepe , in the section "Components".

jeremymanning commented 7 years ago

Is this related to the python 2 vs 3 issue? http://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3
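
For reference, the usual workaround from that thread: pickles written under Python 2 can generally be read in Python 3 by passing an explicit encoding (a minimal sketch; the filename here is a placeholder):

```python
import pickle

# Pickles created under Python 2 often fail to load in Python 3 with a
# UnicodeDecodeError; encoding='latin1' maps the raw bytes through losslessly.
with open('weights.pkl', 'rb') as f:
    data = pickle.load(f, encoding='latin1')
```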

jeremymanning commented 7 years ago

note: by "this" i mean the build error on the google-drive branch

jeremymanning commented 7 years ago

are we ready to pull google-drive into main and close this issue?

andrewheusser commented 7 years ago

for speed purposes, I'm thinking I'll separate the 'weights' example data into one file with 2 group averages and one with just the first few subjects. As of now, it loads the full dataset, which can take ~10s. If I split it up, loading will be much quicker
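
A rough sketch of how those two smaller files might be produced from the full dataset (the variable names, file names, and the even two-group split are assumptions for illustration, not the actual preprocessing script):

```python
import pickle
import numpy as np

# Load the full weights dataset: assumed here to be a list of
# per-subject arrays of the same shape, split into two groups.
with open('weights.pkl', 'rb') as f:
    weights = pickle.load(f)

n = len(weights) // 2
group_averages = [
    np.mean(np.stack(weights[:n]), axis=0),   # group 1 average
    np.mean(np.stack(weights[n:]), axis=0),   # group 2 average
]
sample_subjects = weights[:3]  # just the first few subjects

with open('weights_avg.pkl', 'wb') as f:
    pickle.dump(group_averages, f)
with open('weights_sample.pkl', 'wb') as f:
    pickle.dump(sample_subjects, f)
```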

jeremymanning commented 7 years ago

i'm ok either way, as long as we don't break existing code (and/or we should update the tests in this repo and the code in the examples repo accordingly).

i guess i'd lean slightly towards keeping the weights.mat file as is, even if it takes 10 seconds to download (it's a big file, after all). but i'll defer to what you think is best @andrewheusser...

andrewheusser commented 7 years ago

i think we can keep access to the full weights.mat file available, but also add the preprocessed data files for the examples... otherwise it makes it seem as though the software is slow, but it's really just the file loading

jeremymanning commented 7 years ago

sounds good-- adding those additional files without removing the old one will keep backwards compatibility but will allow the examples to run more quickly.

jeremymanning commented 7 years ago

one more thing i thought of: we should make sure the API specification on readthedocs is recompiled to include the new hyp.tools.load function.
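
Once the docs rebuild, usage would presumably look something like this (a sketch assuming the function takes a dataset name, and that 'weights' is one of the bundled datasets, per the discussion above):

```python
import hypertools as hyp

# Load one of the bundled sample datasets by name, then plot it.
weights = hyp.tools.load('weights')
hyp.plot(weights)
```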

jeremymanning commented 7 years ago

Shall we merge the pull request and close this issue?

andrewheusser commented 7 years ago

there is a Sphinx-related bug in the readthedocs build that I'm trying to resolve. once that is done, I'll merge google-drive into master

jeremymanning commented 7 years ago

Got it. Sounds good...

andrewheusser commented 7 years ago

this is good to go