cognoma / machine-learning

Machine learning for Project Cognoma

Git large-file storage #82

Open jruhym opened 7 years ago

jruhym commented 7 years ago

It might be useful to store the files currently downloaded by 1.download.ipynb in Git Large File Storage (Git LFS). That way we can eliminate 1.download.ipynb and keep the data files under version control. https://git-lfs.github.com/

We still need to investigate whether git-lfs can be installed automatically as part of the conda environment.

KT12 commented 7 years ago

Semi-related to this topic, would it make sense to pickle the X and Y matrices? It takes about 2 minutes for my machine to load each one every time I start a Notebook. I would think the un-pickling would be faster than this.

dhimmel commented 7 years ago

> Semi-related to this topic, would it make sense to pickle the X and Y matrices?

@KT12 yes great point. cognoml currently does pickle for the reading speed-up.
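For anyone who wants the same speed-up in a notebook, here is a minimal sketch of the caching idea, not cognoml's actual code. The function name `load_cached` and its `parse` argument are hypothetical: `parse` stands for whatever slow loader currently reads the matrix, and the pickle is rebuilt whenever the source file is newer than the cache.

```python
import os
import pickle


def load_cached(source_path, cache_path, parse):
    """Return parse(source_path), reusing a pickle cache when it is fresh.

    `parse` is a hypothetical stand-in for the slow loader in use.
    """
    # Reuse the cache only if it exists and is at least as new as the source.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    # Slow path: parse the source file, then write the cache for next time.
    data = parse(source_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

With pandas, `parse` could be something like `lambda path: pandas.read_table(path, index_col=0)`. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.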

I think it makes sense to use Git LFS to store these pickles in the machine-learning repo. And the cancer-data repo can store the compressed TSVs using Git LFS.

I reached out to GitHub support to see if we can get some LFS capacity.

jruhym commented 7 years ago

One can incorporate git-lfs via conda by adding the following lines to environment.yml:

    channels:
      - defaults
      - conda-forge
    dependencies:
      . . .
      - git-lfs=1.5.5

This will incorporate the dependency as found here.

I have tested this on my machine and it works. I know that @dhimmel is waiting to hear back from GitHub, but I tried testing Git LFS on a fork of machine-learning anyway. I ran into an issue: one cannot use Git LFS on a fork of a repo unless it is already used on the main project, as mentioned by technoweenie here. When I tried to push, I got the following error message:

    batch response: http: @jruhym can not upload new objects to public fork jruhym/machine-learning

dhimmel commented 7 years ago

@jruhym nice... git-lfs as part of the conda environment will add convenience. Regarding channels, I think we'll want defaults to precede conda-forge.

jruhym commented 7 years ago

@dhimmel I updated my comment to address your suggestion.

dhimmel commented 7 years ago

Okay, we now have LFS capacity on GitHub through their education program! Thanks @github for the generosity.

I will submit a pull request on cancer-data to add git-lfs. Then we can upload pickled versions here.
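One thing to watch for once the pickles live in Git LFS: a clone made without git-lfs installed contains small text pointer files in place of the real data, and unpickling one of those fails with a confusing error. A minimal sketch of a guard, assuming the standard pointer format whose first line is `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` is hypothetical.

```python
# First line of every Git LFS pointer file, per the pointer file format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer."""
    with open(path, "rb") as handle:
        # Pointer files are tiny text files, so the prefix is enough to tell.
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If this returns True for a data file, running `git lfs pull` in the repository should replace the pointer with the actual content.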

dhimmel commented 7 years ago

See https://github.com/pandas-dev/pandas/pull/13317#issuecomment-283180782

Pickled files load almost instantly but are over 1 GB uncompressed.
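To get a feel for the size versus load-time trade-off, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not what the linked pandas pull request implements. The helper names `dump_pickle` and `load_pickle` are hypothetical, and gzip is just one possible codec.

```python
import gzip
import os
import pickle


def dump_pickle(obj, path, compress=False):
    """Pickle `obj` to `path`, optionally through gzip; return bytes on disk."""
    opener = gzip.open if compress else open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return os.path.getsize(path)


def load_pickle(path, compress=False):
    """Load an object written by dump_pickle with the same `compress` flag."""
    opener = gzip.open if compress else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)
```

Compression shrinks what has to be stored in Git LFS but adds decompression time on every load, so it is worth timing both variants on the real X and Y matrices before settling on one.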