Open jruhym opened 7 years ago
Semi-related to this topic, would it make sense to pickle the X and Y matrices? It takes about 2 minutes for my machine to load each one every time I start a Notebook. I would think the un-pickling would be faster than this.
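For illustration, the caching idea raised here could look roughly like the sketch below: parse the slow source file once, write a pickle next to it, and read the pickle on later runs. The function name `load_with_pickle_cache`, the `loader` argument, and the `.pkl` suffix convention are all hypothetical and not taken from the project.

```python
import pickle
from pathlib import Path


def load_with_pickle_cache(source_path, loader, cache_path=None):
    """Return loader(source_path), reusing a pickle cache when it is fresh."""
    source_path = Path(source_path)
    # Hypothetical convention: the cache file sits next to the source file.
    if cache_path is None:
        cache_path = source_path.with_name(source_path.name + ".pkl")
    cache_path = Path(cache_path)

    # Reuse the cache only if it is at least as new as the source file.
    if cache_path.exists() and cache_path.stat().st_mtime >= source_path.stat().st_mtime:
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Slow path: parse the original file, then store the result for next time.
    data = loader(source_path)
    with cache_path.open("wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

With pandas, the `loader` could be something like `lambda p: pd.read_csv(p, sep="\t", index_col=0)`, assuming the matrices are stored as TSVs with an index column. Pickles are generally tied to the Python and library versions that wrote them, so the original file should stay the source of truth.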
Semi-related to this topic, would it make sense to pickle the X and Y matrices?
@KT12 yes, great point. cognoml currently does pickle for the reading speed-up.
I think it makes sense to use Git LFS to store these pickles in the machine-learning repo. And the cancer-data repo can store the compressed TSVs using Git LFS.
I reached out to GitHub support to see if we can get some LFS capacity.
One can incorporate git-lfs via conda by adding the following lines to the environment.yml:
channels:
- defaults
- conda-forge
dependencies:
.
.
.
- git-lfs=1.5.5
This will incorporate the dependency as found here.
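After recreating the environment, one quick way to confirm that the dependency actually landed on the PATH is a small check like the one below. This is only a sketch; the helper name `git_lfs_version` is hypothetical, and it assumes the conda package installs an executable named `git-lfs`.

```python
import shutil
import subprocess


def git_lfs_version():
    """Return the output of `git-lfs version`, or None if git-lfs is not on PATH."""
    exe = shutil.which("git-lfs")
    if exe is None:
        return None
    # Ask the executable for its version string; do not raise on a non-zero exit.
    result = subprocess.run([exe, "version"], capture_output=True, text=True, check=False)
    return result.stdout.strip() or None


version = git_lfs_version()
print(version if version else "git-lfs not found on PATH")
```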
I have tested this on my machine and it works. I know that @dhimmel is waiting to hear back from GitHub, but I did attempt to test Git LFS on a fork of machine-learning. I ran into an issue: one cannot use Git LFS on a fork of a repo unless it is already used on the main project, as mentioned by technoweenie here. When I tried to push, I ended up with the following error message:
batch response: http: @jruhym can not upload new objects to public fork jruhym/machine-learning
@jruhym nice... git-lfs as part of the conda environment will add convenience. Regarding channels, I think we'll want defaults to precede conda-forge.
@dhimmel I updated my comment to address your suggestion.
Okay we now have LFS capacity on GitHub through their education program! Thanks @github for the generosity.
I will submit a pull request on cancer-data to add git-lfs. Then we can upload pickled versions here.
See https://github.com/pandas-dev/pandas/pull/13317#issuecomment-283180782
Pickled files load almost instantly but are over 1 GB uncompressed.
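If the uncompressed size turns out to be a problem for storage, one option is to gzip the pickle, trading some load time for disk space. The sketch below uses only the standard library; the helper names `dump_compressed` and `load_compressed` are hypothetical. Whether the slower load is acceptable for matrices of this size would need to be measured.

```python
import gzip
import pickle


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file at path."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object previously written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

For DataFrames, pandas versions whose `to_pickle` and `read_pickle` accept a `compression` argument can do the same thing directly.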
It might be useful to store the files currently being downloaded by 1.download.ipynb on Git's Large File Storage. That way we can eliminate 1.download.ipynb and have the data files under version control. https://git-lfs.github.com/
It needs to be investigated whether git-lfs can be incorporated via conda into the environment automatically.