datapythonista / mnist

Python utilities to download and parse the MNIST dataset
BSD 3-Clause "New" or "Revised" License

connect to Intake #13

Open martindurant opened 6 years ago

martindurant commented 6 years ago

I don't know if you are aware of intake, but it is a data access and cataloguing package that aims to do a lot of what you have done here, for generic datasets rather than one specific example.

Firstly, the existing npy data source type shows how you might use intake on array data. Note that its use of open_files already allows access to data on remote file-systems (s3, gcs, http...) with optional compression, and the caching system handles download-on-first-use, again with various possible file layouts at the far end.

You would still need some of your code for the specifics of the format of the mnist data, but I believe you could make your work much smaller and more structured, and allow it to be included in other catalogues, or indeed as a conda package.
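To illustrate the format-specific part that would remain, here is a minimal sketch of reading the IDX layout used by the MNIST files from any file-like object. It assumes unsigned-byte data only and uses just the standard library. The names `parse_idx` and `make_idx` are hypothetical and are not taken from this repo or from intake; `make_idx` exists only to build a tiny in-memory file for the demo.

```python
import gzip
import io
import struct


def parse_idx(fileobj):
    """Parse an IDX stream of unsigned bytes; return (shape, data)."""
    # Header: two zero bytes, a data-type code, and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", fileobj.read(4))
    if zero != 0:
        raise ValueError("not an IDX file: bad magic number")
    if dtype_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data")
    # One big-endian 32-bit size per dimension.
    shape = struct.unpack(">" + "I" * ndim, fileobj.read(4 * ndim))
    expected = 1
    for size in shape:
        expected *= size
    data = fileobj.read(expected)
    if len(data) != expected:
        raise ValueError("truncated IDX payload")
    return shape, data


def make_idx(shape, data):
    """Build a tiny gzip-compressed IDX file in memory (demo only)."""
    header = struct.pack(">HBB", 0, 0x08, len(shape))
    header += struct.pack(">" + "I" * len(shape), *shape)
    return gzip.compress(header + bytes(data))


# Two fake 2x3 "images", decompressed on the fly while reading.
blob = make_idx((2, 2, 3), range(12))
with gzip.open(io.BytesIO(blob), "rb") as f:
    shape, data = parse_idx(f)
```

Because the parser only needs an object with a `read` method, it should not matter whether that object comes from a local path, an in-memory buffer, or a remote file-system layer.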

datapythonista commented 6 years ago

That looks interesting, had a quick look at intake before. But I'll probably leave the project of porting this to intake to someone else. I don't really use mnist myself, just released this code from some experiments with Hopfield networks, but I don't really have the time for big refactorings.

Hopefully someone else can reuse the code here to build it.

martindurant commented 6 years ago

I guess I'll put it somewhere on my list, but I don't anticipate anything immediate.

martindurant commented 6 years ago

@datapythonista, I have implemented this here: https://github.com/martindurant/mnist-data-intake

Note that this requires the latest master version of Intake (because the path munging while decompressing has a bug), so the package requirements in the conda file are not yet updated. I'll push a package up when Intake is released again.

datapythonista commented 6 years ago

Thanks for sharing, @martindurant. Do you think it makes sense to move your class MNISTImagesPlugin to an intake.py file in my repo? Not sure if many people are using my project, but I'd like to keep compatibility with how it works now. If we can have the intake plugin in the same repo, so the package can work both ways, that seems like the best approach to me.

If that sounds like a good idea, I can try to find the time to move it. Or feel free to send a PR yourself.

martindurant commented 6 years ago

Yes, I considered that, and initially started the work as a PR, but made a separate repo because the main aim of your repo is installable code to be executed by the user, and the main aim of mine is the conda package (or the catalogue file), which the user would reference via Intake and not execute directly at all.

I suppose that there is no deep principled reason that one repo couldn't do both, but you would need a conda publication route here, and I assume you wouldn't want to bother with that.