Doesn't scikit-learn do that too? Maybe you can leverage some of the scikit-learn functions within Edward, and write your own code only for those datasets that aren't handled.
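For context, a small illustrative sketch of the kind of scikit-learn utilities this refers to (the dataset choices here are arbitrary examples):

```python
from sklearn.datasets import load_iris, fetch_20newsgroups

# Small benchmark datasets ship with scikit-learn itself.
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# Larger ones are downloaded on first use and cached
# (by default under ~/scikit_learn_data).
newsgroups = fetch_20newsgroups(subset="train")
print(len(newsgroups.data))  # number of training documents
```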
Cool! Why not split it out into a separate library? Seems useful outside of Edward, unless the datasets are coupled with the tutorials/examples.
Other comments here: https://twitter.com/dustinvtran/status/874029924150988800
> Why not split it out into a separate library?
I wonder this too. I think it would be nice to have somewhere, although I don't know where. As Fran notes, scikit-learn (and Keras and TensorFlow) also have some dataset loading utilities, but they're limited and usually tied to a tutorial rather than being an exhaustive resource.
Update: I wrote a fairly generic generator function in the batch training tutorial. It takes a list of NumPy arrays and yields a running minibatch of each array. The code is readable and extends to more personalized setups. Combined with the newly streamlined (and experimental) TensorFlow input pipeline, it should solve most practical concerns about how to batch and preprocess data.
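For reference, here's a minimal sketch of such a generator (not necessarily the tutorial's exact code): it slices each array along its first axis and wraps around at the end, so all arrays stay in lockstep.

```python
import numpy as np

def generator(arrays, batch_size):
    """Yield a running minibatch from each array, cycling indefinitely."""
    starts = [0] * len(arrays)  # current read position in each array
    while True:
        batches = []
        for i, array in enumerate(arrays):
            start = starts[i]
            stop = start + batch_size
            diff = stop - array.shape[0]
            if diff <= 0:
                batch = array[start:stop]
                starts[i] += batch_size
            else:
                # Reached the end; wrap around to the beginning.
                batch = np.concatenate((array[start:], array[:diff]))
                starts[i] = diff
            batches.append(batch)
        yield batches

# Usage: feed minibatches of features and labels together.
x_train = np.random.randn(100, 5)
y_train = np.random.randn(100)
batches = generator([x_train, y_train], 32)
x_batch, y_batch = next(batches)  # shapes (32, 5) and (32,)
```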
To make experiments on real data easy and fast, the remaining utility is a comprehensive set of functions that download, extract, and load standard datasets into memory. This is all the more reason why this issue is important.
Data set loading functions are in a new library: https://github.com/edwardlib/observations.
I spent the past few days writing a set of functions for loading standard datasets. This includes vision (e.g., CIFAR-10, SVHN, small ImageNet), language (e.g., PTB, text8), and general scientific data (e.g., C. elegans brains, IAM online handwriting, UCI data).
Each function is designed to be minimalistic: it automatically downloads and extracts the data from the source if it doesn't already exist locally, then loads it into memory. For example, loading SVHN looks like the sketch below.
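(A sketch of the intended usage, modeled on the observations README; the exact return structure of `svhn` is an assumption based on the library's other image loaders.)

```python
from observations import svhn

# First call downloads and extracts SVHN into ~/data;
# subsequent calls load the cached files from disk.
# Return structure assumed to match loaders like mnist.
(x_train, y_train), (x_test, y_test) = svhn("~/data")
```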
Should these be in Edward? Please comment with your thoughts.