blei-lab / edward

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
http://edwardlib.org

Should utilities for loading standard data sets be in Edward? #675

Closed dustinvtran closed 7 years ago

dustinvtran commented 7 years ago

I spent the past few days writing a set of functions for loading standard data sets. This includes vision (e.g., CIFAR-10, SVHN, small ImageNet), language (e.g., PTB, text8) and general scientific data (e.g., celegans brains, IAM online handwriting, UCI data).

Each function is designed to be minimalistic: it automatically downloads and extracts the data from the source if it doesn't already exist, then loads it. For example, the SVHN loader looks like:

import os

from scipy.io import loadmat


def svhn(path):
  """Load the SVHN data set, downloading it first if necessary."""
  path = os.path.expanduser(path)
  url = 'http://ufldl.stanford.edu/housenumbers/'
  train = 'train_32x32.mat'
  test = 'test_32x32.mat'
  maybe_download_and_extract(path, url + train)
  maybe_download_and_extract(path, url + test)

  loaded = loadmat(os.path.join(path, train))
  x_train = loaded['X'].transpose(3, 0, 1, 2)  # to (num_samples, 32, 32, 3)
  y_train = loaded['y'].flatten()
  y_train[y_train == 10] = 0  # SVHN encodes the digit 0 as class 10

  loaded = loadmat(os.path.join(path, test))
  x_test = loaded['X'].transpose(3, 0, 1, 2)
  y_test = loaded['y'].flatten()
  y_test[y_test == 10] = 0

  return (x_train, y_train), (x_test, y_test)
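The helper `maybe_download_and_extract` isn't shown above. A minimal sketch of such a helper, assuming it takes a target directory and a URL and extracts common archive formats, could look like this (the actual signature and behavior in Edward may differ):

```python
import os
import tarfile
import zipfile
from urllib.request import urlretrieve


def maybe_download_and_extract(path, url):
  """Download url into path if not already present, then extract it.

  A sketch of the assumed helper: plain files are left as-is; tar and
  zip archives are extracted into path.
  """
  os.makedirs(path, exist_ok=True)
  filename = url.split('/')[-1]
  filepath = os.path.join(path, filename)
  if not os.path.exists(filepath):
    urlretrieve(url, filepath)

  if tarfile.is_tarfile(filepath):
    with tarfile.open(filepath) as f:
      f.extractall(path)
  elif zipfile.is_zipfile(filepath):
    with zipfile.ZipFile(filepath) as f:
      f.extractall(path)
```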

Should these be in Edward? Please comment with your thoughts.

franrruiz commented 7 years ago

Doesn't scikit-learn do that too? Maybe you can leverage some of the scikit-learn functions within Edward, and write your own code only for the datasets that aren't handled.
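For context, scikit-learn ships small bundled data sets and remote fetchers under `sklearn.datasets`, which Edward could defer to where they overlap. A quick illustration with one of the bundled loaders:

```python
# sklearn.datasets includes small bundled data sets (load_*) and
# downloaders for larger ones (fetch_*); load_digits needs no download.
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (1797, 64): 8x8 images, flattened
print(digits.target.shape)  # (1797,): digit labels 0-9
```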

patrickeganfoley commented 7 years ago

Cool! Why not split it out into a separate library? It seems useful outside of Edward, unless the datasets are coupled with the tutorials/examples.

dustinvtran commented 7 years ago

Other comments here: https://twitter.com/dustinvtran/status/874029924150988800

Why not split it out into a separate library?

I wonder this too. I think it would be nice to have somewhere, although I don't know where. As Fran notes, scikit-learn (and Keras and TensorFlow) also have some data set loading utilities, but they're limited and usually tied to a tutorial rather than being an exhaustive resource.

dustinvtran commented 7 years ago

Update: I wrote a fairly generic generator function in the batch training tutorial. It takes a list of NumPy arrays and yields a running minibatch of each array. The code is readable and extends to more personalized setups. Combined with the newly streamlined (and experimental) TensorFlow input pipeline, it should solve most practical concerns about how to batch and preprocess data.
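A generator along these lines can be sketched as follows (the tutorial's exact code may differ); it keeps a running pointer into each array and wraps around when it reaches the end:

```python
import numpy as np


def generator(arrays, batch_size):
  """Yield minibatches of batch_size rows from each array, cycling forever."""
  starts = [0] * len(arrays)  # current position in each array
  while True:
    batches = []
    for i, array in enumerate(arrays):
      start = starts[i]
      stop = start + batch_size
      diff = stop - array.shape[0]
      if diff <= 0:
        # the batch fits entirely within the array
        batch = array[start:stop]
        starts[i] += batch_size
      else:
        # wrap around: take the tail, then the head
        batch = np.concatenate((array[start:], array[:diff]))
        starts[i] = diff
      batches.append(batch)
    yield batches
```

Because it yields plain NumPy arrays, the same generator works for feeding a `feed_dict` or any other training loop.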

To make experiments on real data easy and fast, the remaining missing piece is a comprehensive set of functions that download, extract, and load standard datasets into memory. That is all the more reason why this issue is important.

dustinvtran commented 7 years ago

Data set loading functions are in a new library: https://github.com/edwardlib/observations.