amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
468 stars 116 forks

General purpose data loaders #133

Open etrain opened 9 years ago

etrain commented 9 years ago

We have included a number of data loaders tailored to standard academic datasets with KeystoneML, but it would be good to include general purpose WAV and image loaders in the project as well.

In particular, much of the work we did with ImageNet involved working around bugs in Java image libraries and some of that work can be repurposed.

tomerk commented 9 years ago

I think there are a few common patterns we've seen so far:

tomerk commented 9 years ago

It's probably best either to encourage storing data in a certain way and pick a single, faster pattern to support, or to somehow allow mixing and matching among these. That said, I have found that sc.wholeTextFiles seems to be slower than sc.textFile, especially when reading from S3, where there was a multiple-order-of-magnitude difference for the newsgroups data, which is only a few megabytes.
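The difference in record shape between the two Spark read APIs helps explain the gap: sc.textFile yields one record per line across all files, while sc.wholeTextFiles yields one (path, contents) record per file, so per-file overhead dominates when the files are tiny. A sketch of the two shapes, simulated with plain collections (the file names and contents are made up; no SparkContext is needed):

```scala
// Hedged sketch: the record shapes of Spark's two text read APIs,
// simulated with an in-memory map instead of a real SparkContext.
object ReadShapes {
  val files = Map(
    "s3://bucket/newsgroups/a.txt" -> "line1\nline2",
    "s3://bucket/newsgroups/b.txt" -> "line3"
  )

  // textFile-like: records are individual lines; file boundaries are lost
  def asTextFile: Seq[String] = files.values.flatMap(_.split("\n")).toSeq

  // wholeTextFiles-like: records are (path, fullContents) pairs,
  // so every tiny file is a separate record with its own read overhead
  def asWholeTextFiles: Seq[(String, String)] = files.toSeq
}
```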

agibsonccc commented 9 years ago

This looks like a great start. I took the following approach: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark/src/main/java/org/deeplearning4j/spark/util/MLLibUtil.java#L112

I assumed a directory structure like the one mentioned above. In deep learning I typically see both images and text organized in directories; most unstructured data takes some form of hierarchical storage layout. I'm assuming you could use that to your advantage.
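For a hierarchical layout where the class label is the parent directory name (a hypothetical layout for illustration, not any particular library's convention), label extraction reduces to something like:

```scala
import java.nio.file.Paths

// Hedged sketch: derive a label from a path in a layout like
//   /data/images/<label>/<file>
// (hypothetical layout and function name, not an actual API).
object DirLabel {
  def labelFromPath(path: String): String =
    Paths.get(path).getParent.getFileName.toString
}
```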

The problem you're going to run into (for desirable patterns) is time-series data. For example, when working with video encoders (a big part of the problems I typically solve), there are several ways you can vectorize an image or audio file. It's usually desirable to have the data in frames.

I'm not sure how far you'd take this, but it would be great to see it done right (and in a more integrated fashion).

I personally have to target more platforms than servers (phones are a big one for us), but I'd be happy to share lessons learned or contribute in some way.

etrain commented 9 years ago

In the ImageLoaderUtils class we have a function that takes a filename and produces a label; this is dataset-specific (e.g. VOC and ImageNet have different labelsMap functions).
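A dataset-specific labelsMap can be modeled as a plain `String => Int` function. The names and the synset table below are illustrative, not the actual ImageLoaderUtils signatures; this assumes an ImageNet-style scheme where the synset ID is a filename prefix:

```scala
// Hedged sketch of a dataset-specific labelsMap (illustrative names
// and toy synset table, not the real ImageLoaderUtils API).
object LabelMaps {
  private val synsetToLabel = Map("n01440764" -> 0, "n01443537" -> 1)

  // Maps a filename like "n01440764_10026.JPEG" to an integer class
  // label by looking up the synset prefix before the underscore.
  val imagenetLabelsMap: String => Int =
    fileName => synsetToLabel(fileName.takeWhile(_ != '_'))
}
```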

Right now this is built for reading from .tar files with hierarchical layouts embedded in them, but I think we could generalize this to layouts on HDFS. One thing we want to discourage, however, is storing lots of tiny files on HDFS, because they really hurt HDFS performance, so the current pattern (one tar file per class, or any other sensible way to get a relatively small number of big files) should be encouraged.

Re: time series data/performance - this is probably a separate issue, but we've talked a lot about support for hypercubes as a first-class data structure, both as a local data structure and (eventually) a distributed one. Image is an instantiation of this, but the APIs could be much richer.
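A minimal local hypercube (dense n-dimensional array with row-major indexing) could look like the sketch below; a 2-d instance behaves like an image/matrix, a 3-d one like a frame stack. This is an illustration of the idea, not a proposed KeystoneML API:

```scala
// Hedged sketch: a dense n-dimensional array with row-major strides.
// Illustrative only; not a proposed KeystoneML interface.
class Hypercube(shape: Int*) {
  private val dims = shape.toArray
  // scanRight over the dims gives the row-major stride for each axis
  private val strides = dims.scanRight(1)(_ * _).tail
  private val data = new Array[Double](dims.product)

  private def offset(idx: Seq[Int]): Int =
    idx.zip(strides).map { case (i, s) => i * s }.sum

  def get(idx: Int*): Double = data(offset(idx))
  def set(value: Double, idx: Int*): Unit = data(offset(idx)) = value
}
```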

shivaram commented 9 years ago

cc @thisisdhaas @sjyk who are also interested in general purpose data loaders for data that comes from SampleClean