jeremymanning opened this issue 7 years ago (status: Open)
This is a great idea, but at least for kaggle datasets, they are uploaded by anyone and thus the format varies quite a bit. A separate parser would have to be built for each dataset. We could write parsers for some of the most popular datasets, and then maybe the community would help add more.
I was thinking we could support datasets that were formatted in a particular set of ways (on both repositories). We could also have a `hyp.tools.load('datasets', source='kaggle')` function (where `source` could be one of `'kaggle'`, `'538'`, or `'gdrive'`) that returns a list of strings naming parsable datasets. We could either hard-code this (e.g. maintain a dictionary of dataset names and locations/formats for each source) or attempt some sort of automatic search. We may also want to be able to return a description of each dataset.
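The hard-coded option could be as simple as a per-source dictionary plus a lookup function. A minimal sketch, assuming a CSV-centric registry layout; the registry structure, function names, and the placeholder URL below are all hypothetical, not existing hypertools API:

```python
# Hypothetical registry: dataset names mapped to location/format/description.
# The URL is a made-up placeholder, not a real dataset location.
DATASET_REGISTRY = {
    'kaggle': {
        'mushrooms': {
            'url': 'https://example.com/kaggle/mushrooms.csv',  # placeholder
            'format': 'csv',
            'description': 'Mushroom features with edibility labels.',
        },
    },
    '538': {},
    'gdrive': {},
}

def list_datasets(source='kaggle'):
    """Return sorted names of parsable datasets for a given source."""
    if source not in DATASET_REGISTRY:
        raise ValueError("unknown source: %r" % source)
    return sorted(DATASET_REGISTRY[source])

def describe(name, source='kaggle'):
    """Return the stored one-line description for a dataset."""
    return DATASET_REGISTRY[source][name]['description']
```

This keeps `hyp.tools.load('datasets', ...)` trivial to implement: it just reads the dictionary for the requested source.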
To do the automatic version of this, we'd need a function that detects whether a dataset is formatted in a way the parser can handle; if so, that dataset gets added to the list of datasets that are "loadable" by `hyp.tools.load`. We could have a cron job on one of the lab computers that periodically searches through Kaggle/FiveThirtyEight and adds any new datasets to a dictionary object stored on Google Drive. Then the `hyp.tools.load('datasets', source='kaggle')` function would just have to download the list from Google Drive, rather than actually parsing anything on Kaggle.
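The detection step might look like this. A rough sketch, assuming the "supported format" means a non-empty rectangular CSV with a header row; that assumption and the function name are mine:

```python
import io

import pandas as pd

def is_parsable(raw_bytes):
    """Return True if the raw file parses as a non-empty rectangular CSV,
    i.e. something an assumed CSV-based parser could handle."""
    try:
        df = pd.read_csv(io.BytesIO(raw_bytes))
    except Exception:  # malformed or empty files fail the check
        return False
    return df.shape[0] > 0 and df.shape[1] > 0
```

The cron job would run this check over newly listed files and append any passing datasets to the dictionary on Google Drive.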
Create some sort of nice parser or interface for downloading datasets from Kaggle and FiveThirtyEight and wrangling them into a format that can be used by hypertools.
Example proposed syntax:

```python
data = hyp.tools.load('mushrooms', source='kaggle')
```
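Under the hood, the loader might fetch the raw CSV and wrangle it into something numeric. A sketch under stated assumptions: the `_fetch` hook stands in for the real download step, and the one-hot wrangling is one illustrative choice, not hypertools' actual API:

```python
import io

import pandas as pd

def load(name, source='kaggle', _fetch=None):
    """Hypothetical loader: fetch a named dataset as CSV bytes and wrangle
    it into an all-numeric DataFrame suitable for plotting/reducing."""
    raw = _fetch(name, source)  # assumed downloader (network code omitted)
    df = pd.read_csv(io.BytesIO(raw))
    return pd.get_dummies(df)  # one-hot encode text columns -> numeric

# Usage with a stand-in fetcher that returns two rows of fake CSV:
fake_fetch = lambda name, source: b"cap,color\nx,red\ny,red\n"
data = load('mushrooms', source='kaggle', _fetch=fake_fetch)
```

Separating the fetch step this way would also make the per-source parsers easy to test without hitting Kaggle or Google Drive.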