daviddiazvico / scikit-datasets

Scikit-learn-compatible datasets
MIT License

Fully machine-readable datasets metadata #10

Open KOLANICH opened 5 years ago

KOLANICH commented 5 years ago

I mean I should be able to just call an API to get the datasets on a topic, and their descriptions should contain enough information for domain-specific feature engineering.

And then one call, without any manual tuning, to do the automatic feature engineering and ML setup, then a call to tune hyperparameters, and a call to train a model. (I have a framework doing these things relying on a machine-readable specification of features; it is very unfinished, but for now enough for my needs.)

In other words, if I develop ML software for some task, I may want to test it and existing software against as many datasets as possible where this kind of software is applicable, in order to compare them. To do this I need as many datasets as possible with metadata about their columns. Detecting whether a model can be applied to a dataset is simple: if the dataset contains a column of the type learned by that model, the model can be applied to it.
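As a minimal sketch of that rule, assuming a hypothetical per-column metadata layout (the `columns`/`type` field names are illustrative; no such schema exists in this repo yet), the check could be a one-liner:

```python
def model_applies(dataset_meta: dict, model_target_types: set) -> bool:
    """A model applies to a dataset if the dataset declares a column
    whose type is one the model can learn."""
    return any(column.get("type") in model_target_types
               for column in dataset_meta.get("columns", []))

# E.g. a survival model applies to any dataset with a "survival" column.
meta = {"columns": [{"name": "time", "type": "survival"},
                    {"name": "age", "type": "numerical"}]}
assert model_applies(meta, {"survival"})
```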

vnmabus commented 5 years ago

Can you please provide an example of a dataset which does not have all the information that you require, and tell us what information you would add? We have tried to provide the API that sklearn.datasets follows, when it was possible, with some additions when the repository provided more information.

KOLANICH commented 5 years ago

Can you please provide an example of a dataset which does not have all the information that you require,

for example this one: https://github.com/daviddiazvico/scikit-datasets/blob/master/skdatasets/ucr.py

and tell us what information you would add

It is described in more detail in https://github.com/openml/OpenML/issues/876:

So let's have more feature types:

  • {"type":"cyclic", "period": [60, 60, 24, 7]} (period is an array, each element defines a period in counts of previous periods) - anything having some useful to the domain cyclic structure. Enables circle transform.
  • "survival" - means that survival analysis methods should be applied to the column, if it is target. Otherwise treat as a simple numerical.
  • {"type":"time", "base":0} - means that it is absolute time. Inherits from cyclic.
  • "calendartime" - enables feature engineering using calendar features like holidays, festivals, other periodical events dates like Olimpics, Football cup, annual conferences dates, TV schedules, etc...
  • "location" - enables feature engineering tied to information from maps, like big circle distances, distances to city centre, city, state, distances to POIs of different types, etc
    • "NLP" - enables NLP feature engineering, like words embeddings and LSTM encoders
    • "mysteryString" - enables automatic feature extraction from strings which are not natural languages

How exactly the features are processed is implementation-defined.
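As a concrete illustration, a set of per-column metadata records using these types might look like this (the field names are a sketch, not a settled schema):

```python
columns = [
    # seconds cycle within minutes, minutes within hours, hours within
    # days, days within weeks: period = [60, 60, 24, 7]
    {"name": "second_of_week", "type": "cyclic", "period": [60, 60, 24, 7]},
    {"name": "event_time", "type": "survival"},        # survival target
    {"name": "timestamp", "type": "time", "base": 0},  # absolute time
    {"name": "date", "type": "calendartime"},          # holidays, events
    {"name": "coords", "type": "location"},            # map-based features
    {"name": "review_text", "type": "NLP"},            # text embeddings
    {"name": "device_id", "type": "mysteryString"},    # opaque strings
]
```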

KOLANICH commented 5 years ago

Moved from #9

  1. Currently we fetch datasets from every source with custom code. It'd be better to have all of this as data, not code, and to keep that data in a separate repo.
  2. We want to have metadata about each dataset, but some datasets lack it, and some store it in a non-standardized way.
  3. We don't want to store the datasets ourselves, so we don't want to be burdened with updating them.

Datasets from some sources, such as CRAN, are not in a standardized format, so they are not eligible to be added to sklearn.

I have implemented a fetcher for RDatasets, but it doesn't solve all the problems.

So the ideas:

  1. Create a specification for the dataset metadata format. https://frictionlessdata.io/specs/data-package/ should be helpful.
  2. Create the architecture: split fetching from parsing. Implement several typical methods of retrieving column metadata, including one that guesses it from natural-language descriptions. If metadata cannot be fetched, we would store it in our repo.
  3. Create a repo and populate it with descriptions of datasets in the specified text format.
  4. Transform the metadata from the text format into records in an SQLite db for rapid search when building the package, and embed this db into the wheel (sketched below).
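A minimal sketch of idea 4, assuming the hypothetical per-column records from the earlier example (the table layout and function name are illustrative):

```python
import json
import sqlite3

def build_index(descriptors, db_path="datasets.sqlite"):
    """Flatten per-column metadata into an SQLite table for rapid search."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS columns
                    (dataset TEXT, column TEXT, type TEXT, meta TEXT)""")
    for desc in descriptors:
        for col in desc["columns"]:
            conn.execute("INSERT INTO columns VALUES (?, ?, ?, ?)",
                         (desc["name"], col["name"], col["type"],
                          json.dumps(col)))
    conn.commit()
    conn.close()

# Finding every dataset a survival model can be tested against is then
# a single query:
#   SELECT DISTINCT dataset FROM columns WHERE type = 'survival';
```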

We also need some preprocessing spec. I have started some tinkering (for now on survival datasets, since my current task is survival regression and I want to test the code I have written on them, comparing its performance with that of lifelines and sksurv) and found out that the data in the datasets is a mess: in one dataset the event column means censorship, in another it means event occurrence, in a third it is categorical, in a fourth there are no durations but start and stop times, and in a fifth censorship is encoded as negative durations - while the code assumes that one column means event and another means duration. So I had to implement some heuristic automatic fixers.
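For illustration, one such heuristic fixer might look like this (the column-name arguments and convention flag are assumptions about the kinds of mess described above, not code from this repo):

```python
import pandas as pd

def normalize_survival(df: pd.DataFrame, event_col: str, time_col: str,
                       event_means_censoring: bool = False) -> pd.DataFrame:
    """Rewrite a survival dataset into one convention: event_col == 1
    means the event occurred, time_col is a non-negative duration."""
    df = df.copy()
    if event_means_censoring:
        # Some datasets flag censorship instead of event occurrence.
        df[event_col] = 1 - df[event_col]
    negative = df[time_col] < 0
    if negative.any():
        # Some datasets encode censorship as negative durations.
        df.loc[negative, event_col] = 0
        df[time_col] = df[time_col].abs()
    return df
```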

We need a spec at least for: (1) remapping column names; (2) simple arithmetic on columns.

There are at least 2 variants: (1) use expressions in text form and parse them with ast.parse, then process the resulting AST; (2) use expressions as a nested structure of dicts representing the AST (IMHO we should use this one - what do you think?).
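A sketch of both variants (the nested-dict vocabulary is an assumption, not a settled format):

```python
import ast
import operator

# Variant 1: the expression is text; parse it with ast.parse and walk
# the resulting tree.
tree = ast.parse("duration / 365.25", mode="eval")

# Variant 2: the same expression as a nested structure of dicts, so
# consumers need no Python-specific parser.
expr = {"op": "div", "args": [{"column": "duration"}, {"const": 365.25}]}

OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

def evaluate(node, row):
    """Evaluate a variant-2 expression against one row (a dict)."""
    if "column" in node:
        return row[node["column"]]
    if "const" in node:
        return node["const"]
    left, right = (evaluate(arg, row) for arg in node["args"])
    return OPS[node["op"]](left, right)

assert evaluate(expr, {"duration": 730.5}) == 2.0
```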