KOLANICH opened this issue 5 years ago
Can you please provide an example of a dataset which does not have all the information that you require, and tell us what information you would add? We have tried to follow the API of sklearn.datasets where possible, with some additions when the repository provided more information.
> Can you please provide an example of a dataset which does not have all the information that you require

For example, this one: https://github.com/daviddiazvico/scikit-datasets/blob/master/skdatasets/ucr.py

> and tell us what information you would add
It is described in more detail in https://github.com/openml/OpenML/issues/876:
So let's have more feature types:

- `{"type": "cyclic", "period": [60, 60, 24, 7]}` (`period` is an array; each element defines a period in counts of the previous periods) - anything having a cyclic structure that is useful to the domain. Enables the circle transform.
- `"survival"` - means that survival analysis methods should be applied to the column if it is the target; otherwise treat it as a simple numerical column.
- `{"type": "time", "base": 0}` - means that it is absolute time. Inherits from `cyclic`.
- `"calendartime"` - enables feature engineering using calendar features like holidays, festivals, and the dates of other periodic events such as the Olympics, football cups, annual conferences, TV schedules, etc.
- `"location"` - enables feature engineering tied to information from maps, like great-circle distances, distances to the city centre, city, state, distances to POIs of different types, etc.
- `"NLP"` - enables NLP feature engineering, like word embeddings and LSTM encoders.
- `"mysteryString"` - enables automatic feature extraction from strings which are not natural language.

The exact ways the features are processed are implementation-defined.
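To make this concrete, here is a minimal sketch of how such per-column annotations might look and be consumed. The `column_types` dict and the `circle_transform` helper are hypothetical illustrations, not part of any existing API:

```python
import numpy as np

# Hypothetical per-column type annotations for a dataset (column names are made up).
column_types = {
    "second_of_day": {"type": "cyclic", "period": [60, 60, 24]},  # seconds -> minutes -> hours
    "time_to_event": {"type": "survival"},
    "timestamp":     {"type": "time", "base": 0},
    "review_text":   {"type": "NLP"},
}

def circle_transform(values, period):
    """Map a cyclic feature onto the unit circle (sin/cos pair),
    so that e.g. 23:59:59 and 00:00:00 end up close to each other."""
    total = np.prod(period)  # length of one full cycle, e.g. 60 * 60 * 24 = 86400 seconds
    angle = 2 * np.pi * (np.asarray(values, dtype=float) % total) / total
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Example: encode "second of day" as two circle coordinates.
encoded = circle_transform([0, 43200, 86399], column_types["second_of_day"]["period"])
```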
Moved from #9
- currently we fetch datasets from every source with custom code. It'd be better to have all of this in data, not code, and to keep that data in a separate repo.
- we want to have metadata about the datasets, but some datasets lack it, and some store it in a non-standardized way.
- we don't want to store datasets ourselves, so we don't want to be burdened with updating them.
- datasets from some sources, such as CRAN, are not in a standardized format, so they are not eligible to be added to sklearn.

I have implemented a fetcher for RDatasets, but it doesn't solve all the problems.
So the ideas:
- create a specification for the dataset metainformation format; https://frictionlessdata.io/specs/data-package/ should be helpful (a sketch of such a record follows after this list).
- create the architecture: split fetching from parsing. Implement several typical methods of retrieving column metadata, including one that guesses it from natural-language descriptions. If metadata cannot be fetched, we would store it in our repo.
- create a repo and populate it with descriptions of datasets in the specified text format.
- transform the metadata from the text format into records in an SQLite DB for rapid search when building the package, and embed this DB into the wheel.
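A minimal sketch of what one such record and the build-time index could look like, assuming a Frictionless-Data-style package layout with an ad-hoc `x-role` extension for column roles (the field names, table layout, and file names are all illustrative, not an existing skdatasets or OpenML format):

```python
import json
import sqlite3

# Hypothetical metadata record for one dataset, loosely following the
# Frictionless Data "data package" layout.
package = {
    "name": "some-survival-dataset",
    "resources": [{
        "path": "https://example.org/some-survival-dataset.csv",
        "schema": {"fields": [
            {"name": "duration", "type": "number", "x-role": "survival"},
            {"name": "event",    "type": "integer"},
            {"name": "age",      "type": "number"},
        ]},
    }],
}

# Build a small index over such records at package build time, embed it in the wheel,
# and answer "give me all datasets having a column of type X" queries quickly.
db = sqlite3.connect("datasets.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS datasets (name TEXT PRIMARY KEY, meta TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS columns (dataset TEXT, column TEXT, role TEXT)")
db.execute("INSERT OR REPLACE INTO datasets VALUES (?, ?)",
           (package["name"], json.dumps(package)))
for field in package["resources"][0]["schema"]["fields"]:
    db.execute("INSERT INTO columns VALUES (?, ?, ?)",
               (package["name"], field["name"], field.get("x-role", field["type"])))
db.commit()

# Query: all datasets that have a survival column.
survival_datasets = [row[0] for row in
                     db.execute("SELECT DISTINCT dataset FROM columns WHERE role = 'survival'")]
```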
We also need some preprocessing spec. I mean, I have started some tinkering (for now on survival datasets, since my current task is survival regression and I want to test the code I have written on them, comparing its performance against `lifelines` and `sksurv`) and found out that the data in the datasets is a mess: in one dataset the event column means censorship, in another it means event occurrence, in a third it is categorical, in a fourth there are no durations but start and stop times, and in a fifth censorship is encoded as negative durations, while the code assumes that one column means event and another means duration. So I had to implement some heuristic automatic fixers. We need a spec at least for:

1. remapping column names
2. simple arithmetic on columns
There are at least 2 variants:

1. use expressions in text form, parse them with `ast.parse`, and then deal with the resulting tree;
2. use expressions as a nested structure of dicts representing the AST (IMHO we should use this one, what do you think?).
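A rough sketch of what the two variants might look like for a per-dataset fixer spec; the key names (`rename`, `derive`, `op`, `args`, `column`) are made up for illustration:

```python
import ast
import operator

# Variant 1: the spec stores expressions as text; we parse them ourselves with ast.parse.
# E.g. deriving a duration column from start/stop columns.
tree = ast.parse("stop - start", mode="eval")
# tree.body is a BinOp(Name('stop'), Sub(), Name('start')) that we would then walk.

# Variant 2: the spec stores the same expression as a nested dict mirroring the AST,
# so no parsing is needed and the format stays language-neutral.
spec = {
    "rename": {"futime": "duration", "death": "event"},      # 1. remapping column names
    "derive": {"duration": {"op": "sub", "args": [           # 2. simple arithmetic on columns
        {"column": "stop"}, {"column": "start"}]}},
}

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(node, row):
    """Evaluate a nested-dict expression against a mapping of column name -> value."""
    if "column" in node:
        return row[node["column"]]
    return OPS[node["op"]](*(evaluate(arg, row) for arg in node["args"]))

duration = evaluate(spec["derive"]["duration"], {"start": 3.0, "stop": 10.5})  # 7.5
```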
I mean, I should just be able to call an API to get the datasets on a topic, and their descriptions should contain enough information for domain-specific feature engineering.

And then a call, without any manual tuning, to do the automatic feature engineering and ML setup, then a call to tune hyperparameters, and a call to train a model (I have a framework doing these things, relying on a machine-readable specification of features; it is very unfinished, but for now it is enough for my needs).

In other words, if I developed ML software for some task, I may want to test it and existing software against as many datasets as possible where this kind of software is applicable, in order to compare them. To do this I need as many datasets as possible with metadata about their columns. Detecting whether a model can be applied to a dataset is simple: if the dataset contains a column of the type learned by that model, the model can be applied to it.
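A sketch of that applicability check under the assumptions above; `fetch_datasets_by_column_type`, `DatasetMeta`, and `fit_and_score` are hypothetical names, not an existing skdatasets API:

```python
from dataclasses import dataclass

@dataclass
class DatasetMeta:
    name: str
    column_types: dict  # column name -> type tag, e.g. {"time_to_event": "survival"}

def is_applicable(model_target_type: str, dataset: DatasetMeta) -> bool:
    """A model is applicable iff the dataset has a column of the type the model learns."""
    return model_target_type in dataset.column_types.values()

def benchmark(model, model_target_type: str, datasets):
    """Run the model on every dataset it is applicable to and collect scores."""
    return {d.name: model.fit_and_score(d)          # fit_and_score is a made-up method
            for d in datasets if is_applicable(model_target_type, d)}

# datasets = fetch_datasets_by_column_type("survival")   # hypothetical fetch call
# scores = benchmark(my_survival_model, "survival", datasets)
```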