HDI-Project / ATM

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).
https://hdi-project.github.io/ATM/
MIT License
525 stars, 141 forks

Does ATM handle multiple rows of the same entity? #118

Closed: RogerTangos closed this issue 5 years ago

RogerTangos commented 5 years ago

Hey @micahjsmith, etc. I apologize if this is a bit vague. I figured it'd be better to ask about this in a public forum so that the answer is well documented.

Is ATM able to handle multiple rows of the same entity? Or do samples need to be flattened into a single row?

As an example use-case, a timeseries dataset might have single entities with multiple observations.

If ATM can handle this, it seems like the entity_id would be contained in an unnamed index column, as shown in the pitchfork_genres.csv example dataset. However, none of the example datasets have multiple rows of the same entity.

micahjsmith commented 5 years ago

Every row of the dataset is a unique entity to make predictions for. But this "entity" can in essence be a compound primary key of other logical entities.

For example, it is totally possible to have a dataset like

| person_id | date       | class |
|-----------|------------|-------|
| 0         | 2018-01-01 | red   |
| 0         | 2018-01-02 | blue  |

so multiple predictions can be made for person 0.
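To make the compound-key idea concrete, here is a minimal pandas sketch (not part of ATM itself) that loads the table above and checks that `person_id` repeats while the `(person_id, date)` pair stays unique per row. The inline CSV and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Inline copy of the example table; in practice this would be a CSV file.
csv_text = """person_id,date,class
0,2018-01-01,red
0,2018-01-02,blue
"""
df = pd.read_csv(io.StringIO(csv_text))

# The same person appears on more than one row...
person_repeats = bool(df["person_id"].duplicated().any())

# ...but each (person_id, date) pair identifies exactly one row,
# so every row is still a distinct entity to predict for.
key_is_unique = not bool(df.duplicated(subset=["person_id", "date"]).any())
```

If `key_is_unique` comes out false for a real dataset, two rows describe the same entity and would need to be merged or disambiguated first.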

RogerTangos commented 5 years ago

Thank you @micahjsmith !

Am I right in thinking that if I wanted to predict a single event using a time-series, then I'd want to add all of my observations into a single row, as such?

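| person_id | date1      | data@date1 | date2      | data@date2 | ... | class |
|-----------|------------|------------|------------|------------|-----|-------|
| 0         | 2018-01-01 | 0.0        | 2018-01-01 | 0.1        | ... | red   |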

And a followup: Is there a way to exclude columns like person_id from the model? Otherwise, I should probably remove the index columns from the example datasets.

micahjsmith commented 5 years ago

> Am I right in thinking that if I wanted to predict a single event using a time-series, then I'd want to add all of my observations into a single row, as such?

Yes
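One way to do that flattening outside ATM is a pandas pivot from long to wide format. This is only a sketch under assumptions: the column names (`step`, `value`, `value_1`, ...) and the sample data are hypothetical, and it keeps only the observed values, not the dates.

```python
import pandas as pd

# Long format: one row per (person_id, step) observation. Hypothetical data.
long_df = pd.DataFrame({
    "person_id": [0, 0, 1, 1],
    "step": [1, 2, 1, 2],
    "value": [0.0, 0.1, 0.5, 0.4],
})

# One label per entity, since a single event is predicted per person.
labels = pd.DataFrame({"person_id": [0, 1], "class": ["red", "blue"]})

# Spread the observations across columns: one row per person_id.
wide = long_df.pivot(index="person_id", columns="step", values="value")
wide.columns = [f"value_{step}" for step in wide.columns]

# Attach the label so each row is one training example.
wide = wide.reset_index().merge(labels, on="person_id")
```

Note that `pivot` raises an error if a `(person_id, step)` pair occurs more than once, and entities with fewer observations end up with missing values that need handling before training.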

> Is there a way to exclude columns like person_id from the model? Otherwise, I should probably remove the index columns from the example datasets.

No, not currently. The data file is ultimately loaded via https://github.com/HDI-Project/ATM/blob/master/atm/model.py#L87. No custom options can be passed to read_csv, though it would be a nice feature to allow arbitrary read_csv kwargs in the run.yaml config file.
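Until something like that exists, a possible workaround is to strip identifier columns in a preprocessing step before handing the CSV to ATM. The sketch below uses plain pandas and is not an ATM feature; the column names and data are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data that includes an identifier column.
raw_csv = """person_id,feature_a,feature_b,class
0,0.0,1.2,red
1,0.1,0.7,blue
"""
df = pd.read_csv(io.StringIO(raw_csv))

# Drop the identifier so it cannot be used as a feature.
cleaned = df.drop(columns=["person_id"])

# index=False avoids writing the DataFrame index as an extra unnamed column.
out = io.StringIO()
cleaned.to_csv(out, index=False)
cleaned_csv = out.getvalue()
```

In practice you would write to a file instead, for example `cleaned.to_csv("train_no_ids.csv", index=False)` (file name is an assumption), and point ATM at that file.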