IBM / datascienceontology

Data Science Ontology
https://www.datascienceontology.org
Creative Commons Attribution 4.0 International

Type concept for dimensionality reduction? #2

Open ioana-blue opened 5 years ago

ioana-blue commented 5 years ago

Right now there is a distinction between feature-extraction and feature-extraction-model. I'm curious about the distinction, though I think I get it: one is the general method, the other is a model that reifies the method.

There is a concept for dimension-reduction-model (its name is missing "model"; I'll add that in a PR). But there is no concept for dimension-reduction, which would follow the same method-versus-reifying-model distinction as in the feature extraction example above.

Is this omission intentional? Or is it just the current state of a proof of concept, and dimension-reduction should be added?

Corresponding items in the DSO browser:
https://www.datascienceontology.org/concept/dimension-reduction-model
https://www.datascienceontology.org/concept/feature-extraction-model
https://www.datascienceontology.org/concept/feature-extraction

epatters commented 5 years ago

Good question. The omission is intentional, though the terminology is debatable and the documentation could certainly be improved.

In the ontology, the concept model is specifically a statistical model, meaning it is a function of the data (is fit to the data), as explained here:

https://www.datascienceontology.org/concept/model

The distinction, then, is between feature extraction methods that are models--need to be fit to the data--and those that are not. Examples of the former are dimension reduction models like PCA, CCA, etc. You can't fully specify the feature extraction performed by these models before seeing the data. An example of the latter is the dummy (one-hot) encoding of categorical variables. Once you know the schema of the data, the encoding is fully specified. You don't need to see any actual data.
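To make the distinction concrete, here is a minimal sketch in plain Python. The helper names (`one_hot_encoder`, `fit_centering`) and the toy schema are made up for illustration and are not part of the ontology. Mean-centering stands in for PCA only to keep the example short: like PCA, its transform cannot be written down before seeing data.

```python
from statistics import mean


def one_hot_encoder(categories):
    """Build an encoder from the schema alone (the list of allowed categories)."""
    index = {category: i for i, category in enumerate(categories)}

    def encode(value):
        row = [0] * len(categories)
        row[index[value]] = 1
        return row

    return encode


def fit_centering(column):
    """Fit a centering transform; the offset depends on the observed data."""
    mu = mean(column)

    def transform(value):
        return value - mu

    return transform


# Not a model: fully specified by the schema, no data required.
encode_color = one_hot_encoder(["red", "green", "blue"])
encoded = encode_color("green")

# A model: fitting to different data yields a different transform.
center_a = fit_centering([1.0, 2.0, 3.0])
center_b = fit_centering([10.0, 20.0, 30.0])
```

The same input value maps to different outputs under `center_a` and `center_b`, whereas `encode_color` is fixed once the category list is known.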

ioana-blue commented 5 years ago

Thanks for the explanation, I think I got it.

Subtleties like this generally concern me for the following reason: if you randomly select a data scientist, do you expect them to understand the difference between feature extraction models and feature extraction [methods] (that could be models :) )?

More questions/comments:

epatters commented 5 years ago

To your specific questions:

To the bigger question about whether these distinctions are too subtle, I think a fairly simple UI improvement to the DSO frontend would help. Most API docs list all the methods of a class on the page for that class. Likewise, on the page for a type concept, we could list some or all of the functions that take it as input. The fit function would appear on the model page, making it immediately clear what you can do with a model.
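A rough sketch of the lookup such a page would need, assuming function concepts are available as (id, input types) pairs. The `FUNCTIONS` list is hypothetical: only fit taking a model comes from this discussion, and the other entries and ids are placeholders, not actual DSO content.

```python
from collections import defaultdict

# Hypothetical function concepts: (function id, ids of its input types).
FUNCTIONS = [
    ("fit", ["model", "data"]),
    ("predict", ["predictive-model", "data"]),
    ("transform", ["transformer", "data"]),
]


def functions_by_input(functions):
    """Index function ids by each type they accept as input."""
    index = defaultdict(list)
    for name, inputs in functions:
        for type_id in inputs:
            index[type_id].append(name)
    return dict(index)


INDEX = functions_by_input(FUNCTIONS)

# What the page for the "model" type concept would list.
model_page_functions = INDEX["model"]
```

A real frontend would presumably also want functions that accept a supertype of the concept, which this sketch does not handle.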

ioana-blue commented 5 years ago

I'm looking more at the hierarchy we have in place:

feature-extraction isA transformer (which currently is root/has no parent)
feature-extraction-model isA feature-extraction, transformation-model

So it looks to me that the way things are now, feature-extraction could be either feature-extraction-method (which doesn't exist now in the DSO) or feature-extraction-model. Similarly, transformer seems to cover both models and methods.

To answer my own question above on feature-extraction, it shouldn't be renamed feature-extraction-method (since now it's set to cover both). So your answer above is a bit confusing now :)

If we add feature-extraction-method and have a distinction between methods and models, we would have:

feature-extraction-method isA feature-extraction, transformation-method (to be symmetrical to the model side we have in place for feature-extraction)
transformation-method isA transformer, method
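To sanity-check what this would imply, here is a small Python sketch that encodes the isA edges mentioned in this thread and computes all ancestors of a concept. The dictionary is only an illustration: the entries marked "proposed" do not exist in the DSO, and parents that nobody stated here (for example, those of transformation-model) are left out rather than guessed.

```python
# isA edges as stated in this thread (child -> list of parents).
IS_A = {
    "transformer": [],  # currently a root
    "feature-extraction": ["transformer"],
    "feature-extraction-model": ["feature-extraction", "transformation-model"],
    # proposed, not in the DSO:
    "feature-extraction-method": ["feature-extraction", "transformation-method"],
    "transformation-method": ["transformer", "method"],
}


def ancestors(concept, is_a):
    """Return every concept reachable from `concept` by following isA edges."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


method_side = ancestors("feature-extraction-method", IS_A)
model_side = ancestors("feature-extraction-model", IS_A)
```

Both sides reach feature-extraction and transformer, and only the proposed side reaches method.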

Thoughts?

epatters commented 5 years ago

Thanks for looking more closely. On further reflection, I don't see a reason to have "intersection" types like feature-extraction-model when the nomenclature of the derived type doesn't add anything to the original two types, because you can simply say that a type is-a feature-extraction and is-a model. (We support "multiple inheritance.")

There is perhaps value in creating a root-level type method and asserting that model is-a method. Then a method would mean something like "a well-defined procedure or algorithmic technique that operates on data". Not sure whether that level of abstraction is helpful or whether the nomenclature is ideal.
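As a loose analogy only (Python classes are not how the ontology is represented), multiple inheritance makes an intersection type unnecessary. The class names below are illustrative, and whether feature-extraction itself should sit under method is left open here.

```python
class Method:
    """Root-level type: a well-defined procedure that operates on data."""


class Model(Method):
    """A method that must be fit to data before it is fully specified."""

    def fit(self, data):
        raise NotImplementedError


class FeatureExtraction:
    """Anything that extracts features, whether or not it is fit to data."""


class PCA(FeatureExtraction, Model):
    """Is-a feature-extraction and is-a model; no intersection type needed."""


class OneHotEncoding(FeatureExtraction):
    """Is-a feature-extraction but not a model: the schema fully specifies it."""


pca_is_model = issubclass(PCA, Model)
one_hot_is_model = issubclass(OneHotEncoding, Model)
```

Here `PCA` is recognized as both a `FeatureExtraction` and a `Model` (and so a `Method`) without a separate class for the intersection.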

ioana-blue commented 5 years ago

ok, now I can follow :) and I agree. so open issue(s) to fix the current state?

can we also start to collect some "principles of DSO design" in an issue? what you said above is one such principle IMO. let's keep track why we do what we do (make it principled). once we have enough of them, we'll add them to docs.

epatters commented 5 years ago

I'll make a PR.

For design principles, maybe start with a wiki page? Or do you think an issue is better so we can discuss?

ioana-blue commented 5 years ago

either is fine. I like issues for collecting/discussing stuff, wikis feel more permanent (on the other hand, there is a Greek saying along the lines of "there is nothing more permanent than a temporary solution" :P). whatever is easier to transform into docs later on (wiki? :) ).

epatters commented 5 years ago

I'll start with an issue. These "principles" (dare I call them that?) are hardly set in stone at this point.