ML-Schema / core

📚 CORE ontology of ML-Schema and mapping to other machine learning vocabularies and ontologies (DMOP, Exposé, OntoDM, and MEX)
http://purl.org/mls

State of development? #23

Open Weissger opened 4 years ago

Weissger commented 4 years ago

After looking into different ontologies used to represent and share information on machine learning experiments, algorithms and tasks, it seems like most of them are no longer in active development. Is this also the case for ML-Schema? If development was abandoned, it would be great to get your insights into why the effort was no longer worth pursuing.

diegoesteves commented 4 years ago

Hi Weissger,

I would not say it was abandoned. It really depends on what you want to accomplish. IMO the main problem that still persists is that there is no easy way to generate the metadata, and researchers don't really have time for that (i.e., a classic problem w.r.t. metadata generation). The (old-school) solution was the adoption of workflow systems (which started with the bioinformatics folks), but the trade-off is that people lose flexibility (i.e., what about a person working just in a Jupyter notebook and/or with a specific ML library?). They are still fine if you have a use case that is fully supported by one of those systems.

I personally explored several attempts to bridge this gap (i.e., without a workflow system), e.g., 1) a library to generate the metadata more easily (like a logger), and 2) a template-based solution that implements interfaces to handle the generation. Neither is really the best solution, and I am not satisfied myself, because they still require some added engineering to achieve the main goal (i.e., automatic metadata generation).
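To make option (1) a bit more concrete, a logger-style library could boil down to something like the sketch below. The helper `log_run` and the example.org IRIs are made up for illustration, and the namespace IRI is my assumption of the published ML-Schema namespace; only the class/property names (mls:Run, mls:ModelEvaluation, etc.) come from the vocabulary itself.

```python
# Illustrative sketch only: log_run and the example.org IRIs are made up, and the
# mls namespace IRI is assumed; check it against the published ontology.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

MLS = Namespace("http://www.w3.org/ns/mls#")  # assumed ML-Schema namespace

def log_run(run_id, implementation, measure, value):
    """Emit a tiny ML-Schema description of one experiment run as Turtle."""
    g = Graph()
    g.bind("mls", MLS)
    run = URIRef(f"http://example.org/run/{run_id}")
    evaluation = URIRef(f"http://example.org/eval/{run_id}")
    g.add((run, RDF.type, MLS.Run))
    # in a fuller description these literals would be mls:Implementation and
    # mls:EvaluationMeasure resources rather than plain strings
    g.add((run, MLS.executes, Literal(implementation)))
    g.add((evaluation, RDF.type, MLS.ModelEvaluation))
    g.add((evaluation, MLS.specifiedBy, Literal(measure)))
    g.add((evaluation, MLS.hasValue, Literal(value)))
    g.add((run, MLS.hasOutput, evaluation))
    return g.serialize(format="turtle")

print(log_run("001", "sklearn.tree.DecisionTreeClassifier", "accuracy", 0.93))
```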

In the short term, what I can see as a solution is to implement these vocabularies and have them embedded into existing ML frameworks (e.g., scikit-learn). I guess J. Vanschoren did that for Exposé, or even ML-Schema? This way one can generate the metadata in a more transparent way.

In the long term, the solution is to create models that read your code and perform this task fully automatically. Let me know if you'd like to discuss further :-)

Weissger commented 4 years ago

Hi @diegoesteves,

thank you very much for the quick response! Your invitation to discuss is happily accepted. :-)

I feel similarly about metadata management and the hindrances to its adoption: time constraints for researchers, the usability of libraries and templates, the technical implications of workflow systems, et cetera. I think I might be at an early point of your journey, but with a slightly different approach.

As much as I adore platforms like OpenML, in practice the loss of flexibility and the additional overhead are a price a lot of people are not willing to pay. If we stay in the context of Python, seamlessly integrated libraries like Sacred "feel" most helpful when working with them. Unfortunately, Sacred misses a lot of standardisation and depth in its features. It can't, and doesn't want to, solve some of the problems related to metadata generation, such as semantic harmonisation or automation; it is only a tool that makes the manual process easier. In my opinion, something like MLflow, on the other hand, might provide enough benefits for people to start structuring their Python data-science experiments a bit more. Nevertheless, MLflow is also very limited in its interpretation of results and encoding of semantic meaning. Results are more often than not unstructured themselves, and the whole project is still in its infancy regarding its tool infrastructure. Other current candidates could be found in DVC or in less open projects like wandb.
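To illustrate the limitation I mean: in MLflow everything ends up as free-form key/value pairs with no shared vocabulary behind the names (a minimal sketch, run and parameter names chosen arbitrarily):

```python
# Minimal MLflow tracking sketch; names and values are arbitrary examples.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model", "RandomForestClassifier")  # just a string, not an ontology term
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.92)  # the metric name carries no semantic definition
```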

I don't know if the short-term solution of embedding implementations of these vocabularies into existing ML frameworks is feasible at a large scale. How exactly do you envision this? I'm unsure about the formats OpenML is providing, since Exposé is listed as a legacy resource and the ML-Schema example RDF export doesn't seem to adhere to the defined structure anymore.

Currently I'm working with a very small team on a toolset (which certainly also needs some work before it could be considered production-ready) to tackle the problem in a hopefully modular and unobtrusive way. I envision a Python library which just needs to be imported and which tries to track your subsequent code execution by different means. On the one hand, you want to track your results and metadata automatically, without additional effort for the researcher; on the other hand, you need to support flexible workarounds for new prototypes, special cases, etc. Sometimes tracking the experiment in the context of the semantic definitions might not be feasible at all, and the toolset shouldn't be a hindrance for the user.

As frameworks, libraries, tools and new versions are springing up like mushrooms in the domain of data science, I only see community-driven efforts as a possible solution. While this sounds nice and practical (it's like outsourcing all the work), some things need to be done to do it successfully. The usability of the approach needs to be great for the end user - that's a given. The value provided has to be high enough that people are willing to invest the engineering effort to create an open solution for the community (and/or communicate their data), and a price will have to be paid in the form of a trade-off on data quality (e.g., missing entries if something couldn't be extracted).

Our approach tries to decouple the storage logic by using mapping files, duck punching and loggers. Additionally, it aims to provide an infrastructure to gather information for a model to "perform the task fully automated", as you hinted before. We think that static code analysis will most likely not carry enough information to map custom approaches and the dynamic decisions of certain packages (e.g., auto-sklearn) into semantics reasonably well. Therefore, metadata generation often has to be decided on a case-by-case basis while running the experiment.
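As a rough illustration of the duck-punching part (not our actual toolset, just a minimal sketch against scikit-learn, with a print standing in for the logger and mapping layer):

```python
# Minimal duck-punching sketch: replace an estimator's fit at runtime to capture
# metadata transparently; print() stands in for the real logger/mapping layer.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

_original_fit = LogisticRegression.fit

def _tracked_fit(self, X, y, *args, **kwargs):
    # capture the implementation and its hyperparameters before delegating
    print("fit called:", type(self).__name__, self.get_params())
    return _original_fit(self, X, y, *args, **kwargs)

LogisticRegression.fit = _tracked_fit  # duck punch: swap the method in place

# the user's code stays unchanged, yet the call is tracked
X, y = load_iris(return_X_y=True)
LogisticRegression(max_iter=200).fit(X, y)
```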

What we are currently trying to do is identify or develop an ontology to structure our results by. Unfortunately, such an ontology most likely needs to be evolving too, or at least have community-driven parts.

joaquinvanschoren commented 4 years ago

Hi Thomas, thanks for reaching out. Indeed, ML Schema is very much alive. Exposé is also still being used, although the OWL description is in a dormant state; it's more of a living schema used within OpenML. The RDF export adheres to ML Schema.

About OpenML: we are currently working on a new schema for describing all metadata in OpenML (datasets, tasks, flows, runs), touching base with other initiatives such as Google's MLMD, MLflow, Frictionless Data, the AutoML community, etc. You are certainly welcome to join this effort. We'll probably not be able to design a new schema all together; rather, we'll release RFCs about the schemas and ask the other partners for comments. From the OpenML side, we'd like to be at least compatible with other schemas. We will stay close to ML Schema but also want to move further in certain areas. At the moment the goal is to have something that works with current systems, and then hopefully translate some of that back into an ontology or vocabulary.

About complexity: it really depends on what you mean by complexity and what your goals are :). If you want semantic meaning, shareability and reproducibility (like we do), then you probably want an opinionated metadata schema. That does put a burden on the user to understand this schema (compared to schema-less tools like Sacred and MLflow). Second, we believe that metadata collection should be part of the process of running the experiments (the 'system of record' should be tied to the system of execution). If you want reproducibility, you want the system to provide the details about the execution; you should not require the user to provide them, as that is another burden and a source of errors.

At OpenML we solve both problems through library integrations: library developers can integrate OpenML and provide the necessary metadata, then store it locally or share it globally. Hence, the burden shifts from the user to the library developer. Once the integration is done, however, it is zero work for the user. Our sklearn integration is pretty much done, which means that any sklearn-compatible tool can also easily interact with OpenML. We also have extensions for TensorFlow, Torch, MXNet, and fast.ai, which are in beta right now.
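For reference, the scikit-learn flow with the openml-python package looks roughly like the sketch below; the task ID is only an example and publishing requires an OpenML API key to be configured.

```python
# Rough sketch of the openml-python + scikit-learn integration; task 31 (credit-g)
# is only an example, and run.publish() requires openml.config.apikey to be set.
import openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    RandomForestClassifier(n_estimators=100))

task = openml.tasks.get_task(31)                # dataset, splits and evaluation measure
run = openml.runs.run_model_on_task(clf, task)  # runs the flow, metadata is collected automatically
run.publish()                                   # uploads flow, hyperparameters and predictions
```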

In the end, there are different tools for different jobs. If you just want to keep track of your own experiments, then Sacred/MLflow or hosted solutions such as Weights & Biases or CometML are fine. If you want to share experiments with others, and make them reproducible so that others can trust the results, then you probably want a more automated and opinionated tool like OpenML (or MLMD specifically for TensorFlow). We hope to build bridges between these tools where possible, e.g. exporting MLflow experiments to OpenML.

And yes, OpenML can/should become less intrusive. We're working on that as a community and any ideas are very welcome. E.g. for Python we could add decorators as syntactic sugar (but not everyone likes decorators :)). We would also love to support sharing custom code/experiments, e.g. with a 'light vocabulary' that requires the user to add only the bare minimum of info. Happy to talk if you have ideas in that respect.
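To make the decorator idea concrete, it could look something like the sketch below; the decorator name and behaviour are purely hypothetical, not an existing OpenML feature.

```python
# Hypothetical sketch: @track_experiment does not exist in OpenML; it only
# illustrates "decorators as syntactic sugar" for capturing run metadata.
import functools
import time

def track_experiment(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"name": func.__name__, "params": kwargs, "started": time.time()}
        result = func(*args, **kwargs)
        record["finished"] = time.time()
        record["result"] = result
        print("captured metadata:", record)  # in reality: map to a vocabulary and upload
        return result
    return wrapper

@track_experiment
def train(alpha=0.1):
    # the user's ordinary experiment code goes here
    return {"accuracy": 0.92}

train(alpha=0.5)
```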

Love to hear your thoughts.