ML-Schema / core

📚 CORE ontology of ML-Schema and mapping to other machine learning vocabularies and ontologies (DMOP, Exposé, OntoDM, and MEX)
http://purl.org/mls

Data Preparation Tasks #27

Open siebert-julien opened 1 year ago

siebert-julien commented 1 year ago

Dear all,

My name is Julien and I am a researcher working for the Fraunhofer Institute for Experimental Software Engineering (IESE) in Kaiserslautern, Germany. I am quite new to the topic of ontologies, so please excuse me if I ask naive questions.

I am interested in ontologies representing data preparation aspects. The underlying context is how preparation tasks influence the quality of the prediction and how to reason about it. One can think of missing values, outliers, collinear features, imbalanced features, etc. as data characteristics that can have an impact on the prediction.
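To make these data characteristics concrete, here is a minimal sketch of how a few of them could be measured with the Python standard library. The function names, the 1.5 × IQR outlier rule, and the sample values are illustrative assumptions, not part of ML-Schema or any existing ontology.

```python
from collections import Counter
from statistics import quantiles

def missing_fraction(values):
    """Share of entries that are None (assumed here to mark a missing value)."""
    return sum(v is None for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a common convention."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def imbalance_ratio(labels):
    """Count of the most frequent category divided by the least frequent one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Hypothetical sample data, for illustration only.
column_with_gaps = [1.0, None, 3.0, None]
numeric_column = [1, 2, 3, 4, 5, 6, 7, 8, 100]
categorical_column = ["a"] * 9 + ["b"]

characteristics = {
    "missing_fraction": missing_fraction(column_with_gaps),
    "outliers": iqr_outliers(numeric_column),
    "imbalance_ratio": imbalance_ratio(categorical_column),
}
```

A dictionary like `characteristics` is the kind of information an ontology would need to attach to a dataset in order to reason about which preparation tasks are relevant.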

I recently started with the state of the art (reading published papers); I haven't yet looked much into the state of the practice (e.g., getting my hands dirty with some libraries).

My first impression is that existing ontologies seem to be more focused on the prediction part of data analysis pipelines. Is that correct, or am I missing something?

joaquinvanschoren commented 1 year ago

In MLSchema, an 'implementation' can be any complex preprocessing pipeline, but I think that you are right that most ontologies don't express exactly which preprocessing happens in the pipeline.

There are certainly ways to do that, e.g.:
https://docs.datadrivendiscovery.org/devel/write_pipeline.html
https://onnx.ai/sklearn-onnx/auto_tutorial/plot_abegin_convert_pipeline.html
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/pipelines
https://www.tensorflow.org/tfx/tutorials/tfx/template

Every tool basically uses what works for them, usually based on a DAG. I'm not aware of significant standardization efforts in this area. I would be very interested if you found any :).
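As a rough illustration of the DAG idea, the sketch below describes a pipeline as a mapping from each step to the steps it depends on, using Python's standard `graphlib` module. The step names and helper functions are hypothetical and do not follow the format of any of the tools linked above.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
pipeline = {
    "impute_missing": {"load_data"},
    "remove_outliers": {"impute_missing"},
    "scale_features": {"remove_outliers"},
    "encode_categories": {"impute_missing"},
    "train_model": {"scale_features", "encode_categories"},
}

def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())

def upstream_steps(dag, final_step):
    """Collect every step the final step depends on, directly or indirectly."""
    seen = set()
    stack = [final_step]
    while stack:
        step = stack.pop()
        for dep in dag.get(step, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

order = execution_order(pipeline)
preparation = upstream_steps(pipeline, "train_model")
```

Here `preparation` holds every step that runs before training, which is exactly the part of the pipeline that most ontologies leave undescribed.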

siebert-julien commented 1 year ago

@joaquinvanschoren Thank you for your answer. I am now involved in an EU project proposal and have looked further into the state of the art, but I have not seen anything in the direction of standardization either. I'll keep looking ;)