Materials-Data-Science-and-Informatics / MDMC-NEP-top-level-ontology

This repository collects the ongoing work towards the development of the ontology on common terms defined for the MDMC Joint Lab and NEP.
MIT License

Data Analysis Lifecycle #22

Closed rossellaaversa closed 7 months ago

rossellaaversa commented 2 years ago

In general, the steps of the Data Analysis Lifecycle can be chained in different orders, depending on the use case. It is difficult for the ontology to keep track of all possible cases. Either we keep it more general (having "Data Analysis Lifecycle steps" and "intermediate results") or we try to cover the most common cases, just to ensure that nothing is wrong. For instance:

data processing:

data analysis:

data interpretation:

This can be simplified (checking the appropriate definitions):

data processing:

data analysis:

data interpretation:

With this second case, what we have is that the Data Analysis Lifecycle has:

independently of which steps and in which order.

az-ihsan commented 2 years ago

This can be simplified (checking the appropriate definitions):

data processing:

  • input: research data
  • output: research data
  • using: data analysis software

I would say, since we already have ProcessedData as a subclass of ResearchData, we can put ProcessedData as the output of DataProcessing. But you are right, we should strictly set the input to ResearchData only, which could be ProcessedData, AnalyzedData, ReferenceData, etc.

This applies to the other DataAnalysisLifeCycle members.

How about that?
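
A minimal sketch of this proposal, in plain Python rather than OWL. The class names follow the thread; the `SUBCLASS_OF` table and the helpers `is_a` and `accepts_input` are hypothetical and only illustrate the idea that the input stays at ResearchData while the output can be the more specific ProcessedData.

```python
from typing import Optional

# Hypothetical, simplified hierarchy; class names follow the discussion above.
SUBCLASS_OF = {
    "RawData": "ResearchData",
    "ProcessedData": "ResearchData",
    "AnalyzedData": "ResearchData",
    "ReferenceData": "ResearchData",
}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """Return True if cls is ancestor or a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


# DataProcessing as proposed: any ResearchData in, ProcessedData out.
DATA_PROCESSING = {
    "input": "ResearchData",
    "output": "ProcessedData",
    "using": "DataAnalysisSoftware",
}


def accepts_input(step: dict, data_class: str) -> bool:
    """Check whether data_class is allowed as input of the given step."""
    return is_a(data_class, step["input"])


# ProcessedData is a kind of ResearchData, so it is a valid input.
print(accepts_input(DATA_PROCESSING, "ProcessedData"))
# The more specific output is still ResearchData.
print(is_a(DATA_PROCESSING["output"], "ResearchData"))
```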

EOsmenaj commented 2 years ago

Hi, just to share with you some definitions that I found really interesting:

Research Data are defined as all information collected, created, or obtained from third parties by researchers to be analyzed with the purpose of generating, verifying and validating original scientific claims, irrespective of their form or the method of data collection. They can be raw, processed to a greater or lesser extent, or analyzed, and can adopt a digital or non digital form.

Raw Data: refers to the unprocessed source data, i.e. research data before being processed or manipulated for analysis. Raw data are the original data obtained directly from an instrument, a survey, the internet, or another source.

Processed Data: these are the research data resulting from any kind of processing or manipulation of raw data – however minimal – in order to prepare them for analysis. Examples of processed data are anonymized data (which no longer contain personal details), cleaned data, annotated data, etc.

Analyzed Data: these are research data resulting from the analysis of the processed data and which can be incorporated or converted into graphs, tables or charts

Source: https://www.ugent.be/en/research/datamanagement/policies/rdm-policy.pdf

EOsmenaj commented 2 years ago

And these from another source: https://vocabularies.cessda.eu/vocabulary/LifecycleEventType?lang=en

DataProcessing Changes, additions or deletions brought to the data at any point in time after collection.

- DataProcessing.Coding

- Data processing: Classification Documenting the meaning/substance of coded data.

- Data processing: Transcriptions of interviews

- Data processing: Weighting

- Data processing: Aggregation

- Data processing: Composite measures

- Data processing: Derivation

- Data processing: Quality checks

- Data processing: Data integration

- Data processing: Disclosure limitation

- Data processing: Imputation

rossellaaversa commented 2 years ago
az-ihsan commented 2 years ago

When we define ResearchData prov:wasGeneratedBy DataAnalysisLifeCycle, there is a catch: RawData is a subclass of ResearchData, so it is inferred that RawData prov:wasGeneratedBy DataAnalysisLifeCycle. This is wrong according to our definition, which states that RawData prov:wasGeneratedBy Measurement.

I would suggest that, instead of ResearchData prov:wasGeneratedBy DataAnalysisLifeCycle, we add prov:wasGeneratedBy DataAnalysisLifeCycle to ProcessedData, AnalysedData and DataInterpretationConclusion.
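
The catch can be illustrated with a small sketch in plain Python. It is not a reasoner and simplifies the OWL semantics: it only mimics how a statement placed on a class is inherited by its subclasses. The `SUBCLASS_OF` table and the helper `inherited_generators` are hypothetical names introduced for this illustration.

```python
# Hypothetical, simplified hierarchy; only these three classes are
# assumed to be subclasses of ResearchData in this sketch.
SUBCLASS_OF = {
    "RawData": "ResearchData",
    "ProcessedData": "ResearchData",
    "AnalysedData": "ResearchData",
}


def inherited_generators(cls, statements):
    """Collect wasGeneratedBy targets stated on cls or on any of its ancestors."""
    found = set()
    while cls is not None:
        found |= statements.get(cls, set())
        cls = SUBCLASS_OF.get(cls)
    return found


# Statement placed on the general class: every subclass inherits it.
too_general = {
    "ResearchData": {"DataAnalysisLifeCycle"},
    "RawData": {"Measurement"},
}

# Suggested alternative: state it only on the classes it applies to.
suggested = {
    "RawData": {"Measurement"},
    "ProcessedData": {"DataAnalysisLifeCycle"},
    "AnalysedData": {"DataAnalysisLifeCycle"},
    "DataInterpretationConclusion": {"DataAnalysisLifeCycle"},
}

# RawData picks up DataAnalysisLifeCycle in the first variant only.
print(sorted(inherited_generators("RawData", too_general)))
print(sorted(inherited_generators("RawData", suggested)))
```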

az-ihsan commented 2 years ago

Could you formulate the ReferenceData definition from this?

rossellaaversa commented 2 years ago

Some suggestions to answer the questions by @az-ihsan on Miro:

  1. Just simply taking the third-party data that is not done in this study?
     In our context, ReferenceData is ResearchData which is not produced during the current Study and is used as a reference to compare and/or validate the output of the Study, typically during the Data Analysis Lifecycle.

  2. Do comparisons with analysed data that is made for reference data?
     Depending on the steps of the Data Analysis Lifecycle, it can be used for comparison during Data Processing, Data Analysis or Data Interpretation; we cannot say in advance. As relationships, we can use the Controlled List Values offered by DataCite:

rossellaaversa commented 2 years ago

For example: scientific publication IsSupplementedBy PublicationData and the other way around: PublicationData IsSupplementTo scientific publication
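
As a rough sketch of how such a pair of relations could be kept consistent, the two DataCite relation types named above can be stored as inverses of each other. The `INVERSE` table and the helper `with_inverse` are hypothetical and cover only this one pair.

```python
# The two DataCite relationType values mentioned above, as an inverse pair.
INVERSE = {
    "IsSupplementedBy": "IsSupplementTo",
    "IsSupplementTo": "IsSupplementedBy",
}


def with_inverse(subject, relation, obj):
    """Return the given statement together with its inverse statement."""
    return [(subject, relation, obj), (obj, INVERSE[relation], subject)]


for triple in with_inverse("scientific publication", "IsSupplementedBy", "PublicationData"):
    print(triple)
```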

rossellaaversa commented 2 years ago

Experiment: all the measurements from 1998 to 2019
Measurement: more or less one measurement per day (@mpanighel to clarify further), rather than a measurement for each material

rossellaaversa commented 2 years ago
  • Nice definition of Reference Data can be taken from the definition by @EOsmenaj
  • Nice definition of Processed Data
  • We agree on having just one definition for Data Analysis Software, as it is used also in Data Processing.
  • Software may include two types: Data Analysis Software and Data Acquisition Software. This means that we cannot simplify the name --> let's stay with Data Analysis Software, even if the name does not fit the processing
  • TODO for @rossellaaversa

Suggested definition of Data Analysis Software: software used on Research Data during each of the processes included in the Data Analysis Lifecycle (possibly including data rendering, visualisation, plotting) and yielding Research Data as output. Depending on the research context, Data Analysis Software can be used during Data Processing, Data Analysis or Data Interpretation, taking as input Raw Data, Processed Data or Analysed Data respectively, and giving as output Processed Data, Analysed Data or conclusions, respectively. If software is used to perform simulations and to generate Raw Data (computer Experiments), it is considered an Instrument and should be described as such.
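
The "respectively" mapping in this definition can be written down as a small lookup table. This is only a sketch of that one mapping: as noted at the start of the thread, real use cases may chain the steps in other orders. `STEP_IO`, `chain_output` and the name "Conclusions" are hypothetical choices made for this illustration.

```python
# Input/output classes per step, as listed in the suggested definition.
# "Conclusions" is a placeholder name chosen for this sketch.
STEP_IO = {
    "DataProcessing": ("RawData", "ProcessedData"),
    "DataAnalysis": ("ProcessedData", "AnalysedData"),
    "DataInterpretation": ("AnalysedData", "Conclusions"),
}


def chain_output(steps, start):
    """Follow a chain of steps from `start`; raise ValueError on a mismatch."""
    current = start
    for step in steps:
        expected_input, output = STEP_IO[step]
        if current != expected_input:
            raise ValueError(f"{step} expects {expected_input}, got {current}")
        current = output
    return current


print(chain_output(["DataProcessing", "DataAnalysis", "DataInterpretation"], "RawData"))
```

A chain whose steps do not match this mapping, for example starting DataAnalysis directly from RawData, raises a `ValueError` in this sketch; a more general model would relax that check.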

az-ihsan commented 2 years ago

Shall I update now according to this new definition?

rossellaaversa commented 2 years ago

Could you formulate the ReferenceData definition from this? TODO @EOsmenaj

rossellaaversa commented 2 years ago

Shall I update now according to this new definition?

I would say so, if the others agree. It is almost coherent already

az-ihsan commented 7 months ago

Implemented, issue closed.