dataobservatory-eu / open-music-europe-user-stories


Comments on user story 1 #1

Open pitkant opened 1 year ago

pitkant commented 1 year ago

These are general observations and questions, not critique. I may also have misunderstood the purpose of this document, which is probably to lay out general-level requirements and expectations for technical solutions. However, I am most concerned with the specifics of technical solutions and implementation, so I have concentrated on commenting on those aspects.

With regard to the dataset package, it remains unclear to me who, and from which organisation, is going to do the actual development work. I could assume that the dataset package is a complete black box to me and that I have no responsibility for its design and implementation if developer resources come from e.g. Reprex. However, as some functionalities could concern some functions in the eurostat package, for example if any metadata is saved to an object as attribute data, I have written out my initial thoughts and opinions on the matter.

38 Until now, she would save these results into a .csv file and upload it to her research website and Zenodo manually. The problem with this approach that users who download the csv file from her website do not have a clear idea what these variables stand for, or what is the provenance of the work. Such information is available on Zenodo repository, but users who download the .csv file may forget about it.

The problem mentioned in 38 is a real problem but its importance is negligible. csv files are portable, platform agnostic files that are easily shared. Metadata could be included in one way or another and users could be asked to follow certain standards, regarding attribution, data citation etc.

40 After: Rebeca can save the cult_emp_growth_sex dataset in R into an .rds file that contains all the DataCite or Dublin Core metadata that is required on Zenodo for publication. She can export this cult_emp_growth_sex.rds file into a CSV format that meets the W3C consortium's standard on publishing CSV with machine-readable JSON metadata. She can also serialize for long-term usability the cult_emp_growth_sex it into an RDF schema that contains all the semantic information to connect this dataset to other data that use the SDMX semantic standards, such as other datasets of Eurostat, the World Bank or OECD.

I'm not certain I understand what is meant by "metadata that is required on Zenodo for publication". What are the exact requirements? How does it help if this information is baked into the .rds file? Is this a reference to line 56: using zen4r to publish datasets in Zenodo? Because if the .rds file is dragged and dropped into the Zenodo website for publishing, I doubt it can read any metadata from an .rds file.

I like the stated goal of publishing data as a CSV file with metadata as a separate JSON file, although it is unclear to me exactly which fields this metadata JSON should have. The stated goal of publishing data as an XML file using RDF schema / SDMX semantic standards also sounds feasible, although yet again the exact technical specifications remain unclear to me.
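To make the question about fields concrete, here is a minimal sketch of what such a metadata JSON might contain, assuming the W3C "CSV on the Web" (CSVW) vocabulary is what is meant. The function name `build_csvw_metadata`, the column labels and the datatypes are my own illustrative guesses and are not taken from the user story or from the dataset package.

```python
import json


def build_csvw_metadata(csv_name, columns, title, creator):
    """Return a CSVW-style metadata dict describing a single CSV file.

    `columns` is a list of (name, human-readable label, datatype) tuples.
    This helper is hypothetical; it only illustrates the JSON layout.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "dc:title": title,
        "dc:creator": creator,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": label, "datatype": datatype}
                for name, label, datatype in columns
            ]
        },
    }


# Variable names follow the user story; labels and datatypes are assumptions.
columns = [
    ("sex", "Sex", "string"),
    ("unit", "Unit of measure", "string"),
    ("geo", "Geopolitical entity", "string"),
    ("time", "Reference year", "gYear"),
    ("values", "Growth of cultural employment", "number"),
]

metadata = build_csvw_metadata(
    "cult_emp_growth_sex.csv",
    columns,
    title="cult_emp_growth_sex",
    creator="Rebeca",
)
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

By the CSVW convention the result would sit next to the data as `cult_emp_growth_sex.csv-metadata.json`. Whether the dataset package intends exactly these fields is the open question.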

Using .rds files anywhere but on your local system sounds like a bad idea to me. Rds files are binary files that may or may not be portable across different operating systems or R versions.

44 1. The eurostat package retains the provenance metadata, i.e., the descriptive metadata and the semantics of the Eurostat original cult_emp_sex. The dataset package is used by eurostat to record the metadata from the source

This seems like a good goal, although it is left unclear to me what the exact technical implementation for recording the metadata would be. If the metadata is recorded as attributes of an R object, then there should be some mechanism to transfer this metadata to the abovementioned JSON file.

46 2. The eurostat package retains the valid range and the codelist of each variable, in this case, sex, unit, geo, time, and the measured values. The dataset package is used by eurostat to record the codelists from the source

What does it mean to "record the codelists"? Does it mean data like this: https://ec.europa.eu/eurostat/cache/metadata/en/cult_emp_esms.htm or the labels of the variable codes such as sex, unit, geo...?
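To illustrate the second reading (labels of the variable codes), a codelist could be as simple as a mapping from codes to labels that also defines the valid range of a variable. This is only a sketch of that interpretation: the names `SEX_CODELIST`, `invalid_codes` and `label_codes` are hypothetical, and I am assuming the usual Eurostat labels for the sex dimension.

```python
# Assumed code-to-label mapping for the Eurostat `sex` dimension.
SEX_CODELIST = {"F": "Females", "M": "Males", "T": "Total"}


def invalid_codes(values, codelist):
    """Return, sorted, the codes in `values` that the codelist does not define."""
    return sorted(set(values) - set(codelist))


def label_codes(values, codelist):
    """Replace each code with its human-readable label."""
    return [codelist[value] for value in values]


observed = ["F", "M", "T", "F"]
print(invalid_codes(observed, SEX_CODELIST))
print(label_codes(observed, SEX_CODELIST))
```

If this is what "record the codelists" means, the same mapping would have to be written out with the data for the exported file to stay self-describing.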

48 3. Rebeca can add her own descriptive metadata, i.e., as the creator of the derived cult_emp_growth_sex dataset in a way that cult_emp_growth becomes a related item with the derivative work; Rebeca is recorded as the creator.

Where is this data saved? What is the field name in the json file?

50 4. The eurostat package is added to the related items metadata as a software agent that was used in the creation of the cult_emp_growth_sex derived dataset. The dataset package has a function that adds related items metadata.

Fine by me, software citations are very relevant bits of information, but the exact implementation of this remains a bit unclear to me. There has to be a clearly defined field for this sort of information. For example, BibTeX fields do not recognise data like "retrieved with"; they only recognise the source of the data. If we want to implement such features, we have to be sure that there exists an ecosystem that can read this metadata correctly.
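For points 3 and 4, one candidate for a "clearly defined field" would be the `creators` and `relatedIdentifiers` properties of the DataCite metadata schema, which has relation types such as `IsDerivedFrom` and `IsCompiledBy`. The sketch below assumes that schema is the target; the helper `related_item` is hypothetical, and the source dataset URL is a placeholder, not a real identifier.

```python
import json


def related_item(identifier, identifier_type, relation_type):
    """Build one DataCite-style relatedIdentifiers entry (hypothetical helper)."""
    return {
        "relatedIdentifier": identifier,
        "relatedIdentifierType": identifier_type,
        "relationType": relation_type,
    }


record = {
    "creators": [{"name": "Rebeca"}],
    "titles": [{"title": "cult_emp_growth_sex"}],
    "relatedIdentifiers": [
        # Placeholder URL standing in for the Eurostat source dataset.
        related_item("https://example.org/cult_emp_sex", "URL", "IsDerivedFrom"),
        # The software used to retrieve and process the data.
        related_item(
            "https://CRAN.R-project.org/package=eurostat", "URL", "IsCompiledBy"
        ),
    ],
}

print(json.dumps(record, indent=2))
```

Whether the dataset package writes these exact properties, and whether downstream tools read them, is what I would like to see specified.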

    1. Rebeca can export the cult_emp_growth_sex dataset into a standard CSV file with standard JSON metadata which contains the data, its semantic description, its data structure, the provenance and related items data, and her descriptive metadata as required by FAIR. The dataset package has a release function that can be used.

Could I have an example of what is meant by representing "semantic description" and "data structure" in a JSON metadata file? What does "required by FAIR" mean, who is doing the requiring? The European Commission? What exactly are the requirements?

antaldaniel commented 1 year ago

I am adding new issues when there is no answer to the questions here. Whenever the answer is there, you can write issues there or directly to the dataset repository. The dataset package has a devel branch which is failing, because I am changing the original 0.2.1 functions rather dramatically and trying to get everything that is documented in the document into 0.3.0 (still this year) or 0.4.0 by January.

Right now, I do not foresee that anybody other than Reprex will add any resources to the dataset package, but I reject that it is a black box. It went through peer review on rOpenSci, it was described at length in the Open Music Europe kick-off and subsequent documents sent to co-developers, and it has a rather detailed functional description above. It should be read together with the other Open Music Europe documents, i.e., the Data Management Plan and Pilot Program for Novel Music Industry Statistical Indicators in the Slovak Republic.

Apart from these documents, I highly recommend reading the material from the CSV on the Web Working Group, which sets the standards for how you can release data (accompanied by metadata in a JSON file) on the internet.