pitkant opened this issue 1 year ago
I am adding new issues when there is no answer to the questions here; whenever the answer is there, you can open issues there or directly in the dataset repository. The dataset package has a devel branch which is failing, because I am changing the original 0.2.1 functions rather dramatically and am trying to get everything documented in this document into 0.3.0 (still this year) or 0.4.0 by January.
Right now, I do not foresee that anybody other than Reprex will add resources to the dataset package, but I reject the idea that it is a black box. It went through peer review at rOpenSci, it was described at length in the Open Music Europe kick-off and in subsequent documents sent to co-developers, and it has a rather detailed functional description above. It should be read together with the other Open Music Europe documents, i.e., the Data Management Plan and the Pilot Program for Novel Music Industry Statistical Indicators in the Slovak Republic.
Apart from these documents, I highly recommend reading the CSV on the Web Working Group recommendations, which set the standard for how data (accompanied by metadata in a JSON file) can be released on the internet.
These are general observations and questions, not criticism. I may also have misunderstood the purpose of this document, which is probably to lay out general-level requirements and expectations of the technical solutions. However, I am most concerned with the specifics of the technical solutions and their implementation, so I have concentrated my comments on those aspects.
With regard to the dataset package, it remains unclear to me who, and from which organisation, is going to do the actual development work. If developer resources come from e.g. Reprex, I could treat the dataset package as a complete black box and take no responsibility for its design and implementation. However, as some of its functionality could touch functions in the eurostat package, for example if metadata is saved to an object as attribute data, I have written out my initial thoughts and opinions on the matter.
The problem mentioned in line 38 is real, but its importance is negligible. CSV files are portable, platform-agnostic files that are easily shared. Metadata could be included in one way or another, and users could be asked to follow certain standards regarding attribution, data citation, etc.
I'm not certain I understand what is meant by "metadata that is required on Zenodo for publication". What are the exact requirements? How does it help if this information is baked into the .rds file? Is this a reference to line 56, using zen4r to publish datasets in Zenodo? Because if the .rds file is dragged and dropped onto the Zenodo website for publishing, I doubt Zenodo can read any metadata from an .rds file.
I like the stated goal of publishing data as a CSV file with the metadata as a separate JSON file, although it is unclear to me exactly which fields this metadata JSON should have. The stated goal of publishing data as an XML file using the RDF schema / SDMX semantic standards also sounds feasible, although again the exact technical specifications remain unclear to me.
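As a concrete illustration of the CSV-plus-JSON route, here is a minimal sketch; the metadata field names are my assumption, not something specified in this document:

```r
library(jsonlite)

# Illustrative data frame standing in for a real indicator table
df <- data.frame(geo = c("SK", "AT"), year = c(2021, 2021), value = c(1.2, 3.4))

# The data itself as a plain, portable CSV file
write.csv(df, "dataset.csv", row.names = FALSE)

# A metadata sidecar file; the field names below are only an example,
# the exact vocabulary (Dublin Core, DataCite, CSVW, ...) still needs to be agreed on
metadata <- list(
  title       = "Example cultural employment indicator",
  creator     = "Example Author",
  publisher   = "Example Organisation",
  license     = "CC-BY-4.0",
  description = "Illustrative data frame exported from R"
)
write_json(metadata, "dataset-metadata.json", auto_unbox = TRUE, pretty = TRUE)
```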
Using .rds files anywhere other than on your local system sounds like a bad idea to me. .rds files are binary files that may or may not be portable across different operating systems or R versions.
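To make the version concern concrete: the serialisation format itself is versioned, so an .rds file written with current defaults is not readable on old R installations unless the version is pinned. A small sketch:

```r
x <- data.frame(a = 1:3)

# The default since R 3.6.0 is serialisation version 3,
# which R versions older than 3.5.0 cannot read at all
saveRDS(x, "x.rds")

# Pinning version = 2 keeps the file readable by older R installations,
# but it is still an R-only binary format that other tools cannot parse
saveRDS(x, "x_compat.rds", version = 2)
```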
This seems like a good goal, although it remains unclear to me what the exact technical implementation for recording the metadata would be. If the metadata is recorded as attributes of an R object, then there should be some mechanism to transfer this metadata to the above-mentioned JSON file.
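If attributes are indeed the chosen mechanism, the transfer could in principle be quite simple, along the lines of the sketch below; the attribute names are invented for illustration and are not an API of the dataset package:

```r
library(jsonlite)

df <- data.frame(geo = c("SK", "AT"), value = c(1.2, 3.4))

# Attach metadata as attributes of the R object
# (the attribute names here are purely illustrative)
attr(df, "dataset_title")   <- "Example indicator"
attr(df, "dataset_creator") <- "Example Author"

# Extract those attributes and serialise them into the JSON sidecar
meta <- attributes(df)[c("dataset_title", "dataset_creator")]
write_json(meta, "dataset-metadata.json", auto_unbox = TRUE, pretty = TRUE)
```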
What does it mean to "record the codelists"? Does it mean data like this: https://ec.europa.eu/eurostat/cache/metadata/en/cult_emp_esms.htm, or the labels of the variable codes such as sex, unit, geo, ...? (See the sketch below for what I have in mind.)
Where is this data saved? What is the field name in the JSON file?
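To make the codelist question concrete: in the eurostat package, code-to-label mappings for dimensions such as sex, unit or geo can be retrieved roughly as follows. This is only a sketch of my second interpretation; whether these dictionaries are what "codelists" refers to here is my assumption, and the dataset id is used purely as an example.

```r
library(eurostat)

# Dictionary (codelist) for a single dimension: codes and their human-readable labels
sex_dic <- get_eurostat_dic("sex")
head(sex_dic)

# Alternatively, download a dataset and replace the codes with labels
dat <- get_eurostat("cult_emp_sex")
dat_labelled <- label_eurostat(dat)
```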
Fine by me, software citations are very relevant bits of information, but the exact implementation of this remains a bit unclear to me. There has to be a clearly defined field for this sort of information. For example, BibTeX fields do not recognise data like "retrieved with"; they only recognise the source of the data. If we want to implement such features, we have to be sure that there exists an ecosystem that can read this metadata correctly.
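For instance, with base R's bibentry() the "retrieved with" information has no dedicated BibTeX field and can only be pushed into the free-text note field. This is a sketch of the limitation rather than a proposal, and all field values are illustrative:

```r
# A data citation built with base R; BibTeX has no field for "retrieved with",
# so the retrieval software can only go into the free-text "note" field
ref <- bibentry(
  bibtype = "Misc",
  title   = "Cultural employment (cult_emp)",
  author  = person("Eurostat"),
  year    = "2023",
  url     = "https://ec.europa.eu/eurostat/web/products-datasets/-/cult_emp",
  note    = "Retrieved with the eurostat R package"
)
toBibtex(ref)
```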
Could I have an example of what is meant by representing a "semantic description" and "data structure" in a JSON metadata file? What does "required by FAIR" mean, and who is doing the requiring? The European Commission? What exactly are the requirements?
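For example, is the intention something along the lines of a CSVW-style table description, where the JSON carries both the structural schema and semantic properties? A hedged sketch of what I imagine; the field values are invented, and only the @context and tableSchema keywords come from the CSVW recommendation:

```r
library(jsonlite)

# A CSVW-flavoured sidecar: "tableSchema" describes the data structure,
# while common properties such as "dc:title" would carry the semantic description
csvw <- list(
  "@context"  = "http://www.w3.org/ns/csvw",
  url         = "dataset.csv",
  "dc:title"  = "Example indicator",
  tableSchema = list(
    columns = list(
      list(name = "geo",   datatype = "string"),
      list(name = "year",  datatype = "integer"),
      list(name = "value", datatype = "number")
    )
  )
)
write_json(csvw, "dataset-metadata.json", auto_unbox = TRUE, pretty = TRUE)
```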