@proccaserra had started to put some notes together and will talk about this next week (June 12th)
@mkoatwork, as discussed today during our squad2 call, here is the first part of my assessment of the dataset you shared with the squad. This initial feedback only covers the way the dataset was made available to us and how this could be improved. As there are issues common to other use cases, I am deriving a recipe for the cookbook (heads up to @susannasansone, @sgtp )
1. Feedback about the dataset distribution itself (provenance, integrity, license, author)
[x] List of identified problems:
Absence of a resolvable identifier for the dataset itself. Suggestion: the Zenodo API offers the possibility of reserving a DOI (see the sketch below).
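For illustration, a minimal sketch of reserving a DOI through the Zenodo deposition API; the access token, title, and ORCID are placeholders, and the metadata fields follow the Zenodo deposition documentation:

```python
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
ACCESS_TOKEN = "REPLACE_WITH_PERSONAL_TOKEN"  # placeholder personal access token

# Create an empty deposition and ask Zenodo to pre-reserve a DOI;
# creators are identified by ORCID, as suggested further down this list.
metadata = {
    "metadata": {
        "title": "ND4BB pilot dataset (KNIME workflows and outputs)",  # illustrative title
        "upload_type": "dataset",
        "description": "Dataset shared with FAIRplus Squad 2 for FAIRification.",
        "creators": [{"name": "Kohler, Manfred", "orcid": "0000-0000-0000-0000"}],  # placeholder ORCID
        "prereserve_doi": True,
    }
}

r = requests.post(ZENODO_API, params={"access_token": ACCESS_TOKEN}, json=metadata)
r.raise_for_status()
deposition = r.json()
print("Reserved DOI:", deposition["metadata"]["prereserve_doi"]["doi"])
```

The reserved DOI can then be cited in the README and manifest before the record is actually published.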
Absence of a README file or documentation describing the content of the dataset archive. Suggestion: create a manifest file or distribution metadata file using formats such as DATS (DCAT profile) or DataCite (via Zenodo); a sketch follows below. Consequence: individual file introspection is needed (content inference based on file extension/MIME type), revealing:
- workflow files
- a spreadsheet
- a Word document
- a txt file
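As a sketch only, a hand-rolled distribution metadata file could look like the following; a real manifest would use the DATS or DCAT vocabularies rather than these ad-hoc keys, and the file names and ORCID listed here are illustrative:

```python
import json

# Illustrative, hand-rolled distribution metadata for the archive;
# field names are ad-hoc, not a validated DATS/DCAT document.
manifest = {
    "title": "ND4BB pilot dataset archive",
    "description": "KNIME workflows, input spreadsheets and result files shared with FAIRplus Squad 2.",
    "license": "TO BE DECIDED (see the licensing point below)",
    "contact": {"name": "Manfred Kohler", "orcid": "0000-0000-0000-0000"},  # placeholder ORCID
    "files": [
        {"name": "workflow.knwf", "format": "KNIME workflow (zipped archive of XML files)"},  # illustrative entry
        {"name": "results.xlsx", "format": "Excel spreadsheet"},  # illustrative entry
    ],
}

with open("MANIFEST.json", "w", encoding="utf-8") as fh:
    json.dump(manifest, fh, indent=2)
```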
Absence of open standard formats or a declared list of standards used for organising the information. Suggestion: save spreadsheets as CSV files with UTF-8 encoding. A file with the KNIME extension turns out to be a zipped archive of a set of XML files. Suggestion: convert to an open format (e.g. CWL or WDL) whenever possible.
Absence of checksums to ascertain the integrity of the archive and of its individual files. Suggestion: run MD5 or SHA checksumming (see the sketch below).
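A minimal checksumming sketch, assuming the archive has been unpacked into a `dataset_archive/` directory (the directory name is a placeholder):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Checksum every file in the unpacked archive; the same function can be
# applied to the archive file itself before distribution.
for f in sorted(Path("dataset_archive").rglob("*")):
    if f.is_file():
        print(sha256sum(f), f)
```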
Absence of licensing terms / terms of use. Suggestion: select a licence from a list of open licences (such as the lists offered by GitHub or Zenodo), or include the legal documents from the ND4BB consortium about conditions of use, ideally as a resolvable URI to the document.
Absence of contact information (implicit, we know it comes from Manfred). Suggestion: use ORCID to identify authors and contributors (e.g. see the Zenodo API sketch above).
The next section will be dedicated to the content of the dataset itself, which is what is of most interest to Manfred.
@mkoatwork, moving on to the next point of feedback:
2. Documenting the Use Case:
@mkoatwork, now about the content and purpose of the archive itself as shared with FAIRplus Squad2:
3. Feedback about the Workflows, their input, their output.
3.1. Re-enacting the workflows is not possible (out of the box) because of software change. The KNIME workflow engine has been updated to version 3.7; an email exchange with Manfred confirmed he ran an earlier version (KNIME 3.5.3), which led to hit-and-miss execution. Suggestion: provide a container (e.g. Docker) allowing execution in conditions similar to those used by the author. Alternative suggestions:
- if a container is not possible, document the environment, versions, etc. precisely (a sketch follows below)
- upload to a KNIME Server and share with collaborators (if granular permission/group sharing is possible)
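If a container is not an option, here is a minimal sketch of recording the execution environment next to the workflow; the KNIME version and workflow name are taken from this thread, not auto-detected, and would need to be filled in by the author:

```python
import json
import platform
from datetime import datetime, timezone

# Record the environment the workflow was executed in, so others can try to
# reproduce it; values marked as placeholders must be supplied by the author.
environment = {
    "recorded": datetime.now(timezone.utc).isoformat(),
    "os": platform.platform(),
    "python": platform.python_version(),
    "knime_version": "3.5.3",          # version reported by the author
    "workflow": "ExtractPilotDataSets",  # illustrative workflow name
}

with open("ENVIRONMENT.json", "w", encoding="utf-8") as fh:
    json.dump(environment, fh, indent=2)
```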
3.2. Re-enacting the workflows is not possible because of references to local files in the workflow:
e.g. "WARN Create File Name 0:42 Selected directory 'C:\Users\manfred.kohler\Documents\temp\IMI2_FAIRplus_S\ExperimentalData\ExtractPilotDataSets\AMR_DB' cannot be accessed!"
Suggestion: input to the workflow should be a permanent, resolvable URI (see the sketch below).
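A minimal sketch of fetching the workflow input from a permanent URI instead of a hard-coded local path; the URL and file name are placeholders for wherever the input would be deposited (e.g. a Zenodo record):

```python
import urllib.request
from pathlib import Path

# Placeholder URI: in practice this would be the persistent, resolvable
# identifier of the deposited input file (e.g. a Zenodo file URL).
INPUT_URI = "https://example.org/record/123456/files/AMR_DB_input.xlsx"

target = Path("inputs") / "AMR_DB_input.xlsx"
target.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(INPUT_URI, str(target))
print("Fetched workflow input to", target)
```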
3.3. Comments about the output/result files:
The Excel output lacks the semantic annotation/markup that would allow some level of accessibility to software agents. Worksheet fields are free text, devoid of semantic anchoring:
The goal of the workflows seems to be to retrieve ontology term identifiers when matches are found on strings corresponding to entities such as 'Chemical Class' (e.g. beta-lactamase inhibitors), 'Chemical Compound' (e.g. Cefepime), or 'Chemical Property' (e.g. asphericity).
Suggestion: instead of searching Zooma or the NCBO Annotator over all ontologies, restrict the search space to selected domain-specific resources (e.g. CHEBI, BAO, CHMO, CHEMINF...); see the sketch after these suggestions.
Suggestion: retrieve the persistent, resolvable identifier associated with the ontology term label, not just the term label (as is currently the case in the workflow output, if I am not mistaken).
Suggestion: provide a term description and identifier for each of the field headers added to the worksheet by the workflow execution.
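To illustrate the three suggestions above, a minimal sketch of a Zooma query restricted to a few domain-specific ontologies, keeping the persistent term IRIs (semantic tags) rather than only the labels; the property value and ontology filter are illustrative, and the response fields follow my reading of the Zooma v2 API:

```python
import requests

ZOOMA = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"

def annotate(value, ontologies=("chebi", "bao", "chmo", "cheminf")):
    """Query Zooma for a free-text value, restricted to selected ontologies,
    and return (label, IRI, confidence) tuples instead of bare labels."""
    params = {
        "propertyValue": value,
        # Search only the listed ontologies, no curated datasources.
        "filter": "required:[none],ontologies:[{}]".format(",".join(ontologies)),
    }
    r = requests.get(ZOOMA, params=params)
    r.raise_for_status()
    results = []
    for annotation in r.json():
        label = annotation["annotatedProperty"]["propertyValue"]
        for iri in annotation["semanticTags"]:  # persistent, resolvable term IRIs
            results.append((label, iri, annotation.get("confidence")))
    return results

print(annotate("Cefepime"))
```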
(ping @sgtp @SusannaSansone @oyadenizbeyan @mcourtot )
@proccaserra Thanks Philippe for your comments. There are a lot of topics here which will hopefully initiate a discussion (which was my original intention when I provided this dataset).
Absence of resolvable identifier for the dataset itself. As I mentioned in my "recipe", I would suggest adding a DOI (for me a DOI is similar to a persistent identifier, PID) in one of the last steps before making the data available to the public. Otherwise we would need to apply versioning to the dataset to make users aware of 'datasets in progress'. For me, 'R' should be solved before 'I', 'A' and 'F', as it makes no sense to have a dataset findable but not accessible or re-usable. A better name would be RIAF instead of FAIR ;-)
Absence of README file That's a good suggestion. This should be an early step in a recipe. I'll add a README file asap (see also below).
Absence of open standard formats or declared list of standards Normally I would agree to save as CSV files instead of Excel format, but most of our customers like to have the result files as MS Office files, as they frequently have problems with uploading data depending on the language settings of their PCs. That's the reason why we have to work with Excel files ;-) I'm not really happy about this.
Absence of checksum to ascertain archive integrity and integrity of individual files Again a good suggestion for the recipe. My proposal is to add this information to the README file (see above and below)
Absence of licensing terms / terms of use Although I mentioned a license in my "recipe", I'm now uncertain why we need a license at all. For hundreds of years (at least since Newton) it was sufficient to cite, e.g., the data of a publication in order to use the data, or even to criticise the data as insufficient or incorrect. Why do we need a license today? Is there a legal requirement? What is the rationale behind using a license? By the way, I don't think that CC is a good choice, as some of the datasets are created using a predefined protocol, and that is not very 'creative' in my understanding of CC.
Documenting the Use Case:
- What is being sought? Simply making the data FAIR.
- Why are you building these workflows? To find appropriate ontologies and link the metadata to the data provided.
- What are the resources you wish to 'integrate' with? Do I need to define the resources in advance? I thought making data FAIR should be independent of the use case?
Feedback about the Workflows, their input, their output.
3.1. Re-enacting the workflows is not possible (out of the box) because of software change. Good suggestion! Should go into the recipe. I will add this to the README file. Suggestion: we should think of, e.g., a tab-separated file with the following information for simple machine readability: File \t Extension/MIME type \t MD5 checksum \t Software used to generate file \t Software version (a sketch follows below).
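A minimal sketch of generating such a tab-separated file; the directory name, the extension-to-software mapping, and the version strings are placeholders:

```python
import csv
import hashlib
import mimetypes
from pathlib import Path

def md5sum(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder mapping from file extension to the software that produced the file.
GENERATED_BY = {".knwf": ("KNIME", "3.5.3"), ".xlsx": ("KNIME", "3.5.3")}

with open("FILES.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["File", "Extension/MIME type", "MD5 checksum", "Software", "Software version"])
    for f in sorted(Path("dataset_archive").rglob("*")):  # placeholder directory
        if f.is_file():
            mime = mimetypes.guess_type(f.name)[0] or f.suffix
            software, version = GENERATED_BY.get(f.suffix, ("unknown", "unknown"))
            writer.writerow([f.name, mime, md5sum(f), software, version])
```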
3.2 Re-enacting the workflows is not possible because of references to local files in the workflow That is why I added the Create File Name node, to set a different local directory. All files used in the workflow are attached. At this point in the FAIRification of the dataset, a permanent, resolvable URI would result in versioning (see above).
3.3 Comments about the output/result files: suggestion: instead of searching Zooma or NCBO Annotator over all ontologies, restrict the search space by selected domain specific resources (e.g. CHEBI, BAO, CHMO, CHEMINF...) This would require special domain expertise in ontologies. My suggestion was to create a workflow using these services to find the most appropriate ontology. My vision is to make FAIRification a process which could even be initiated by lab personnel. If we restrict the annotation of datasets to experts only, we will not see a lot of datasets being FAIRified, due to the limited number of experts.
suggestion: retrieve the persistent resolvable identifier associated with the ontology term label, not just the term label (as currently the case in the workflow output if not mistaken) Of course, for the final annotation of the original terms in the dataset I would suggest adding the URI instead of the term, together with the version of the ontology used for annotation. But there are some more questions arising when it comes to annotation, such as: Should the annotation go into a separate column? What about multiple annotations for one parameter (e.g. 'Average % inhibition', which will result in three ontology terms as there is no single term defined)?
suggestion: provide term description + identifier for each of the field headers added to the worksheet by the workflow execution. see above
@proccaserra is lifting this report into the final report on ND4BB dataset, https://docs.google.com/document/d/1iBNsxBg27Ak1Ysg63dcImeX3PZnDxeZu7uLcS5uStvw/edit#
ND4BB recipe v2, based on Manfred's original report.
Changes:
V2 was also added to the FAIR cookbook page. https://fairplus.github.io/the-fair-cookbook/recipes/nd4bb_raw/FAIRification_CookBook_Recipe1_V02.html
@mko sent early on: https://drive.google.com/open?id=1dudvrG-dtfwm0fseQ1HAFwSMujSuSwLD
@oyadenizbeyan did some evaluation of the workflows; @mkoatwork would like to know if the process he followed is a good start or not. Specific interest in identifiers. Question about what counts as metadata: "provenance" vs. metadata about the data inside the dataset, e.g. identifiers, column names, etc.