Acceptance Criteria

List the criteria which must be met for the issue to be considered complete

A successful breakout session around Data Modelling organised and run
Documented outputs produced to inform Data Modelling activities
Subsequent issues raised on Data Model implementation

richard-jones commented 2 months ago

@LalithaKambhammettu @npapantonis will identify the stakeholders to be involved in this.

@richard-jones to write a short summary of what we need to achieve by the end

npapantonis commented 1 month ago

Wayne Peters w.peters@imperial.ac.uk to potentially lead the data modelling conversation, given his extensive experience of supporting this operationally. There has been some discussion around a "generic" baseline set / schema for rich metadata. We also need to keep abreast of topical extensions and how we accommodate these (e.g. Chemistry), as some topics may require different entities and attributes or extended sets. Do we 'call in' separate schemas? How do we map and represent this on the front end? Would it be an initial selection process to determine what is being deposited and then provide the relevant set(s)? There might be an internal requirement for us to reach out to departments in order to understand the variants of topics, but we should initially focus on a generic rich baseline.

Attendees: Wayne Peters w.peters@imperial.ac.uk; Nicholas E M Wood nicholas.wood@imperial.ac.uk; Ian J McArdle i.mcardle@imperial.ac.uk; David J Colling d.colling@imperial.ac.uk; Lalitha K S Kambhammettu l.kambhammettu@imperial.ac.uk; Trevor C Newbury t.newbury@imperial.ac.uk; Christopher I Cave-Ayland c.cave-ayland@imperial.ac.uk; Noel Papantonis n.papantonis@imperial.ac.uk

richard-jones commented 1 month ago

These are the points I would want to address during the workshop:

Review the object types that will exist in the repository and their properties
Discuss the search and discovery requirements. Presence in OAI and other interfaces, appearance in UI and user search/filter tools, and any other places where the data is available to end users
Any long term requirements for the metadata (e.g. metadata for preservation)
What existing standards are already in use at the College
Where does the content come from, and what (if anything) is produced at source, metadata-wise
Any specific, special requirements for the metadata that we know of?

richard-jones commented 1 month ago

My notes from today:

Agreed that in the first instance Imperial will adopt the DataCite schema and present that to the FAIR data working group
Imperial team will agree with the FAIR data working group their core set of fields
It would also be useful to know at this point the following things about the metadata fields:
- The object type to which they apply (Collection, Dataset)
- Whether the field is searchable (and whether it would need to appear in a facet)
- Where in the various interfaces it should appear (search pages, landing pages, OAI endpoint, etc)
- Where the data is sourced from (submission form, automatically generated at source, extracted from data, etc)
Agreed that software is out of scope for now
@Steven-Eardley will report back on Invenio's capabilities around concept DOIs and DOIs for records which change/have versions
The existing metadata profile developed at Imperial is attached to this issue for reference

ICDR_MetadataProfile_v1.0.xlsx
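
One possible way to record the per-field information listed in the notes above (object type, searchable/facet, interfaces, source) is sketched below. This is only an illustration: the class name `FieldProfile`, the helper `facet_fields`, and the example field names `title` and `subject` are assumptions and are not taken from the attached profile.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectType(Enum):
    COLLECTION = "Collection"
    DATASET = "Dataset"


@dataclass(frozen=True)
class FieldProfile:
    """Hypothetical description of one metadata field, per the workshop notes."""
    name: str
    applies_to: frozenset          # set of ObjectType values the field applies to
    searchable: bool = False       # can users search on this field?
    facet: bool = False            # should it appear as a search facet?
    interfaces: tuple = ()         # e.g. "search page", "landing page", "OAI endpoint"
    source: str = "submission form"  # or "generated at source", "extracted from data"


def facet_fields(profiles):
    """Return the names of all fields that should be exposed as facets."""
    return [p.name for p in profiles if p.facet]


# Illustrative entries only; the real list would come from the agreed core set.
title = FieldProfile(
    name="title",
    applies_to=frozenset({ObjectType.COLLECTION, ObjectType.DATASET}),
    searchable=True,
    interfaces=("search page", "landing page", "OAI endpoint"),
)
subject = FieldProfile(
    name="subject",
    applies_to=frozenset({ObjectType.DATASET}),
    searchable=True,
    facet=True,
)
```

Calling `facet_fields([title, subject])` would list only the fields flagged for faceting, which could feed the search configuration once the core set is agreed.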

richard-jones commented 3 weeks ago

@Steven-Eardley and @richard-jones to have a look at the md profile and review if there are any special requirements for implementation

@npapantonis to check with Wayne about using the shared document: https://imperiallondon.sharepoint.com/:x:/s/Project/FAIR%20Data%20Working%20Group%20Initiative/EUBqFB0mXBRIo6_EFsOvJ-MBpj3ALBv4_Xil9K8qPJASQg?e=lbhSR0

richard-jones commented 2 weeks ago

Speculative date of 23rd September for on-site

richard-jones commented 2 weeks ago

I have done some slightly deeper analysis of the metadata profile and its mappings to DataCite and InvenioRDM:

https://docs.google.com/spreadsheets/d/106fzB6EiVJmnd3kRmTc1HEkx2h8zTR2S8PFlr-_qMHY/edit?gid=1013395959#gid=1013395959

A couple of questions/points to raise:

PublicationYear in DataCite only accepts the year, while publication_date in Invenio is a full date, so we can store the longer date in Invenio and map to the year when serialising to DataCite XML
It would be good to know a little more about what details we'll capture for authors, especially ORCIDs and affiliations. I recommend against capturing "first names" and "family names"; just capture the full "name"
Are there identifiers other than DOIs that we need to consider?
GeoLocation metadata is a bit complicated, do we have specific use cases for it, or do we need to support the full capabilities?
I have made a suggestion as to how to handle temporal coverage: dates/date @dateType="Created|Other" @dateInformation; dateInformation can be used to refine the details of the date, so we could add info there
Is "access level" bibliographic information, or is that security settings for the repository? (i.e. can we omit it from the metadata schema?)
I added the DataCite-recommended way of identifying an embargo, via date accepted and date available
Is "depositor" bibliographic information or just a repository operational thing? I have suggested a couple of ways to accommodate this in DataCite if it is bibliographic
Is "contact information" something that we want to put in the public record? If so, I have made a suggestion to use authorIdentifiers to convey email addresses
The "data access statement" doesn't sit anywhere obvious in DataCite. I have suggested putting it in description, but it is not a comfortable fit.
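
To illustrate the first and seventh points, here is a rough sketch of deriving the DataCite year from the stored date and of expressing an embargo as a pair of dates. The function names are hypothetical, the sketch assumes `publication_date` holds a single date (`YYYY`, `YYYY-MM` or `YYYY-MM-DD`, not an interval), and the output uses the `date`/`dateType` keys of DataCite's JSON representation rather than XML.

```python
def publication_year(publication_date: str) -> str:
    """Return the four-digit year from a YYYY, YYYY-MM or YYYY-MM-DD string.

    Assumption: the stored value is a single date, not a date interval.
    """
    year = publication_date.strip().split("-", 1)[0]
    if len(year) != 4 or not year.isdigit():
        raise ValueError(f"cannot derive a year from {publication_date!r}")
    return year


def embargo_dates(accepted: str, available: str) -> list:
    """Express an embargo as DataCite-style dates.

    'Accepted' marks the start of the embargo period and 'Available'
    marks the date on which the resource becomes publicly available.
    """
    return [
        {"date": accepted, "dateType": "Accepted"},
        {"date": available, "dateType": "Available"},
    ]
```

For example, `publication_year("2024-09-23")` gives `"2024"`, so the full date can stay in the repository record while only the year is serialised.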

wayneptr commented 1 week ago

Hi Richard. My responses:

2 The main information for authors will include their name, name identifier (ORCiD), and affiliation (ROR). DataCite and InvenioRDM include Name type (i.e. to differentiate between person or organisation), but not sure we need this on the submission form. If we don't capture "first name" and "family names" will this affect mapping to DataCite?

3 Main identifiers will be DOI, ORCiD and ROR (will we also assign unique system IDs to records for internal use?). Additional identifiers may be needed for Alternate identifiers/Related resources - the InvenioRDM Metadata Reference provides a list of supported identifier schemes.

Also - just to make things more complicated - we would like to allow users to publish metadata records for externally hosted datasets. Ideally, there should be an additional field called “Existing DOI.” When populated, this field would prevent a new DOI from being created for the record. However, not all repositories use DOIs, so we might also need to consider how to enable non-DOI identifiers, such as accession numbers, to be added to the repository record.

4 We don't have specific use cases for it. I'm not sure we need to support full capabilities in the first release, but I might need to raise this with other members of the breakout group. We could just have a free text box that allows depositors to enter geographic region(s) or named place associated with the dataset. This would map to DataCite 18.3 geoLocationPlace and presumably 'place' in the InvenioRDM metadata reference model. Many institutional data repositories (let's call them IDRs) do include an additional field for coordinate values, but not all do.

6 Depositors should have the option to select access level on the submission form (most likely "Open", "Embargoed", "Restricted"). But this would not be displayed on the public metadata record (not sure if this answers your question!). Although obviously the public record will need to inform end users that access is restricted (see comments for 10. below).

8 Some IDRs make depositor information public, but most do not. Therefore, it probably doesn’t need to be on the public metadata record. This information is mainly for cases where the depositor is not one of the listed data creators. In some IDRs, the depositor field is automatically populated from the login details. Is this something we could implement?

9 Again, some IDRs do include contact information on the public record but most don't. The alternative, for restricted datasets, is to publish an admin email address to the public record (see below).

10 One repository added a text prompt so that if ‘restricted’ access is selected, depositors are asked to add details in the description field. Is this something we could do? I haven’t found any examples of submission forms with a separate field for data access statements in the IDRs I reviewed (and I reviewed many). I suspect this is because most of them only accept open access or embargoed datasets. I did find a few repositories that provide managed access to sensitive data. Rather than asking depositors to provide details of access restrictions, if “restricted” is selected, pre-defined text is added to the public record informing the end user that access is restricted, with details of who to contact to request access (e.g., the library/repository admin staff). Is this something we could implement? Here are some examples:

https://researchdata.uwe.ac.uk/id/eprint/703/
https://data.bris.ac.uk/data/dataset/1cq4ulhrjdmpf240uhjb2o6jov
https://researchdata.bath.ac.uk/1328/

richard-jones commented 5 days ago

> If we don't capture "first name" and "family names" will this affect mapping to DataCite?

No, these fields are optional in DataCite. I recommend against breaking the name down in this way because the split is slightly artificial for a lot of names. It's easier not to open that door and just let people tell us their name as they see it.

DataCite does recommend a "family" followed by "given" ordering even in the general name field, but this is advisory only, I believe.
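
As a rough sketch of what a creator could look like with only a full name captured, the function below builds a DataCite-style creator entry (JSON key names) from a name, an optional ORCID iD and an optional ROR affiliation. The function name and its parameters are hypothetical; the ORCID shown in the usage note is ORCID's own documented example iD, and the ROR id is a made-up placeholder.

```python
def datacite_creator(name, orcid=None, ror_id=None, affiliation_name=None,
                     name_type="Personal"):
    """Build a DataCite-style creator without givenName/familyName.

    Only the full name is required; identifiers are added when supplied.
    """
    creator = {"name": name, "nameType": name_type}

    if orcid:
        creator["nameIdentifiers"] = [{
            "nameIdentifier": f"https://orcid.org/{orcid}",
            "nameIdentifierScheme": "ORCID",
        }]

    if ror_id or affiliation_name:
        affiliation = {}
        if affiliation_name:
            affiliation["name"] = affiliation_name
        if ror_id:
            affiliation["affiliationIdentifier"] = f"https://ror.org/{ror_id}"
            affiliation["affiliationIdentifierScheme"] = "ROR"
        creator["affiliation"] = [affiliation]

    return creator
```

For instance, `datacite_creator("Josiah Carberry", orcid="0000-0002-1825-0097", ror_id="00example0", affiliation_name="Example University")` would yield a creator with a name, one ORCID name identifier and one ROR-identified affiliation, and no given/family name split.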

> Also - just to make things more complicated - we would like to allow users to publish metadata records for externally hosted datasets. Ideally, there should be an additional field called “Existing DOI.” When populated, this field would prevent a new DOI from being created for the record. However, not all repositories use DOIs, so we might also need to consider how to enable non-DOI identifiers, such as accession numbers, to be added to the repository record.

I believe this is possible using the external type on the pid record, but I will review with @Steven-Eardley and @J4bbi
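
A minimal sketch of the "Existing DOI" behaviour, pending that review: if the depositor supplies a DOI, the record carries it as an externally managed identifier; if not, the identifier section is left empty so the repository can mint one. The helper names are hypothetical, and the `pids` shape with `"provider": "external"` is my understanding of how InvenioRDM represents external DOIs, which should be checked against the InvenioRDM metadata reference.

```python
import re

# Common ways a DOI is pasted into a form; stripped before validation.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalise_doi(value: str) -> str:
    """Strip resolver/scheme prefixes and check the basic DOI shape."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not _DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {value!r}")
    return doi


def build_pids(existing_doi=None) -> dict:
    """Return the identifier section for a draft record.

    Assumption: an empty dict means "no DOI supplied", leaving the
    repository free to mint a new one on publication.
    """
    if existing_doi and existing_doi.strip():
        return {"doi": {"identifier": normalise_doi(existing_doi),
                        "provider": "external"}}
    return {}
```

Non-DOI identifiers such as accession numbers are not covered here; they would need a separate alternate-identifier field.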

> Some IDRs make depositor information public, but most do not. Therefore, it probably doesn’t need to be on the public metadata record. This information is mainly for cases where the depositor is not one of the listed data creators. In some IDRs, the depositor field is automatically populated from the login details. Is this something we could implement?

Automatically generating this makes sense; I will add it to the implementation requirements

> One repository added a text prompt so that if ‘restricted’ access is selected, depositors are asked to add details in the description field. Is this something we could do?

Yes, that's a good idea. After discussion with the team here, we think that if we're going to capture this, then a custom field for this information is best, so we can ask users to populate that field and then use it to display the information in the relevant places. Otherwise, we can automatically populate the custom field with some default text if the access restrictions are set.
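
The fallback logic could look roughly like the sketch below. Everything named here is an assumption for illustration: the function `access_statement`, the default wording, the contact text, and the choice to apply the default only when the access level is "Restricted".

```python
# Hypothetical default wording; the real text would be agreed with the library.
DEFAULT_RESTRICTED_STATEMENT = (
    "Access to this dataset is restricted. "
    "To request access, please contact {contact}."
)


def access_statement(access_level, depositor_text=None,
                     contact="the repository team"):
    """Return the text to store in the custom access-statement field.

    - If the depositor supplied a statement, use it as written.
    - Otherwise, if access is restricted, fall back to the default text.
    - Otherwise there is nothing to display, so return None.
    """
    if depositor_text and depositor_text.strip():
        return depositor_text.strip()
    if access_level == "Restricted":
        return DEFAULT_RESTRICTED_STATEMENT.format(contact=contact)
    return None
```

Keeping the default as a template means the contact details can be changed in one place without editing existing records' depositor-supplied statements.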

I've added implementation notes to our spreadsheet here https://docs.google.com/spreadsheets/d/106fzB6EiVJmnd3kRmTc1HEkx2h8zTR2S8PFlr-_qMHY/edit?gid=1013395959#gid=1013395959

ImperialCollegeLondon / fair-data-repository

Set up Data Modelling breakouts #40
