IQSS / dataverse

Open source research data repository software
http://dataverse.org

resourceType for dataset files #5086

Open philippconzett opened 6 years ago

philippconzett commented 6 years ago

File DOIs from Dataverse are marked with "Dataset" in DataCite Fabrica, thus in the same way as dataset DOIs are; see this screenshot:

image

According to @pdurbin (cf. this post in the Dataverse Google Group),

"Dataset" is coming from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml#L12 which is referenced from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L279 . As you can see, it's hard coded to "Dataset". You're saying that for files it should be something other than "Dataset", right? "File" or whatever. If so, can you please open a GitHub issue about this? We recently worked on this part of the code at https://github.com/IQSS/dataverse/pull/4795 for https://github.com/IQSS/dataverse/issues/4782 if you'd like to take a look.

I suggest that the metadata of files in Dataverse be changed, so that their DOIs show up not as "Dataset", but as "Dataset file" in DataCite Fabrica. I'm not sure which metadata field we should use for this. The DataCite metadata field resourceTypeGeneral (an attribute of resourceType) is mandatory, and I guess it is the value of this field that is reflected in DataCite Fabrica. But according to the DataCite Metadata Schema 4.0, resourceTypeGeneral can only contain the following controlled list values:

Audiovisual, Collection, Dataset, Event, Image, InteractiveResource, Model, PhysicalObject, Service, Software, Sound, Text, Workflow, Other

The list does not contain "Dataset file" or similar. So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType.
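
To make the proposal concrete, here is a minimal Python sketch of what the resulting DataCite element could look like. It only uses the standard library; the function name `resource_type_element` is hypothetical and this is not how Dataverse builds its metadata (Dataverse uses the XML template linked above). The controlled list is the one quoted in this issue.

```python
import xml.etree.ElementTree as ET

# Controlled values for resourceTypeGeneral as listed in this issue
# (DataCite Metadata Schema 4.0).
RESOURCE_TYPE_GENERAL = {
    "Audiovisual", "Collection", "Dataset", "Event", "Image",
    "InteractiveResource", "Model", "PhysicalObject", "Service",
    "Software", "Sound", "Text", "Workflow", "Other",
}


def resource_type_element(general: str, free_text: str) -> str:
    """Return a DataCite-style <resourceType> element as a string.

    `general` must come from the controlled list; `free_text` is the
    unrestricted resourceType value. (Hypothetical helper, not Dataverse code.)
    """
    if general not in RESOURCE_TYPE_GENERAL:
        raise ValueError(f"{general!r} is not a valid resourceTypeGeneral")
    elem = ET.Element("resourceType", {"resourceTypeGeneral": general})
    elem.text = free_text
    return ET.tostring(elem, encoding="unicode")


# Dataset = resourceTypeGeneral, File = resourceType
print(resource_type_element("Dataset", "File"))
```

Passing "Dataset file" as the first argument raises a `ValueError`, which mirrors the constraint described above: only the free-text part can carry a value like "File".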

philippconzett commented 6 years ago

I'm not sure whether I understand your question, @jggautier. But DataCite now displays all our files as datasets in the search engine. The search there returns 1 041 results labelled as datasets, but we only have 178 datasets, so the rest are files.

jggautier commented 6 years ago

Hi @philippconzett. Did you mean this question?: "Are dataset and file metadata records already sent to EZID/DataCite being updated?"

I referenced this GitHub issue in that issue (#5060), which is about investigating whether EZID and DataCite are getting any new metadata that Dataverse sends (as Dataverse changes things like the resourceType values for files) and making sure that the existing metadata records that EZID and DataCite have are updated to reflect those changes. Please let me know if you have any questions.

But DataCite now displays all our files as datasets in the search engine

In the Google Group conversation I thought we were discussing only how the datasets and files were displayed in Fabrica. But here do you mean the list of resource types in DataCite Search?

[Screenshot, 2018-10-05]

philippconzett commented 6 years ago

Hi @jggautier, sorry for the confusion, but I think the display behavior in DataCite Fabrica and in DataCite Search are both based on the Resource type. But I'm not sure whether there is a Resource type = File (or Dataset File) in DataCite. I guess other data repository applications also are interested in getting their file DOIs viewed as files and not as datasets in both DataCite Fabrica and in DataCite Search.

jggautier commented 6 years ago

I agree that in DataCite Search, the resource type is based on the controlled vocab you listed, and there's nothing like file. I like your earlier suggestion:

So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType

As long as we don't get too semantic with the word "file," since I imagine some people might ask "what about archived files, like zip files, or things in datasets that are collections of files?" Would you say the value is in being able to, in DataCite Search and Fabrica, distinguish between and filter for datasets versus the things within datasets that have bytes?

We'll have to get DataCite involved, and their metadata team has been responsive during similar conversations about resourceType in their DataCite Metadata forum.

Would you mind writing them about this use case?

philippconzett commented 6 years ago

Thanks, @jggautier, I have raised this issue in the DataCite Metadata forum; see this posting.

mfenner commented 6 years ago

I suggest distinguishing between what can be done with the DataCite Metadata Schema now and how the metadata schema could be updated in the future (the next schema release, planned for the end of 2018, is basically finalized, so that would be the second half of 2019 at the earliest).

With the current schema resourceTypeGeneral Dataset is the best fit, and you can add granularity via resourceType (which is a free text field). I like DataFile, but would also consider DataDownload, which is used in DCAT and schema.org: https://schema.org/DataDownload.

pdurbin commented 6 years ago

@mfenner thanks for mentioning DataDownload, which seems like an emerging standard for providing the URLs to download individual files. Last week I wrote about it at https://github.com/whole-tale/whole-tale/issues/35#issuecomment-427411937 in the context of #4371.
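
For readers unfamiliar with DataDownload, the following Python sketch shows how schema.org expresses a dataset whose files are listed as `distribution` entries of type `DataDownload`. The property names (`distribution`, `contentUrl`, `encodingFormat`) are from schema.org; the function name, the DOI `10.1234/EXAMPLE`, and the `example.org` URL are placeholders invented for illustration, and this is not necessarily what Dataverse exports.

```python
import json


def dataset_jsonld(dataset_url: str, name: str, files: list[dict]) -> dict:
    """Build a schema.org Dataset whose files appear as DataDownload items.

    Hypothetical helper: each entry in `files` needs the keys
    "name", "url" and "media_type".
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": dataset_url,
        "name": name,
        "distribution": [
            {
                "@type": "DataDownload",
                "name": f["name"],
                "contentUrl": f["url"],
                "encodingFormat": f["media_type"],
            }
            for f in files
        ],
    }


# Placeholder values only; none of these identifiers are real.
record = dataset_jsonld(
    "https://doi.org/10.1234/EXAMPLE",
    "Example dataset",
    [{"name": "data.csv",
      "url": "https://example.org/files/data.csv",
      "media_type": "text/csv"}],
)
print(json.dumps(record, indent=2))
```

In this model the file is described inside the dataset record rather than as a separate top-level record, which is one way to keep files from being listed alongside datasets.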

philippconzett commented 5 years ago

I just noticed that this issue is still being discussed by other users as well; cf. this thread in the Dataverse Google group.

philippconzett commented 4 years ago

I'd like to urge DataCite (@mfenner) to follow up on this issue. The current situation is quite unsatisfactory as file metadata is confused with dataset metadata, resulting in i.a. a proliferation of file metadata records listed in DataCite Search result lists and ORCID record search result lists.

Currently, DataCite (in DataCite Fabrica) offers the following values for Resource Type General:

image

For files within a dataset, I suggest we use Dataset file or Dataset part or Part of Dataset.

Thanks!

philippconzett commented 4 years ago

See also the discussion thread Granularity of datasets in the PID Forum.

mfenner commented 4 years ago

@philippconzett you beat me to it, I was just about to post the link.

jggautier commented 2 years ago

I'm helping look into an issue with how the metadata that Dataverse sends to DataCite affects how datasets and files are displayed in an Elsevier product called Data Monitor (https://www.elsevier.com/solutions/data-monitor). Data Monitor apparently grabs from DataCite the metadata of Harvard Dataverse Repository datasets and files (for files that were assigned PIDs before the feature was turned off). And apparently Data Monitor uses some sort of algorithm to figure out which files are parts of which datasets so that it's possible in their product to display only datasets.

I'm planning on contacting Elsevier to find out more and all of this reminded me of this issue. Might be helpful to learn what Elsevier is doing with the DataCite metadata it gets.

philippconzett commented 2 years ago

I guess the answer might be as simple as the file DOIs in Dataverse having the structure

dataset DOI + file suffix

Example from DataverseNO: Dataset DOI: https://doi.org/10.18710/QBSWEH File DOI: https://doi.org/10.18710/QBSWEH/FJR0YN
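
A rough Python sketch of how an aggregator might exploit this structure, assuming it already has a set of known dataset DOIs. The function names are hypothetical, and the approach is only a heuristic: a DOI suffix is opaque in general, so a slash does not by itself prove a parent/child relationship.

```python
from typing import Optional

DOI_RESOLVER = "https://doi.org/"


def bare_doi(doi: str) -> str:
    """Strip the resolver prefix and normalise case (DOIs are case-insensitive)."""
    doi = doi.strip()
    if doi.lower().startswith(DOI_RESOLVER):
        doi = doi[len(DOI_RESOLVER):]
    return doi.upper()


def parent_dataset(file_doi: str, dataset_dois: set[str]) -> Optional[str]:
    """Return the known dataset DOI that `file_doi` extends by one suffix segment.

    Returns None when the DOI is not "dataset DOI + file suffix" for any
    known dataset.
    """
    candidate, sep, suffix = bare_doi(file_doi).rpartition("/")
    known = {bare_doi(d) for d in dataset_dois}
    if sep and suffix and candidate in known:
        return candidate
    return None


datasets = {"https://doi.org/10.18710/QBSWEH"}
file_parent = parent_dataset("https://doi.org/10.18710/QBSWEH/FJR0YN", datasets)
dataset_parent = parent_dataset("https://doi.org/10.18710/QBSWEH", datasets)
print(file_parent)     # parent found for the file DOI
print(dataset_parent)  # no parent for the dataset DOI itself
```

Requiring the parent to be in a known set avoids misreading installations whose dataset DOIs already contain an extra slash-separated segment.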

jggautier commented 2 years ago

Hmm, that might be a factor, too.

Folks at Elsevier confirmed today that, from the metadata it gets from DataCite, the "HasPart" relationType in dataset metadata and the "IsPartOf" relationType in file metadata are used to figure out which files are part of which datasets. Doesn't sound very foolproof to me, since a dataset could be a part of another dataset. But maybe publishers aren't sending that kind of relationship metadata to DataCite and maybe despite DataCite's reservations, most publishers are registering DOIs with the kind of structure from your examples.
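
The relation-based approach can be sketched in Python as follows. This is not Elsevier's algorithm, which is not described here; it only illustrates the idea. The property names `relatedIdentifiers`, `relatedIdentifier`, `relatedIdentifierType` and `relationType` follow DataCite's metadata, while the flat record shape, the `doi` key, the function names and the relations shown are assumptions made for the example.

```python
from collections import defaultdict

# Simplified, made-up records loosely modelled on DataCite metadata.
records = [
    {
        "doi": "10.18710/QBSWEH",
        "relatedIdentifiers": [
            {"relatedIdentifier": "10.18710/QBSWEH/FJR0YN",
             "relatedIdentifierType": "DOI",
             "relationType": "HasPart"},
        ],
    },
    {
        "doi": "10.18710/QBSWEH/FJR0YN",
        "relatedIdentifiers": [
            {"relatedIdentifier": "10.18710/QBSWEH",
             "relatedIdentifierType": "DOI",
             "relationType": "IsPartOf"},
        ],
    },
]


def parents_of(record: dict) -> list[str]:
    """DOIs that this record declares itself to be part of."""
    return [
        rel["relatedIdentifier"]
        for rel in record.get("relatedIdentifiers", [])
        if rel.get("relationType") == "IsPartOf"
    ]


def files_by_dataset(items: list[dict]) -> dict[str, list[str]]:
    """Map each parent DOI to the DOIs of the records that are part of it."""
    grouped = defaultdict(list)
    for item in items:
        for parent in parents_of(item):
            grouped[parent].append(item["doi"])
    return dict(grouped)


def top_level(items: list[dict]) -> list[str]:
    """DOIs of records with no IsPartOf relation (shown as datasets)."""
    return [item["doi"] for item in items if not parents_of(item)]


print(files_by_dataset(records))
print(top_level(records))
```

The weakness mentioned above is visible in `top_level`: a dataset that declares IsPartOf another dataset would be filtered out exactly like a file.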

j-n-c commented 2 years ago

This issue came up in a recent discussion in the community group.

In the (current) latest version of Dataverse Software (5.11.1), resourceTypeGeneral is still hardcoded to Dataset: https://github.com/IQSS/dataverse/blob/develop/src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml

It would be great if the priority of this issue could be raised, so that interoperability between the Dataverse software and other platforms can be improved.

qqmyers commented 2 years ago

Attempting to summarize this issue - there are ~3 proposals for what is needed to have files recognized:

On the community call there was discussion of checking on the proposal for the next DataCite schema to see if something is included w.r.t. a different resourceType for files. If someone checks that, we could either provide feedback on the proposal and/or plan to change the file resourceType when that option is available.

Are there other things proposed here that could/should be acted on?

pdurbin commented 2 years ago

From a quick look at the 4.5 draft at https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/appendices/appendix_1/resourceTypeGeneral.html via the RFC at https://docs.google.com/document/d/1UyQQwtjnu-4_4zXE4TFZ74-mjLZI3NkEf8RrF0WeOdI/edit?usp=sharing we still don't have a good way to distinguish between a dataset and a datafile. Here's the list:

Dataset is described like this...

... and it seems in line with what we call a dataset in Dataverse. (Here's a link to the example: https://doi.org/10.1594/PANGAEA.804876 .) It's metadata, and from that metadata you can figure out how to download the actual data files.

If something like DataFile or just Data appeared in the list above, I'd probably say we should use it for files in Dataverse. But there's nothing there so we're sort of stuck unless we get something like DataFile or Data added to the DataCite schema.

Another thought... would it help to use multiple resourceTypeGeneral types for files? That is, send a different resourceTypeGeneral to DataCite based on the file type. Based on the most popular file types in Harvard Dataverse, here's a proposed mapping:

Obviously, this falls down for Data and Tabular Data. I simply put ??? above for those. Here's a screenshot to make this a bit more concrete:

[Screenshot, 2022-09-21]
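
The idea of choosing resourceTypeGeneral per file type can be sketched in Python as below. The mapping proposed in the comment is not preserved in this copy of the thread, so the table here is NOT that mapping; it is a purely illustrative stand-in keyed on top-level MIME types, and all names in it are hypothetical. Types with no good fit return `None`, which corresponds to the "???" entries mentioned above.

```python
from typing import Optional

# Illustrative only: not the mapping proposed in the comment above.
# Keys are top-level MIME types; values come from the DataCite
# resourceTypeGeneral controlled list.
GENERAL_TYPE_BY_MIME_FAMILY = {
    "image": "Image",
    "audio": "Sound",
    "video": "Audiovisual",
    "text": "Text",
}

# Tabular formats are textual on disk, but "Text" does not describe them
# well, so this sketch leaves them unmapped on purpose (an assumption).
NO_GOOD_FIT = {"text/tab-separated-values", "text/csv"}


def general_type_for(mime_type: str) -> Optional[str]:
    """Return a resourceTypeGeneral for a MIME type, or None if none fits."""
    mime_type = mime_type.strip().lower()
    if mime_type in NO_GOOD_FIT:
        return None
    family = mime_type.split("/", 1)[0]
    return GENERAL_TYPE_BY_MIME_FAMILY.get(family)


for mime in ("image/png", "text/tab-separated-values", "application/octet-stream"):
    print(mime, "->", general_type_for(mime))
```

The unmapped cases are the crux of the problem: generic data and tabular data are the most typical contents of a dataset, and they are precisely the ones without a matching value.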

mfenner commented 2 years ago

The Dataset/Datafile discussion is an older one, going back to, for example, how schema.org and DCAT handle this. I think it is worth discussing again for the 4.5 schema.

qqmyers commented 2 years ago

@mfenner - what's the best way for us to do this as a community? I think there are multiple people and groups interested in this. I see the online v4.5 material has comment forms. Should we just use those?

philippconzett commented 2 years ago

Glad you are revitalizing this discussion. The issue was recently discussed on a Dataverse community call (see notes) and I also pitched it at the DataCite member meeting earlier this week.

mfenner commented 2 years ago

I would provide feedback via https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/index.html, there is a comments box at the bottom. I am also interested in this topic via my involvement in the InvenioRDM project.

philippconzett commented 2 years ago

Thanks, @mfenner. Is there a deadline for feedback to be considered for v. 4.5? @qqmyers Maybe we could have a Dataverse Metadata IG call about this?

mfenner commented 2 years ago

I don't know the timeline of the 4.5 release, as I am no longer involved.

philippconzett commented 2 years ago

Sorry, @mfenner, I keep forgetting you no longer are at DataCite :-/ I just saw that the Google doc which the GitHub page links to will be open for comment through October 17, 2022.

mfenner commented 2 years ago

No problem, I still care about the DataCite metadata schema, now mainly in the context of my work on InvenioRDM.

pdurbin commented 2 years ago

I was just on a call with @mjbuys and I wanted to ask him about his thoughts on resourceTypeGeneral for files. 😄

Again, my take is adding a very generic type such as "Data" or "DataFile" would help.

(Alongside "Audiovisual", "Book", etc. from the list at https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/appendices/appendix_1/resourceTypeGeneral.html .)

mjbuys commented 2 years ago

Thanks @pdurbin. Unfortunately as this is a proposed change to the schema (rather than comments on the draft 4.5 schema), this would need to be considered for future releases. For context, all changes to the schema go through community validation and extensive discussion in the metadata sub-groups. It would be great if you can submit an idea through our roadmap (https://datacite.org/roadmap.html), see metadata changes at the bottom of each tab.

I am tagging @KellyStathis who leads our work with the metadata working group. It may be that we can address this use case through use of both resourceTypeGeneral and resourceType properties to describe the different entities; and use the relationType field to describe the relationship between these entities (the dataset and the datafile). @KellyStathis what are your thoughts? (@pdurbin Kelly is out through next Monday so you will likely only get a response then. Let me know if this is more pressing).

pdurbin commented 2 years ago

@mjbuys thanks for clarifying about the Oct 17 deadline, that it's to leave comments rather than make proposed changes (late in the game! 😄 ).

A bunch of us just met about metadata and what to send to DataCite in the future. Notes are here: https://docs.google.com/document/d/1tNnvVh8jYY1g53BEwpJmMmm9w6Vgy_Q7RrmFjGnYOyA/edit?usp=sharing

In summary, we're pretty sure we'd like to use our OpenAIRE export as a basis for making improvements to what we send to DataCite.

That doesn't really address this issue (#5086) about files, so I'm getting a little off-topic. 😄 Some day.

Anyway, yes, we'd love to chat more with you and @KellyStathis some day. No, it isn't pressing. 🏖️ Thanks! We'll be in touch!

KellyStathis commented 2 years ago

My initial 2 cents:

Going forward, I see the benefit of having a more structured way to distinguish files (like a specific ResourceTypeGeneral); among other reasons, it is important for aggregators to be able to filter these out. As @jggautier mentioned above, HasPart/IsPartOf can also be used for a dataset that is part of another dataset, so it isn't foolproof. I've saved a link to this discussion in our internal system for tracking schema suggestions, so we can take this suggestion into account for version 5.0. Additional thoughts via our Roadmap are also welcome!

It is also worth considering how this would intersect with the proposed Distribution property in 4.5. At the dataset DOI level, there could be some redundancy between the RelatedIdentifier property (HasPart) and the Distribution property's contents—both of which may include references to file-level DOIs. Discussion about this proposed Distribution property is in https://github.com/datacite/schema-docs/issues/7 and in the RFC Google Doc: DataCite Metadata Schema 4.5: Request for Comments.

I'm also curious if anyone knows of other repository platforms registering DOIs for files, in addition to Dataverse? That would also be helpful for us in understanding the use case. (It sounds like InvenioRDM is interested in this, @mfenner?)

mfenner commented 2 years ago

Thanks @KellyStathis, distribution is basically about the same idea (e.g. dataset and distribution in DCAT); I missed that in my initial comment. The InvenioRDM community is currently mainly focused on launching repositories in production, so a DataCite metadata schema change will probably be of more interest a bit later, e.g. 2024.