IQSS / dataverse

Open source research data repository software
http://dataverse.org

Feature Request/Idea: RO-Crate support #8688

Open beepsoft opened 2 years ago

beepsoft commented 2 years ago

As a followup to the discussions on Element with @pdurbin and @poikilotherm I would like to start a discussion here on the possible support of RO-Crate (https://www.researchobject.org/ro-crate/) in Dataverse.

I found these slides https://zenodo.org/record/4973678 and recording https://www.youtube.com/watch?v=LJq-mzT9v8o&t=1731s of the Dataverse Community Meeting from 2021 where Stian Soiland-Reyes discusses possibilities of RO-Crate export/import in Dataverse.

I was wondering if there is any followup to this presentation and whether there are official or community plans to support RO-Crate?

Let me give you a little background on how we imagine using an RO-Crate-enabled Dataverse.

We are working on a new system built around Dataverse, in which we would like to support RO-Crate as a dataset input format alongside the usual Dataverse way of uploading datasets and annotating them with metadata.

RO-Crate and Dataverse

Ideally our system would allow uploading and ingesting RO-Crate packages (e.g. as .zip or BagIt) in Dataverse. For creating RO-Crates we plan to provide an RO-Crate Editor[1], but users can assemble RO-Crates with any tool they see fit. To be ingestible by Dataverse, an RO-Crate must carry metadata that uses schemas Dataverse understands, so the RO-Crate Editor and Dataverse must use the same schemas. As of now Dataverse provides 15 such schemas out of the box as "metadata blocks", so these schemas should be available to the RO-Crate Editor as well. In our system we would like to have an external "Schema registry" for storing these schemas, and we imagine that the schemas would then be uploaded to and configured in both Dataverse and the RO-Crate Editor so that the two stay compatible when working with the metadata in the RO-Crates.
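To make the ingest scenario a bit more concrete, here is a minimal Python sketch of the first step such an import would need: opening a zipped RO-Crate and locating its Root Data Entity. It assumes that ro-crate-metadata.json sits at the top level of the zip and that the crate follows RO-Crate 1.1; the helper names and the tiny example crate are made up for illustration and are not part of Dataverse or of any existing tool.

```python
import io
import json
import zipfile

METADATA_FILE = "ro-crate-metadata.json"  # file name used by RO-Crate 1.1


def read_root_entity(zip_bytes: bytes) -> dict:
    """Return the Root Data Entity of a zipped RO-Crate."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        metadata = json.loads(archive.read(METADATA_FILE))
    entities = {entity["@id"]: entity for entity in metadata["@graph"]}
    # The metadata descriptor points at the root entity via "about".
    descriptor = entities[METADATA_FILE]
    return entities[descriptor["about"]["@id"]]


def build_example_crate() -> bytes:
    """Build a tiny in-memory crate so the sketch runs without any files."""
    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": METADATA_FILE,
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": "Example dataset",
                "hasPart": [{"@id": "data.csv"}],
            },
            {"@id": "data.csv", "@type": "File"},
        ],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(METADATA_FILE, json.dumps(metadata))
        archive.writestr("data.csv", "a,b\n1,2\n")
    return buffer.getvalue()


root = read_root_entity(build_example_crate())
print(root["name"], [part["@id"] for part in root["hasPart"]])
```

A real importer would still have to match the properties of the root entity against the metadata blocks configured in the target installation, which is the schema question raised above.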

As we are building our system on RO-Crate, we would be happy to work on or help with RO-Crate integration in Dataverse, but it would be good to know whether something has already been implemented in this regard and whether IQSS or the Dataverse community supports the idea at all. Also pinging @qqmyers, who was mentioned on Element as being interested in RO-Crate support as well.

[1] For the RO-Crate editor we are now investigating https://github.com/Arkisto-Platform/describo-online

poikilotherm commented 2 years ago

Today at the HMC Conference 2022 I learned about https://github.com/kit-data-manager/ro-crate-java

Looks like this might be helpful to deal with RO-Crates programmatically.

pdurbin commented 1 year ago

Today I learned that .eln uses RO-Crate:

pdurbin commented 1 year ago

Check out the RO-Crate file that is now downloadable from a .eln (zip) file:

DieuwertjeBloemen commented 1 year ago

Has anyone started working on the more general RO-Crate support for Dataverse already? Because it would be something we would like to work on but don't want to duplicate anyone's work.

beepsoft commented 1 year ago

The problem is how we define "general RO-Crate support".

We currently have a solution that works with the Dataverse metadata blocks (MDBs) as schemas, but RO-Crate suggests the use of Schema.org, while allowing the use of any other schema as well.

So, RO-Crate support in Dataverse can mean two things:

  1. Mapping current MDB values to some feasible Schema.org class/property
  2. Generating RO-Crate metadata using the MDBs as the schemas.

We have a solution for 2, where we use the required Schema.org Dataset for the Root Data Entity and File/Dataset for Data Entities, but use the properties and classes of the MDBs for Contextual Entities.

A wider RO-Crate audience would probably welcome solution 1, but that would be a lossy conversion from MDB data to RO-Crate, as not every MDB field/type can be mapped to a Schema.org value. Solution 2 supports import/export between Dataverse instances, but its output may not be processable by other RO-Crate tools, which expect Schema.org-based values.
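The trade-off between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the installation URL, the property URI pattern and the one-entry Schema.org mapping are hypothetical, the MDB properties are placed directly on the root entity for brevity (rather than on Contextual Entities), and other root properties that RO-Crate expects (such as datePublished and license) are left out.

```python
import json

RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
INSTALLATION = "https://dataverse.example.org"  # hypothetical installation URL
SCHEMA_ORG_MAP = {"title": "name"}  # hypothetical, deliberately partial mapping


def crate(context, root_properties):
    """Wrap root properties in the minimal RO-Crate 1.1 structure."""
    return {
        "@context": context,
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {"@id": "./", "@type": "Dataset", **root_properties},
        ],
    }


def export_schema_org(fields):
    """Option 1: keep only fields that have a Schema.org equivalent (lossy)."""
    mapped = {
        SCHEMA_ORG_MAP[name]: value
        for name, value in fields.items()
        if name in SCHEMA_ORG_MAP
    }
    return crate(RO_CRATE_CONTEXT, mapped)


def export_mdb(fields, block="citation"):
    """Option 2: keep every field and declare each as an extra context term."""
    terms = {name: f"{INSTALLATION}/schema/{block}/{name}" for name in fields}
    return crate([RO_CRATE_CONTEXT, terms], dict(fields))


fields = {"title": "My data", "depositor": "Doe, Jane"}
option1 = export_schema_org(fields)
option2 = export_mdb(fields)
print(json.dumps(option1["@graph"][1], indent=2))
print(json.dumps(option2["@graph"][1], indent=2))
```

In this sketch option 1 drops depositor because nothing maps it to Schema.org, while option 2 keeps both fields but labels them with installation-specific terms that other RO-Crate tools are unlikely to recognise.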

DieuwertjeBloemen commented 1 year ago

Good to know. This Monday I discussed RO-Crate and repository support with Stian Soiland-Reyes and Marc Portier from the RO-Crate initiative, and in particular how repository support should or could look. That is something we could figure out in this context. For example, uploading an RO-Crate with its accompanying files and extracting the structure and metadata from it could be interesting, but we would have to look into how to export it afterwards without losing data in the extraction. In other words, some preparatory brainstorming is needed to get the entire picture, but we would love to pick up part of that if no one else is working on it right now. Worst-case scenario: we conclude that it's not possible, but then at least we've given it a try.

pdurbin commented 1 year ago

> RO-Crate suggests the use of Schema.org,

Could there or should there be any overlap with Croissant, which also builds on Schema.org?

Update: see also:

DieuwertjeBloemen commented 1 year ago

I'll repeat the post from the Google Group here as well, so everyone involved on the RO-Crate work up until now is also aware of this development:

At KU Leuven, we just received some great news from the FAIR-IMPACT project: we’ve been selected as one of 15 teams for the “Enabling FAIR Signposting and RO-Crate for content/metadata discovery and consumption” support action, after applying for it in June as suggested by Philipp Conzett (thanks Philipp :) ).

We’re hoping to do some work on improving and expanding the integration of RO-Crate with Dataverse. Our first job will be to figure out what is possible (cf. Issue #8688), but hopefully for this short project, we’ll be able to get started on a useful addition to the Dataverse project. We’ll keep you posted if anything is finished or when we want or might need some input on what the needs/wants are of the community.

If you have any input already, you can leave it here or add on to the Issue in Github as more input and ideas are always welcome.

Kind regards,

KU Leuven RDR Team (Kris, Eryk, Özgür and Dieuwertje)

beepsoft commented 1 year ago

We were also selected for the FAIR-IMPACT support action. :-)

ptsefton commented 1 year ago

As well as Describo Online I think you should look at Crate-O - which is being developed by my team at the University of Queensland. https://github.com/Language-Research-Technology/crate-o: this is similar to Describo Online (and the other multiple variants of Describo) in some ways but solves some issues that we had with that project. Happy to discuss with you why we chose to develop a new tool and how it might fit with Dataverse.

marcolarosa commented 1 year ago

Up to date information about the Describo environment can be found @ https://describo.github.io/#/. Earlier implementations were proofs of concept that had many design issues and so are no longer supported.

@beepsoft and team have created an implementation of Describo RO Crate editing in Dataverse. A short intro is at https://describo.github.io/#/describo-users.

ptsefton commented 1 year ago

OK, good to hear that this work is already under way. Sounds like Crate-O is not needed here at the moment.

pdurbin commented 1 year ago

New PR by @beepsoft:

Great stuff! ❤️ 🚀 🎉

ptsefton commented 5 months ago

I'm preparing a talk for the Open Repositories conference in June - is there an update on the progress on this feature or other RO-Crate support in Dataverse, @beepsoft?

beepsoft commented 5 months ago

> I'm preparing a talk for the Open Repositories conference in June - is there an update on the progress on this feature or other RO-Crate support in Dataverse @beepsoft

It is still a pending PR, and there is no word yet on merging it or reworking it in other ways. However, it is a functional RO-Crate exporter implementation nonetheless.

DieuwertjeBloemen commented 5 months ago

> I'm preparing a talk for the Open Repositories conference in June - is there an update on the progress on this feature or other RO-Crate support in Dataverse @beepsoft

As an FYI: there's another Dataverse RO-Crate exporter PR available (slightly different use case, but also developed in the FAIR-IMPACT support call): https://github.com/gdcc/dataverse-exporters/pull/15 . I'll be at the open repositories conference and do a talk on our work on this exporter, so maybe I'll see you there.

stain commented 5 months ago

> As an FYI: there's another Dataverse RO-Crate exporter PR available (slightly different use case, but also developed in the FAIR-IMPACT support call): https://github.com/gdcc/dataverse-exporters/pull/15 . I'll be at the open repositories conference and do a talk on our work on this exporter, so maybe I'll see you there.

@DieuwertjeBloemen do you have the slides from this talk published somewhere? Would be great to link to from RO-Crate website!

What do we need to do to get this merged?

pdurbin commented 5 months ago

@stain I'm not sure but I just offered to help @okaradeniz at https://github.com/gdcc/dataverse-exporters/pull/15#issuecomment-2154803999 . Usually I ping @cmbz @scolapasta to advise about priorities.

@beepsoft there's also your pull request at #10086 that isn't marked as closing this issue (#8688). Should it? And how do you feel about your pull request vs. the one by @okaradeniz?

cmbz commented 5 months ago

Hi @stain the issue has already been prioritized. It's just waiting for the work currently in Sprint Ready to clear out so it can be added to the queue.

DieuwertjeBloemen commented 5 months ago

@stain The slides of the presentation are going to go on Zenodo as far as Open Repositories said. Once I see them appear there, I'll drop the doi here.

pdurbin commented 5 months ago

@beepsoft as I just mentioned at https://github.com/gdcc/dataverse-exporters/pull/15#issuecomment-2158481171 I just created a new dedicated repo for @okaradeniz at https://github.com/gdcc/exporter-ro-crate

Perhaps the two of you could collaborate on a single RO-Crate exporter?

Please let us know what you think! We can also talk it out on Zulip: https://dataverse.zulipchat.com/#narrow/stream/379673-dev/topic/RO-Crate/near/393962020

DieuwertjeBloemen commented 5 months ago

@pdurbin I'm not sure if merging the two is possible, as they have quite different set-ups and use cases. @beepsoft or @okaradeniz, correct me if I'm wrong and you do see this as possible. In my eyes, they're two different implementations of an RO-Crate exporter and can perhaps both be offered separately as external exporters so installations can choose based on what implementation makes most sense to them (what we might have to collaborate on is some explanation on the difference between the two so the choice is more transparent). Of course, that's up to @beepsoft to see if he has the time to set his work up like that as well.

beepsoft commented 5 months ago

> @beepsoft as I just mentioned at gdcc/dataverse-exporters#15 (comment) I just created a new dedicate repo for @okaradeniz at https://github.com/gdcc/exporter-ro-crate

I cannot access this repo; I get a "This repository is empty." error.

pdurbin commented 5 months ago

@beepsoft ah, sorry, yes https://github.com/gdcc/exporter-ro-crate is currently empty but the idea is that @okaradeniz will push these files to it: https://github.com/gdcc/dataverse-exporters/pull/15/files

okaradeniz commented 5 months ago

@pdurbin I just pushed the initial commit, thanks again for the repository. The exporter needs some more work before publishing on Maven, which I'll start next week as soon as I finish some other work.

I also agree @DieuwertjeBloemen that the exporters seem to differ in many aspects, at least in their current states, but we can work with @beepsoft on clarifying what they offer differently.

beepsoft commented 5 months ago

Thanks @okaradeniz!

The two main differences between your implementation and ours as I see:

I think your implementation is more flexible both in terms of being implemented as an exporter plugin and also with the dataverse2ro-crate.csv mapping approach.

I think a default behaviour of your implementation could be to work without dataverse2ro-crate.csv and use the MDB and dataset field names and URI-s as is in the RO-Crate. This would result in something similar to what we have in our implementation. And if someone needs customization, they could add a proper dataverse2ro-crate.csv.

My only concern with dataverse2ro-crate.csv is whether it is flexible enough for all mapping use cases. I haven't thought it through yet, but you have probably put more thought into what it is or isn't capable of.

okaradeniz commented 5 months ago

An additional major difference in my mind is how the two exporters approach the problem of mapping Dataverse metadata blocks to the properties in the ro-crate-metadata.json.

The one offered in @beepsoft's PR follows this part of the ro-crate specification:

> However, as RO-Crate uses the Linked Data principles, adopters of RO-Crate are free to supplement RO-Crate using Schema.org metadata and/or assertions using other Linked Data vocabularies.

So it includes the dataverse installation as a vocabulary in order to be able to use the metadata blocks in the resulting json:

[image: excerpt of the resulting JSON]

A disadvantage resulting from this is that fields that are perfectly mappable to Schema.org-based RO-Crate properties are taken directly as they appear in the Dataverse metadata blocks (such as title instead of name) by adding dataverse.org or the Dataverse installation to the @context.

We took a different approach, by letting installations choose how they want the metadata to map to the properties in the export. The advantage is that installations have both flexibility and compatibility with the ro-crate specifications, and compliance with Schema.org. The disadvantage is that it needs more work from the installation if they want to customize it.

beepsoft commented 5 months ago

I completely agree with you that mapping to the Schema.org vocabulary is really useful. I was just wondering whether it would be possible to have a default that falls back to the metadata block definitions (names, URIs) if no dataverse2ro-crate.csv mapping is provided. This way, you could customize your RO-Crate the way you want without having to bother with other metadata blocks that are, say, local to your installation or already have URIs from well-known vocabularies (e.g. Dublin Core). All this could somehow be made configurable via dataverse2ro-crate.csv as well.

okaradeniz commented 5 months ago

We will have a default csv file that comes with the exporter, with mappings from the default dataverse metadata blocks to the schema.org based properties in the ro-crate specification. That way, the exporter can support the ro-crate specs and comply with schema.org as much as possible out-of-the-box. Your suggestion would definitely help the exporter to cover the default and custom metadata, but it also would hardcode a default behavior that would (in many cases I think) result in exports that aren’t compliant with the ro-crate specs.

beepsoft commented 5 months ago

What RO-Crate compliance problems do you see here?

As I understand it, whatever can be mapped to Schema.org will be mapped, and the rest could use @context URIs from other vocabularies, which is allowed and supported by JSON-LD and therefore by RO-Crate. It is a separate issue that these custom or not well-known properties may not be automatically interpreted or imported by other RO-Crate systems, but we could still make all Dataverse data available in the RO-Crate JSON for those who can or are willing to handle them.

okaradeniz commented 5 months ago

So it would not contain the installation's URL as a vocabulary, perhaps by filtering out metadata that doesn't comply with a known vocabulary? In that case it would comply with RO-Crate because of the statement in the specification that allows other properties and vocabularies when necessary, although it would not be readable by other systems, as you pointed out.

I think that a script can be included, which installations can use to produce a new csv from their metadata blocks to replace the default one if they want. This could let installations cover all their metadata if they’re ok with the interoperability trade-off, without having it as a default behavior.

qqmyers commented 5 months ago

FWIW: Dataverse uses custom URIs (as in the OAI-ORE export file) when the metadata block has no mapping defined for a field. If there are valid mappings available, they could/should be added to the blocks themselves. For the citation block, we tried to map as many fields as possible (sometimes rejecting a mapping because the range of an allowed mapping was wrong rather than the concept) and we didn't necessarily prioritize Schema.org over other options (e.g. Dublin Core), but for other blocks, I don't know that there's been much effort to add mappings at all yet (maybe the software one). In any case, if there are non-controversial mappings that can replace our custom terms, or Schema.org has a better replacement for other mappings we currently use, I'd encourage adding them to the blocks and not just keeping them in one exporter. (Conversely, if a mapping depends on having curators ensure that some field is used in the right way so that the range of values fits the schema, or if the mapping is lossy in some way (multiple fields mapping to one term), keeping the mapping in the exporter/making it configurable is probably the better approach.) (The same applies to values in the dvcore namespace - if there are good mappings to external vocabs, we should use them.)

beepsoft commented 4 months ago

> I think that a script can be included, which installations can use to produce a new csv from their metadata blocks to replace the default one if they want. This could let installations cover all their metadata if they’re ok with the interoperability trade-off, without having it as a default behavior.

Such a script would be great, but my idea is that this could be done automatically by the exporter on the fly as well. Having this done by the exporter automatically would ensure that if a new MDB is added or anything is changed with MDBs, the changes are reflected right away in the RO-Crate.

The exporter could have three modes of operation (the mode could be set via dataverse2ro-crate.csv, some other configuration file, or SettingsServiceBean if it were exposed via the Exporter interface):

  1. Convert everything from a Dataset as-is using the field names and URIs assigned to dataset field types.
  2. Option 1, but with dataverse2ro-crate.csv taking precedence, i.e., overriding what the automatic solution would do.
  3. Conversion based on only dataverse2ro-crate.csv, no automation whatsoever.
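The three modes can be sketched as follows. This is a hypothetical illustration, not code from either exporter: the two-column CSV layout (field, property), the Mode names and the second field URI are assumptions made for this example, and the real dataverse2ro-crate.csv may be structured differently.

```python
import csv
import io
from enum import Enum


class Mode(Enum):
    AUTO = 1           # field names and URIs taken as-is from the dataset fields
    AUTO_WITH_CSV = 2  # automatic, but rows in the CSV take precedence
    CSV_ONLY = 3       # only fields listed in the CSV are exported


def load_overrides(csv_text):
    """Parse an assumed two-column CSV: Dataverse field -> RO-Crate property."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["field"]: row["property"] for row in reader}


def resolve(field_uris, overrides, mode):
    """Return {field name: property name or URI} for the chosen mode."""
    if mode is Mode.AUTO:
        return dict(field_uris)
    if mode is Mode.AUTO_WITH_CSV:
        return {**field_uris, **overrides}
    # CSV_ONLY: ignore fields that have no row in the CSV.
    return {f: p for f, p in overrides.items() if f in field_uris}


# Illustrative field URIs; the second one is a made-up installation URI.
FIELD_URIS = {
    "title": "http://purl.org/dc/terms/title",
    "depositor": "https://dataverse.example.org/schema/citation/depositor",
}
overrides = load_overrides("field,property\ntitle,name\n")

for mode in Mode:
    print(mode.name, resolve(FIELD_URIS, overrides, mode))
```

With these inputs, mode 1 keeps both URIs untouched, mode 2 replaces only the title mapping with name, and mode 3 exports title alone.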

What do you think, @okaradeniz?

okaradeniz commented 4 months ago

Thanks @beepsoft. We wanted to give installations flexibility over exports with the CSV, but we think that changing metadata in each export would undermine interoperability. For now we’ll keep on working on the current approach, fixing some issues and improving the default CSV.

DieuwertjeBloemen commented 4 months ago

> > As an FYI: there's another Dataverse RO-Crate exporter PR available (slightly different use case, but also developed in the FAIR-IMPACT support call): https://github.com/gdcc/dataverse-exporters/pull/15 . I'll be at the open repositories conference and do a talk on our work on this exporter, so maybe I'll see you there.

> @DieuwertjeBloemen do you have the slides from this talk published somewhere? Would be great to link to from RO-Crate website!

> What do we need to do to get this merged?

Hi @stain, the Open Repositories powerpoint presentations have been posted in Zenodo, and ours is available at: doi:10.5281/zenodo.12548334

ErykKul commented 4 months ago

@beepsoft I have tried porting this exporter to the Dataverse Transformer Exporter: ARP RO-Crate example. To test it, you need to copy that example folder, containing the config.json and transformer.py files, together with the JAR file to your exporters dir. See also README.md. Can you try it out? All feedback is appreciated!

beepsoft commented 3 months ago

Thanks @ErykKul, this looks cool!

One thing that is not clear to me is when transformer.py is run: at runtime, when generating the RO-Crate, or beforehand, to prepare a transformer.json or similar? If it is run at runtime, then we have to have Python available in the same environment where Dataverse is running (Docker, in my case). What dependencies should be installed in this case? Can you please give some more details about this in the README.md?

ErykKul commented 3 months ago

@beepsoft It is run at runtime. That is why you need the JAR file of the exporter transformer: https://repo1.maven.org/maven2/io/gdcc/export/exporter-transformer/1.0.9/exporter-transformer-1.0.9-jar-with-dependencies.jar It contains JavaScript and Python scripting engines (that is why it is named jar-with-dependencies). When the JAR is loaded from the exporters directory (https://guides.dataverse.org/en/latest/installation/config.html#dataverse-spi-exporters-directory), it searches for the exporter folders in the same directory. An exporter is a combination of a config.json and a transformer file (transformer.json, transformer.xsl or transformer.py). I know it sounds strange, but there is really nothing more to it. Just as the README.md file says: download the JAR and the two transformer files (config.json and transformer.py). Place the transformer files in a new folder inside the exporters directory, and place the JAR in the exporters directory. Restart the server, and the exporter is there. You can add up to 100 exporters this way: you need only that one JAR, and for each exporter you make a new folder with the config and transformer files in it. You can test all of the provided examples that way, all at once; all of the exporters become available directly after a restart of the server.
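The layout described above can be checked before restarting the server with a small script along these lines. It is only a sketch based on the description in this comment, not the exporter's own discovery logic: the function names and the arp-ro-crate folder name are made up, and the demo builds a throwaway directory instead of touching a real exporters directory.

```python
import tempfile
from pathlib import Path

TRANSFORMER_NAMES = {"transformer.json", "transformer.xsl", "transformer.py"}


def find_exporter_folders(exporters_dir: Path) -> list[str]:
    """Return sub-folders that hold a config.json plus a transformer file."""
    found = []
    for folder in sorted(p for p in exporters_dir.iterdir() if p.is_dir()):
        names = {child.name for child in folder.iterdir()}
        if "config.json" in names and names & TRANSFORMER_NAMES:
            found.append(folder.name)
    return found


def has_transformer_jar(exporters_dir: Path) -> bool:
    """Check that an exporter-transformer JAR sits next to the folders."""
    return any(exporters_dir.glob("exporter-transformer-*.jar"))


# Demo on a temporary directory that mimics the described layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "exporter-transformer-1.0.9-jar-with-dependencies.jar").touch()
    example = root / "arp-ro-crate"  # hypothetical folder name
    example.mkdir()
    (example / "config.json").write_text("{}")
    (example / "transformer.py").write_text("# transformer script\n")
    (root / "incomplete").mkdir()  # no config.json, so it is not reported
    folders = find_exporter_folders(root)
    jar_present = has_transformer_jar(root)

print(folders, jar_present)
```

Pointing find_exporter_folders at the real exporters directory would list the folders that match this description; whether the server then loads them still depends on the contents of each config.json.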

beepsoft commented 3 months ago

@ErykKul, OK, I get it now, thanks!

beepsoft commented 3 months ago

@ErykKul, it works great!

One inconvenience I found is that I cannot just edit transformer.py, remove export_example_arp_ro_crate.cached and re-export the dataset with the updated transformer.py: its code seems to be cached by Jython or the exporter, and the changes are applied only when the whole server is restarted. Or am I doing something wrong here, and should it be more dynamic than that?

ErykKul commented 3 months ago

@beepsoft Yes, this is annoying... Only making the Python script more dynamic by reloading it with every export would still be confusing, since the config.json cannot be reloaded that way. Also, adding and removing exporters cannot be made dynamic. I think that a new API call would be nice: we could make a PR for it, calling it would then reload all exporters, without the need of restarting the server. Meanwhile, the best workaround is to play with the script in offline mode, by running a unit test:

mvn test -Dtest="TransformerExporterTest#testArpPythonScript" 

This test runs a transformation of an example dataset I plucked from your Dataverse: https://repo.researchdata.hu/dataset.xhtml?persistentId=hdl:21.15109/ARP/PCKHRH (it runs the Python script that you were trying to edit; it is the best place to try it out: tweak it there and, when ready, deploy it on the server).

beepsoft commented 3 months ago

Thanks @ErykKul for the explanations! Raising an issue or submitting a PR about exporter reloading would be great! It would also be great to have a systematic way to regenerate exports when an exporter's output changes. For now, we have to manually remove the .cached files to regenerate them.

ErykKul commented 3 months ago

You can re-export a dataset, re-export all of them, or invalidate the cache by resetting the timestamps: https://guides.dataverse.org/en/latest/admin/metadataexport.html#batch-exports-through-the-api
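For reference, the three operations can be scripted roughly like this. The endpoint paths are written from my reading of the linked guide and should be verified there before use; the base URL is an assumption (the admin API is usually only reachable from localhost), and the helper names are made up. The sketch only builds the URLs and does not send any requests.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # assumption: admin API reachable locally


def reexport_all_url(base=BASE_URL):
    """URL that re-exports every published dataset in all formats."""
    return f"{base}/api/admin/metadata/reExportAll"


def clear_timestamps_url(base=BASE_URL):
    """URL that resets the export timestamps, invalidating cached exports."""
    return f"{base}/api/admin/metadata/clearExportTimestamps"


def reexport_dataset_url(persistent_id, base=BASE_URL):
    """URL that re-exports a single dataset identified by its persistent ID."""
    query = urlencode({"persistentId": persistent_id})
    return f"{base}/api/admin/metadata/:persistentId/reExportDataset?{query}"


single = reexport_dataset_url("hdl:21.15109/ARP/PCKHRH")
print(reexport_all_url())
print(clear_timestamps_url())
print(single)
```

Each URL can then be fetched with curl or urllib.request.urlopen from a host that is allowed to call the admin API.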

beepsoft commented 3 months ago

Oh, great, thanks, I didn't know about that!

ErykKul commented 3 months ago

If you decide to improve on the ARP RO-Crate example, please consider a PR at the Transformer Exporter repo. It could also get its own repository and (or) a mention at the https://github.com/gdcc/dataverse-exporters.

pdurbin commented 2 months ago

I spent a little time playing with the three RO-Crate exporters created by @ErykKul and @okaradeniz and wrote up a small addition to the guides:

I'm encouraging people to try the RO-Crate exporters. And I want people searching for "RO-Crate" in the guides to be able to find something. Feedback is welcome, of course!

I'm not sure where we're going with RO-Crate support in general. The exporters are a good first step!