pkiraly opened this issue 3 years ago
Hi @pkiraly, it's a well-known use case. We already developed such an (external) webservice in 2017 to archive datasets in our Trusted Digital Repository (DANS EASY). However, our workflow is a bit different: we first publish the dataset in Dataverse, then use its metadata and files to create a BagIt package, and archive it afterwards. Please take a look at the slides here: https://www.slideshare.net/vty/cessda-persistent-identifiers
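As a minimal illustration of the first step of such a webservice, here is a sketch in plain JDK Java that fetches a published dataset's metadata and file list from the Dataverse native API before packaging it as a Bag (the base URL and DOI below are placeholders):

```java
// Sketch: fetch a dataset's metadata (including its file list) from the
// Dataverse native API, as the input for building a BagIt package.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DatasetFetcher {

    public static String fetchDatasetJson(String baseUrl, String doi) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/datasets/:persistentId?persistentId=" + doi))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // JSON describing the dataset version and its files
    }

    public static void main(String[] args) throws Exception {
        // Placeholder server and DOI, for illustration only.
        System.out.println(fetchDatasetJson("https://demo.dataverse.org", "doi:10.5072/FK2/EXAMPLE"));
    }
}
```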
Regarding your possible implementation, I'm pretty sure developing webservices is the way to go. At the moment Dataverse looks too monolithic, and we have to prepare it for the future using modern technologies and concepts.
(I typed this response this morning and I got sidetracked, apologies :))
I think we'd want to utilize the workflows system (https://guides.dataverse.org/en/latest/developers/workflows.html) to trigger an event to publish into the other system, and I don't think we'd want to add a flow in the Dataverse UI for this. I'd be concerned about communicating failure cases and scalability.
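For example, a hedged sketch of such a post-publish workflow, adapted from the http/sr example in the workflows guide (the target URL is a placeholder for the external service); it would be registered via POST to /api/admin/workflows and made the default with PUT /api/admin/workflows/default/PostPublishDataset:

```json
{
  "name": "Publish to external repository",
  "steps": [
    {
      "provider": ":internal",
      "stepType": "http/sr",
      "parameters": {
        "url": "https://archive.example.org/archive/${invocationId}",
        "method": "POST",
        "contentType": "text/plain",
        "body": "archive ${dataset.id} as ${dataset.displayName}",
        "expectedResponse": "OK.*"
      }
    }
  ]
}
```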
This might be a good chance to revive discussing #7050. You already could extend Dataverse with a workflow, but this is not tied to the UI IIRC. A way to inject UI components for workflows from plugins would be great IMHO. Less forks, more extensibility.
Dear @djbrooke, @4tikhonov and @poikilotherm,
thanks a lot for your feedback and suggestions! I totally agree that Dataverse itself should not be extended but should work with plugins wherever possible.
I checked the suggested workflow documentation and the example scripts in the scripts/api/data/workflows directory, and my feeling is that it solves only one part of the feature request, i.e. the communication with external services. However, an important part of our requirement is that (1) the user should decide (2) on an ad hoc basis whether or not s/he would like to publish the dataset to an external service. I do not see a possibility to set a condition parameter in the workflow which governs whether the step should be executed or not.
To use the workflow for this requirement, the workflow engine would need to support conditional steps.
Examples of such conditional step configurations:

Example 1: direct entry of conditions, i.e. archive the dataset only if the subject is "Arts and Humanities", the user is affiliated with a Humanities organisation, and it is a new major version:
```json
{
  "provider": ":internal",
  "stepType": "http/sr",
  "parameters": {
    ...
    "conditions": [
      "${dataset.subject}=[Arts and Humanities]",
      "${user.affiliation}=[DARIAH, Department of Humanities]",
      "${minorVersion}=0"
    ]
  }
}
```
Example 2: the workflow retrieves and evaluates the user's conditions, which have been set on the user's page or via the API:
```json
{
  "provider": ":internal",
  "stepType": "http/sr",
  "parameters": {
    ...
    "conditions": ["${user.externalArchivingConditions}"]
  }
}
```
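To make this concrete, here is a hypothetical sketch of how a workflow step could evaluate such a `conditions` parameter before executing (direct-entry style, as in example 1). Nothing like this exists in Dataverse yet; the class name and parsing rules are invented for illustration:

```java
// Hypothetical sketch: evaluate the proposed "conditions" parameter of a
// workflow step. The parsing rules below are invented for illustration.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConditionEvaluator {

    // Matches e.g. "${dataset.subject}=[Arts and Humanities]" or "${minorVersion}=0"
    private static final Pattern CONDITION =
            Pattern.compile("\\$\\{([^}]+)\\}=\\[?([^\\]]*)\\]?");

    /** Returns true only if every condition holds against the resolved context. */
    public static boolean allMatch(List<String> conditions, Map<String, String> context) {
        for (String condition : conditions) {
            Matcher m = CONDITION.matcher(condition);
            if (!m.matches()) {
                return false; // unparseable condition: fail closed, skip the step
            }
            String actual = context.get(m.group(1));       // e.g. value of "dataset.subject"
            String[] accepted = m.group(2).split(",\\s*"); // bracketed values act as alternatives
            if (Arrays.stream(accepted).noneMatch(v -> v.equals(actual))) {
                return false;
            }
        }
        return true;
    }
}
```

A step would then execute only when `allMatch(...)` returns true against a context holding resolved values such as `dataset.subject` and `user.affiliation`.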
A question: are you aware of any existing open source plugin for Dataverse I can check?
@pkiraly maybe there's a better video or screenshots @qqmyers can point us to but there's now some UI for curators to see the status of publishing/archiving to another repository. The screenshot below is from "Final Demo - Full Final demo of automatic ingests of Dataverse exports into DRS, including successful, failed, and message error scenarios" at https://github.com/harvard-lts/awesome-lts#2022-06-29-final-demo via this pull request that was merged into 5.12 (just released):
It seems highly related at least! I think it might use a command instead of a workflow though. (No, I can't think of any plugins you can check.)
FWIW: Automation is via workflow (i.e. configured to post-publish), but the workflow step calls an archiving command. Those are dynamically loaded, so dropping a new one into the exploded WAR should work. (We haven't dealt with a separate class loader yet.)
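For anyone trying this: the archiving command class is selected via a database setting, along these lines (class and setting names as in the Bag archiving section of the guides; double-check against your version):

```
curl -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand" \
  http://localhost:8080/api/admin/settings/:ArchiverClassName
```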
We have a specific feature request which I think would be worth solving with a general solution.
The original request: if a user creates an Arts and Humanities dataset, s/he should be able to publish it to an external repository called DARIAH Repository as well.
Following the slogan "lots of copies keep your stuff safe", I believe it would be a valid and supportable use case to create copies of the dataset in external repositories.
Here is a suggestion for the user interface:
The backend and the workflow would look something like this:
- `getName()`: returns the name of the repository
- `getUrl()`: returns the URL of the repository's starting page
- `publish(DatasetVersion datasetVersion)`: the main method, which publishes the dataset in the repository
- `isActive()`: returns whether the repository is turned on in the current Dataverse instance (by default all are turned off; the site admin can activate them via configuration)

Here are some code snippets to give more details:
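First, a minimal sketch of the plugin interface itself; the name ExternalRepository is invented here, while DatasetVersion is the existing Dataverse entity class:

```java
// Sketch of the proposed extension point; the interface name is hypothetical.
import edu.harvard.iq.dataverse.DatasetVersion;

public interface ExternalRepository {

    /** Human-readable name of the repository, e.g. "DARIAH Repository". */
    String getName();

    /** URL of the repository's starting page. */
    String getUrl();

    /** Whether the site admin has activated this repository; all are off by default. */
    boolean isActive();

    /** The main method: publishes the given dataset version in the external repository. */
    void publish(DatasetVersion datasetVersion);
}
```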
Mapping of subjects and repositories:
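A hypothetical sketch, assuming the mapping is kept as simple in-memory configuration (the subject and repository names are illustrative):

```java
// Sketch: which external repositories are offered for which dataset subjects.
// In a real implementation this would come from configuration, not a constant.
import java.util.List;
import java.util.Map;

public class SubjectRepositoryMapping {

    private static final Map<String, List<String>> REPOSITORIES_BY_SUBJECT = Map.of(
            "Arts and Humanities", List.of("DARIAH Repository")
    );

    /** Returns the repository names offered for a subject, or an empty list. */
    public static List<String> repositoriesFor(String subject) {
        return REPOSITORIES_BY_SUBJECT.getOrDefault(subject, List.of());
    }
}
```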
Get the list of active repositories:
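A hedged sketch, assuming plugins are discovered via Java's ServiceLoader (one possible mechanism; the registry class is invented):

```java
// Sketch: collect the repositories the site admin has activated.
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ExternalRepositoryRegistry {

    /** Returns all discovered repositories whose isActive() is true. */
    public static List<ExternalRepository> getActiveRepositories() {
        List<ExternalRepository> active = new ArrayList<>();
        for (ExternalRepository repository : ServiceLoader.load(ExternalRepository.class)) {
            if (repository.isActive()) {
                active.add(repository);
            }
        }
        return active;
    }
}
```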
@pdurbin @qqmyers @poikilotherm @djbrooke @4tikhonov I am interested in your opinions. I have some initial code to prove the concept for myself, but for a PR it needs lots of work. I would invest this time only if the idea meets the community's approval. Otherwise I will create an independent webservice specific to the DARIAH repository.