IQSS / dataverse

Open source research data repository software
http://dataverse.org

Publishing dataset in an external repository #8001

Open pkiraly opened 3 years ago

pkiraly commented 3 years ago

We have a specific feature request which, I think, would be worth solving with a general solution.

The original request: if a user creates an Arts and Humanities dataset, s/he should be able to publish it on an external repository called the DARIAH Repository as well.

Following the slogan "Lots of Copies Keep Stuff Safe", I believe it would be a valid and supportable use case to create copies of the dataset in external repositories.

Here is a suggestion for the user interface:

[screenshot: external-repository]

The backend and the workflow would look something like this. Here are some code snippets to show more details:

Mapping of subjects and repositories:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public enum Subject {
  SOCIAL_SCIENCES("Social Sciences", GesisRepository.getInstance()), // a social science repo
  MEDICINE("Medicine, Health and Life Sciences"),
  EARTH("Earth and Environmental Sciences"),
  AGRICULTURE("Agricultural Sciences"),
  OTHER("Other"),
  COMPUTER("Computer and Information Science"),
  HUMANITIES("Arts and Humanities", DariahRepository.getInstance()), // a Digital Humanities repo
  ASTRONOMY("Astronomy and Astrophysics"),
  BUSINESS("Business and Management"),
  LAW("Law"),
  ENGINEERING("Engineering"),
  MATHEMATICS("Mathematical Sciences"),
  CHEMISTRY("Chemistry"),
  PHYSICS("Physics")
  ;

  private final String name;
  // initialize with general repositories, which could be available for all subjects;
  // wrapped in a mutable ArrayList so that subject-specific repositories can be added
  // (List.of() alone would be immutable and make addAll() throw)
  private final List<ExternalRepository> repositories =
      new ArrayList<>(List.of(HarvardDataverse.getInstance(), DataverseNo.getInstance()));

  Subject(String name) {
    this.name = name;
  }

  Subject(String name, ExternalRepository... repositories) {
    this(name);
    this.repositories.addAll(Arrays.asList(repositories));
  }

  public static Subject byName(String name) {
    for (Subject subject : values()) {
      if (subject.name.equals(name)) {
        return subject;
      }
    }
    return null;
  }

  public String getName() {
    return name;
  }

  public List<ExternalRepository> getRepositories() {
    return repositories;
  }
}

Getting the list of active repositories:

public List<ExternalRepository> getActiveExternalRepositories() {
    List<ExternalRepository> repositories = new ArrayList<>();
    for (String name : getDatasetSubjects()) {
        Subject subject = Subject.byName(name);
        if (subject == null) {
            continue;
        }
        for (ExternalRepository repository : subject.getRepositories()) {
            if (repository.isActive()) {
                repositories.add(repository);
            }
        }
    }
    return repositories;
}
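
For completeness, here is a minimal sketch of the ExternalRepository interface the snippets above assume. The interface and the repository singletons (DariahRepository, GesisRepository, etc.) are part of the proof-of-concept idea, not existing Dataverse code; the method set below is purely illustrative:

import java.io.IOException;

// Illustrative interface assumed by the snippets above; every repository
// (DariahRepository, GesisRepository, HarvardDataverse, ...) would be a
// singleton implementing it.
public interface ExternalRepository {

  // Human-readable name shown in the publishing UI.
  String getDisplayName();

  // True if the repository is configured and currently reachable.
  boolean isActive();

  // Push the metadata and files of a published dataset version to the
  // external repository.
  void publish(DatasetVersion version) throws IOException;
}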

@pdurbin @qqmyers @poikilotherm @djbrooke @4tikhonov I am interested in your opinion. I have some initial code to prove the concept for myself, but for a PR it needs lots of work. I would invest this time only if the idea meets with the community's approval. Otherwise I will create an independent webservice specific to the DARIAH repository.

4tikhonov commented 3 years ago

Hi @pkiraly, it's a well-known use case. We already developed such an (external) webservice in 2017 to archive datasets in our Trusted Digital Repository (DANS EASY). However, our workflow is a bit different: first we publish the dataset in Dataverse, then use its metadata and files to create a BagIt package, and archive it afterwards. Please take a look at the slides here: https://www.slideshare.net/vty/cessda-persistent-identifiers

Regarding your possible implementation, I'm pretty sure the development of webservices is the way to go. At the moment Dataverse looks too monolithic, and we have to prepare it for the future using modern technologies and concepts.

djbrooke commented 3 years ago

(I typed this response this morning and I got sidetracked, apologies :))

I think we'd want to utilize the workflows system (https://guides.dataverse.org/en/latest/developers/workflows.html) to trigger an event to publish into the other system, and I don't think we'd want to add a flow in the Dataverse UI for this. I'd be concerned about communicating failure cases and scalability.

poikilotherm commented 3 years ago

This might be a good chance to revive the discussion in #7050. You can already extend Dataverse with a workflow, but this is not tied to the UI, IIRC. A way to inject UI components for workflows from plugins would be great IMHO. Fewer forks, more extensibility.

pkiraly commented 3 years ago

Dear @djbrooke, @4tikhonov and @poikilotherm,

Thanks a lot for your feedback and suggestions! I totally agree with the suggestion that Dataverse should not be extended but should work with plugins wherever possible.

I checked the suggested workflow documentation and the example scripts in the scripts/api/data/workflows directory, and my feeling is that it solves only one part of the feature request, i.e. the communication with external services. However, an important part of our requirement is that (1) the user should decide (2) on an ad hoc basis whether or not s/he would like to publish the dataset to an external service. I do not see a possibility to set a condition parameter in the workflow which governs whether a step should be executed or not.

To use the workflow system for this requirement, the following improvement would be needed: conditional execution of steps.

Examples of such conditional step configurations:

Example 1: direct entry of conditions, i.e. archive the dataset only if the subject is "Arts and Humanities", the user is affiliated with a Humanities organisation, and it is a new major version:

{
  "provider":":internal",
  "stepType":"http/sr",
  "parameters": {
    ...
    "conditions": [
      "${dataset.subject}=[Arts and Humanities]",
      "${user.affiliation}=[DARRIAH, Department of Humanities]",
      "${minorVersion}=0"
    ]
  }
}

Example 2: the workflow retrieves and evaluates the user's conditions, which have been set on the user's page or via API:

{
  "provider":":internal",
  "stepType":"http/sr",
  "parameters": {
    ...
    "conditions": ["${user.externalArchivingConditions}"]
  }
}
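
To make the proposal more concrete, evaluating such condition strings could work roughly like this. This is a sketch only; the ${...} syntax, the variable map, and the class name are all assumptions:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of evaluating conditions such as "${dataset.subject}=[Arts and Humanities]"
// against a map of resolved workflow variables, e.g.
// {"dataset.subject" -> "Arts and Humanities", "minorVersion" -> "0"}.
public class ConditionEvaluator {

  private final Map<String, String> variables;

  public ConditionEvaluator(Map<String, String> variables) {
    this.variables = variables;
  }

  // A step is executed only if every configured condition holds.
  public boolean allHold(List<String> conditions) {
    return conditions.stream().allMatch(this::holds);
  }

  private boolean holds(String condition) {
    String[] parts = condition.split("=", 2);
    if (parts.length != 2) {
      return false;
    }
    String value = resolve(parts[0].trim());
    String expected = parts[1].trim();
    if (expected.startsWith("[") && expected.endsWith("]")) {
      // "[A, B]" means: the resolved value must be one of A or B
      List<String> allowed = Arrays.stream(expected.substring(1, expected.length() - 1).split(","))
          .map(String::trim)
          .collect(Collectors.toList());
      return allowed.contains(value);
    }
    return expected.equals(value);
  }

  // Replace "${name}" with the value of the variable called "name".
  private String resolve(String expression) {
    if (expression.startsWith("${") && expression.endsWith("}")) {
      return variables.getOrDefault(expression.substring(2, expression.length() - 1), "");
    }
    return expression;
  }
}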

A question: are you aware of any existing open source plugins for Dataverse that I could check?

pdurbin commented 2 years ago

@pkiraly maybe there's a better video or screenshots @qqmyers can point us to, but there's now some UI for curators to see the status of publishing/archiving to another repository. The screenshot below is from "Final Demo - Full Final demo of automatic ingests of Dataverse exports into DRS, including successful, failed, and message error scenarios" at https://github.com/harvard-lts/awesome-lts#2022-06-29-final-demo via this pull request that was merged into 5.12 (just released):

[screenshot: curator-facing archiving status UI]

It seems highly related at least! I think it might use a command instead of a workflow though. (No, I can't think of any plugins you can check.)

qqmyers commented 2 years ago

FWIW: Automation is via workflow (i.e. configured to run post-publish), but the workflow step calls an archiving command. Those are dynamically loaded, so dropping a new one in the exploded war should work. (We haven't dealt with a separate class loader yet.)
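
For anyone exploring this route, a rough skeleton of such a command might look like the following. The superclass, WorkflowStepResult, Failure, and the method signature are assumptions modeled on the built-in SubmitToArchiveCommand implementations, not a guaranteed API:

// Rough skeleton of a drop-in archiver command; all Dataverse class names used
// here are assumptions modeled on the built-in archivers (imports from the
// Dataverse codebase omitted).
import java.util.Map;

public class DariahSubmitToArchiveCommand extends AbstractSubmitToArchiveCommand {

  public DariahSubmitToArchiveCommand(DataverseRequest request, DatasetVersion version) {
    super(request, version);
  }

  @Override
  public WorkflowStepResult performArchiveSubmission(DatasetVersion version, ApiToken token,
                                                     Map<String, String> requestedSettings) {
    try {
      // Build a BagIt package from the dataset version and POST it to the
      // DARIAH repository endpoint (omitted here).
      return WorkflowStepResult.OK;
    } catch (Exception e) {
      return new Failure("DARIAH archiving failed: " + e.getMessage());
    }
  }
}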