IQSS / dataverse

Open source research data repository software
http://dataverse.org

Harvester exporter uses short version of XML #8778

Open lubitchv opened 2 years ago

lubitchv commented 2 years ago

What steps does it take to reproduce the issue?

It occurs when harvesting Dataverse via OAI, for example: curl "http://localhost:8080/oai?metadataPrefix=oai_ddi&verb=GetRecord&identifier=$PERSISTENT_ID"
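For anyone scripting this request rather than typing it into a shell, here is a minimal sketch of building the same GetRecord URL in Java. The class and method names are hypothetical (not part of Dataverse), and the DOI is a made-up placeholder. The point is only that the persistent identifier should be URL-encoded, since it usually contains `:` and `/`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Dataverse code base.
final class OaiGetRecordUrl {

    // Builds an OAI-PMH GetRecord request URL, percent-encoding the parameter values.
    static String build(String baseUrl, String metadataPrefix, String identifier) {
        return baseUrl
                + "?verb=GetRecord"
                + "&metadataPrefix=" + URLEncoder.encode(metadataPrefix, StandardCharsets.UTF_8)
                + "&identifier=" + URLEncoder.encode(identifier, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "doi:10.5072/FK2/EXAMPLE" is a placeholder identifier, not a real dataset.
        System.out.println(build("http://localhost:8080/oai", "oai_ddi", "doi:10.5072/FK2/EXAMPLE"));
    }
}
```

The resulting string can be passed to curl inside quotes, which also keeps the shell from interpreting the `&` characters.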

Harvesting through OAI_DDIExporter only returns the short version of the DDI, without the fileDscr and dataDscr sections. It uses datasetJson2ddi(JsonObject datasetDtoAsJson, OutputStream outputStream) with dtoddi(DatasetDTO datasetDto, OutputStream outputStream) instead of datasetJson2ddi(JsonObject datasetDtoAsJson, DatasetVersion version, OutputStream outputStream). The solution may be to replace the two-parameter datasetJson2ddi call with the three-parameter one.

The expected result is the full XML with the fileDscr and dataDscr sections: the same XML one gets through the Dataverse metadata export API.

pdurbin commented 2 years ago

I wonder if this is related to this issue:

That is, what would the client do with the fileDscr and dataDscr sections? Create search cards for harvested files? Offhand I'm not sure what dataDscr is used for.

lubitchv commented 2 years ago

dataDscr has information about variables, the information that can be edited with the Data Curation Tool (DCT). We are transitioning from Nesstar to Dataverse. Nesstar was very good at editing variable metadata and even preserving it in SPSS. We have a search application that used Nesstar as a repository. The XML metadata was harvested from there into a MarkLogic database that was connected to our search website. It was used for better data discovery.

We want to use a similar approach with Dataverse: transfer data from Nesstar to Dataverse, then harvest from Dataverse into MarkLogic so that we can keep using our application for search. Search in MarkLogic seems to be faster, and users are used to it. It also includes search by variables. We have an old version of Data Explorer embedded in that application. We want to remove it and link to the Dataverse Data Explorer instead, but for this we need the fileId from Dataverse, which is in the fileDscr section. So Dataverse would be the repository for data. The search application would be a front end to Dataverse for data discovery by regular users, who would not see Dataverse itself. Dataverse would be used by librarians to upload datasets and curate metadata and variable metadata with DCT.

pdurbin commented 2 years ago

@lubitchv thanks, it's all very clear now!

Are you thinking about making a pull request for this?

lubitchv commented 2 years ago

Yes, I will make a pull request.

landreev commented 2 years ago

I feel bad having missed this issue and the discussion above entirely. But I just saw the PR and I'm really not comfortable with this solution, that simply makes the "OAI_DDI" format exactly the same as the "DDI". If nothing else, there wouldn't be any point in maintaining these 2 separate formats if they were identical. I would definitely want to keep the current "DDI light" (no Data sections) format around. The data sections can make these exported DDI records huge, for datasets with large numbers of tabular files, and therefore expensive to parse. Many, or most harvesting clients do not need the variable information at all, so I would prefer not to make harvesting more expensive for them.

If there is a use case where someone needs to harvest the full DDI, I would simply make BOTH flavors of the DDI harvestable. This is controlled by these booleans in the Exporter classes. In OAI_DDIExporter.java:

    @Override
    public Boolean isHarvestable() {
        return true;
    }

In DDIExporter.java:

    @Override
    public Boolean isHarvestable() {
        // No, we don't want this format to be harvested!
        // For datasets with tabular data the <data> portions of the DDIs
        // become huge and expensive to parse; even as they don't contain any
        // metadata useful to remote harvesters. -- L.A. 4.5
        return false;
    }

Changing the above to true would make the "long" format harvestable as well. (Of course, it would help to rename it to something more explicit, like "DDI with variables" or "DDI Full", or something like that.)
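To illustrate the effect of flipping that flag, here is a minimal, self-contained sketch. It assumes the OAI server offers exactly those formats whose exporter reports isHarvestable() as true. SketchExporter, exporter() and harvestableFormats() are hypothetical names invented for this example; they are not the actual Dataverse Exporter interface or its format-selection code.

```java
import java.util.List;
import java.util.stream.Collectors;

final class HarvestableSketch {

    // Hypothetical stand-in for an exporter: only the two members discussed in this thread.
    interface SketchExporter {
        String getProviderName();
        Boolean isHarvestable();
    }

    // Convenience factory for the example; not a Dataverse API.
    static SketchExporter exporter(String name, boolean harvestable) {
        return new SketchExporter() {
            @Override public String getProviderName() { return name; }
            @Override public Boolean isHarvestable() { return harvestable; }
        };
    }

    // Assumed selection rule: a format is offered over OAI only if its exporter is harvestable.
    static List<String> harvestableFormats(List<SketchExporter> exporters) {
        return exporters.stream()
                .filter(SketchExporter::isHarvestable)
                .map(SketchExporter::getProviderName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Current behaviour: only the light format is offered.
        System.out.println(harvestableFormats(List.of(exporter("oai_ddi", true), exporter("ddi", false))));
        // After the proposed change: both flavors are offered, and oai_ddi itself is untouched.
        System.out.println(harvestableFormats(List.of(exporter("oai_ddi", true), exporter("ddi", true))));
    }
}
```

Under that assumption, existing clients that request oai_ddi see no difference, while clients that need variable metadata can ask for the full format by its own name.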

landreev commented 2 years ago

(edited the comment above for clarity and typos)

landreev commented 2 years ago

TL;DR/short version: I feel like a good solution would be to a) leave the oai_ddi format as is; and this will ensure that nothing changes for all the already configured clients harvesting this format from Dataverse installations and b) rename the current "ddi" format "ddi_full" (or something similar) and make it harvestable.

scolapasta commented 2 years ago

@landreev This makes sense to me. @lubitchv what do you think?

lubitchv commented 2 years ago

Yes, it makes sense to me as well. I will make the DDI format harvestable. Regarding renaming, would it be enough to rename it only in Bundle.properties (dataset.exportBtn.itemLabel.ddi=DDI full), so that the new name is visible in the UI, or should the whole class be renamed? In the latter case there is a greater chance of breaking something.

landreev commented 2 years ago

Sure, just changing the label is better. To change the short name of the format, you wouldn't need to rename the class, but just change this line in DDIExporter.java:

    public static final String PROVIDER_NAME = "ddi";

But you are absolutely right, that would introduce a backward incompatibility. There are a few tests that rely on this format name, and it's used in the UI. It would need to be changed in all those places; but then there are most likely users who use the API as /api/datasets/export?exporter=ddi in their scripts also... So, yes, good call!
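A small sketch of why the label-only change is the safe one: the display label lives in the resource bundle, while the API parameter is built from PROVIDER_NAME, so editing one does not affect the other. The class and method names below are hypothetical, and the bundle is loaded from an inline string purely for illustration; only the property key, the "DDI full" label, and the "ddi" format name come from this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; not how Dataverse actually loads its bundle.
final class LabelVsFormatName {

    // Short format name from DDIExporter.java; left unchanged.
    static final String PROVIDER_NAME = "ddi";

    // Looks up the UI label for the format in bundle text given as a string.
    static String displayLabel(String bundleText) {
        Properties bundle = new Properties();
        try {
            bundle.load(new StringReader(bundleText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bundle.getProperty("dataset.exportBtn.itemLabel." + PROVIDER_NAME);
    }

    public static void main(String[] args) {
        // The label users see changes with the bundle entry...
        System.out.println(displayLabel("dataset.exportBtn.itemLabel.ddi=DDI full\n"));
        // ...while scripts that call the API keep using the same format name.
        System.out.println("/api/datasets/export?exporter=" + PROVIDER_NAME);
    }
}
```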