IQSS / dataverse

Open source research data repository software
http://dataverse.org

method to create a dataset based on SWORDv2 Atom Entry #572

Closed eaquigley closed 10 years ago

eaquigley commented 10 years ago

Author Name: Philip Durbin (@pdurbin)
Original Redmine Issue: 3991, https://redmine.hmdc.harvard.edu/issues/3991
Original Date: 2014-05-19
Original Assignee: Leonid Andreev


Given a SWORDv2 Atom Entry with a Dataverse-specific "attribute hack" for "dcterms:isReferencedBy", we need a method that will produce a dataset.

We use the extra attributes in dcterms:isReferencedBy (holdingsURI, agency, and IDNo) to link back to journal articles in Open Journal Systems (OJS) as described at http://www.mail-archive.com/sword-app-tech@lists.sourceforge.net/msg00386.html

Here's the crosswalk that's used in DVN 3.x: https://github.com/IQSS/dvn/blob/master/working_directory/dcmi_terms2ddi.xsl

(Not that the new solution needs to be a crosswalk.)

We might need to think some more about required fields, such as "Subject" and "Contact E-mail". Should we silently fill in "Other" for "Subject"? Should we silently fill in the contact email of the parent dataverse for "Contact E-mail"? "Author" and "Description" are also required. Should the method refuse to create a dataset object if dcterms:creator and dcterms:description are not populated?
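To make the options concrete, here is a minimal Java sketch of one possible policy: default "Subject" and "Contact E-mail", but refuse to proceed when "Author" or "Description" is missing. This is not a decision and not Dataverse code; the class name, the method name, and the map keys ("author", "description", "subject", "contactEmail") are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names exist in Dataverse.
class RequiredFieldPolicySketch {

    /**
     * Applies one possible policy for required fields.
     * Returns Optional.empty() when author or description is missing,
     * otherwise a copy of the fields with defaults filled in.
     * parentDataverseContactEmail is assumed to be non-null.
     */
    static Optional<Map<String, List<String>>> applyDefaults(
            Map<String, List<String>> fields, String parentDataverseContactEmail) {
        if (isBlank(fields.get("author")) || isBlank(fields.get("description"))) {
            return Optional.empty(); // reject: nothing sensible to default to
        }
        Map<String, List<String>> result = new LinkedHashMap<>(fields);
        if (isBlank(result.get("subject"))) {
            result.put("subject", List.of("Other"));
        }
        if (isBlank(result.get("contactEmail"))) {
            result.put("contactEmail", List.of(parentDataverseContactEmail));
        }
        return Optional.of(result);
    }

    // A field counts as blank when it is absent, empty, or only whitespace.
    private static boolean isBlank(List<String> values) {
        return values == null
                || values.stream().allMatch(v -> v == null || v.trim().isEmpty());
    }
}
```

Returning an empty Optional rather than throwing leaves it to the caller (for example, the SWORD layer) to decide how to report the rejection to the depositor.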

It's unclear if "dcterms:identifier" should be supported or not. In DVN 3.x, it allowed you to modify the globalId of a study under some conditions.

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:dcterms="http://purl.org/dc/terms/">
   <dcterms:title>Roasting at Home</dcterms:title>
   <dcterms:creator>Peets, John</dcterms:creator>
   <dcterms:creator>Stumptown, Jane</dcterms:creator>
   <!-- Producer with financial or admin responsibility of the data -->
   <dcterms:publisher>Coffee Bean State University</dcterms:publisher>
   <!-- related publications -->
   <dcterms:isReferencedBy holdingsURI="http://dx.doi.org/10.1038/dvn333" agency="DOI"
       IDNo="10.1038/dvn333">Peets, J., &amp; Stumptown, J. (2013). Roasting at Home. New England Journal of Coffee, 3(1), 22-34.</dcterms:isReferencedBy>
   <!-- production date -->
   <dcterms:date>2013-07-11</dcterms:date>
   <!-- Other Identifier for the data in this study (or potentially global id if unused) -->
   <!--
   <dcterms:identifier>hdl:1XXZY.1/XYXZ</dcterms:identifier>
   -->
   <dcterms:description>Considerations before you start roasting your own coffee at home.</dcterms:description>
   <!-- keywords -->
   <dcterms:subject>coffee</dcterms:subject>
   <dcterms:subject>beverage</dcterms:subject>
   <dcterms:subject>caffeine</dcterms:subject>
   <!-- geographic coverage -->
   <dcterms:coverage>United States</dcterms:coverage>
   <dcterms:coverage>Canada</dcterms:coverage>
   <!-- kind of data -->
   <dcterms:type>aggregate data</dcterms:type>
   <!-- List of sources of the data collection-->
   <dcterms:source>Stumptown, Jane. 2011. Home Roasting. Coffeemill Press.</dcterms:source>
   <!-- restrictions -->
   <dcterms:rights>Creative Commons CC-BY 3.0 (unported) http://creativecommons.org/licenses/by/3.0/</dcterms:rights>
   <!-- related materials -->
   <dcterms:relation>Peets, John. 2010. Roasting Coffee at the Coffee Shop. Coffeemill Press</dcterms:relation>
</entry>

(For a more generic SWORDv2 Atom Entry example, see http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_editingcontent_metadata )
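As a starting point for such a method, here is a minimal Java sketch that reads the dcterms elements and the three extra dcterms:isReferencedBy attributes out of an entry like the one above. The class and method names (AtomEntrySketch, dctermsValues, relatedPublication, parse) are made up for illustration and are not Dataverse code. It uses only the JDK's DOM parser and stops short of building a real dataset object.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Dataverse.
class AtomEntrySketch {
    static final String DCTERMS_NS = "http://purl.org/dc/terms/";

    /** Parses the entry with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse Atom entry", e);
        }
    }

    /** Groups the text of every dcterms element by local name (title, creator, ...). */
    static Map<String, List<String>> dctermsValues(Document doc) {
        Map<String, List<String>> values = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagNameNS(DCTERMS_NS, "*");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            values.computeIfAbsent(el.getLocalName(), k -> new ArrayList<>())
                  .add(el.getTextContent().trim());
        }
        return values;
    }

    /** Reads holdingsURI, agency and IDNo from the first dcterms:isReferencedBy, if any. */
    static Map<String, String> relatedPublication(Document doc) {
        Map<String, String> attrs = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagNameNS(DCTERMS_NS, "isReferencedBy");
        if (nodes.getLength() > 0) {
            Element el = (Element) nodes.item(0);
            for (String name : new String[] {"holdingsURI", "agency", "IDNo"}) {
                attrs.put(name, el.getAttribute(name)); // "" when the attribute is absent
            }
        }
        return attrs;
    }
}
```

Repeated elements such as dcterms:creator come back as lists in document order, and elements that are commented out in the XML (like dcterms:identifier above) are simply not seen by the parser.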


Redmine related issue(s): 3385


eaquigley commented 10 years ago

Original Redmine Comment Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-05-19T20:51:06Z


Assigning to Leonid based on discussion with him and Gustavo.

pdurbin commented 10 years ago

I'm actively testing the importXML method with SWORD. Please look for "572" (this ticket number) in this Google Doc for related issues: https://docs.google.com/document/d/11DpdKyp1tagmaJAAzRqQBEZEZ69WOOOYsoqhz8UsfNM/edit?usp=sharing

pdurbin commented 10 years ago

@landreev as I just mentioned, I was surprised to discover that I can put "foo" in the metadatablockname column for a field and importXML still just works fine.

I discovered this because @posixeleni moved kindOfData from one block to another in 596d82c8934846807968fb7f5a24c91edcc7ec7e in #754 and I assumed I'd need to update the INSERT INTO foreignmetadatafieldmapping line in the SQL reference data script. It turns out I didn't need to... but now I'm wondering... if that metadatablockname column isn't even being used, should we simply remove it? Otherwise, that data is going to get stale.

landreev commented 10 years ago

Still need to look into this... I thought it made sense to look up the fields by both the name of the field and the block name, which apparently my code isn't doing at this point. But rather than removing the metadatablockname column, I would rather fix it so that it does look up on both. I mean, it seems wrong to assume that field names are unique across metadatablocks.

pdurbin commented 10 years ago

> I mean, it seems wrong to assume that field names are unique across metadatablocks.

Maybe. For Advanced Search to work, I assume field names are unique. @posixeleni changed type to astroType for me. If this assumption is problematic, let's hash it out sooner rather than later. Personally, I'd like to see the uniqueness of field names enforced while the tsv files are being loaded.

That said, your proposed fix to the importXML method seems fine. Belt and suspenders I guess... the lookup on both. I just don't want a column that isn't used at all.
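Enforcing uniqueness at load time, as suggested above, could look roughly like the following Java sketch. It is only an illustration: the two-column layout (block name, then field name, separated by a tab) is a deliberate simplification and not the real metadata block tsv format, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the two-column input is a simplification of the real tsv files.
class FieldNameUniquenessSketch {

    /**
     * Returns a map of field name to block name.
     * Throws IllegalStateException as soon as a field name appears a second time,
     * whether in the same block or in a different one.
     */
    static Map<String, String> loadFieldNames(List<String> tsvLines) {
        Map<String, String> blockByField = new HashMap<>();
        for (String line : tsvLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected 2 columns: " + line);
            }
            String block = cols[0].trim();
            String field = cols[1].trim();
            String previous = blockByField.putIfAbsent(field, block);
            if (previous != null) {
                throw new IllegalStateException("Field name '" + field
                        + "' is already defined in block '" + previous + "'");
            }
        }
        return blockByField;
    }
}
```

Failing the load on the first duplicate means a clash such as two blocks both defining "type" is caught when the reference data is loaded rather than later at search time.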

landreev commented 10 years ago

OK, so we are talking about the metadatablockname column. No, it is not being used. And yes, it is redundant, because metadata field names are guaranteed to be unique across all metadata blocks. I initially thought we were talking about the formatname column in the metadata mapping itself, where uniqueness is not guaranteed and where my comment from DatasetFieldService applies:

/*
 * Similar method for looking up foreign metadata field mappings, for metadata
 * imports. For these the uniqueness of names isn't guaranteed (i.e., there
 * can be a field "author" in many different formats that we want to support),
 * so these have to be looked up by both the field name and the name of the
 * foreign format.
 */
public ForeignMetadataFieldMapping findFieldMapping(String formatName, String pathName) { ... }
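To illustrate why the lookup needs both keys, here is a small in-memory Java sketch. It is not the DatasetFieldService implementation (which is elided above); the class ForeignMappingLookupSketch, its methods, and the format and field names in the usage note are all hypothetical stand-ins.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for the database lookup; not Dataverse code.
class ForeignMappingLookupSketch {
    // Outer key: foreign format name. Inner key: path name within that format.
    private final Map<String, Map<String, String>> byFormat = new HashMap<>();

    /** Registers the dataset field that a (format, path) pair maps to. */
    void add(String formatName, String pathName, String datasetFieldName) {
        byFormat.computeIfAbsent(formatName, k -> new HashMap<>())
                .put(pathName, datasetFieldName);
    }

    /** Looks up by both keys; the same path name may exist under several formats. */
    Optional<String> find(String formatName, String pathName) {
        Map<String, String> paths =
                byFormat.getOrDefault(formatName, Collections.emptyMap());
        return Optional.ofNullable(paths.get(pathName));
    }
}
```

With two made-up formats that both have an "author" path, `find("formatA", "author")` and `find("formatB", "author")` can return different dataset fields, which a lookup keyed on the path name alone could not express.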

OK, I'm going ahead and removing the column, as redundant. Do note that we are junking this implementation of foreign metadata import. But there is a chance that the class and the table ForeignMetadataFieldMapping that I created could still be used there; likely with lots of additions/modifications. So that's the only reason I'm still willing to spend any time maintaining it.

landreev commented 10 years ago

I'm putting this ticket into QA. The only QA applicable would be to try some SWORD ingest test that was working before and check that it still works. No changes in functionality/logic were actually made. I only dropped one column that was not being used from one db table. (a db update isn't necessary either: if there's a column in a table that's no longer being used, EJB is OK with that)

pdurbin commented 10 years ago

> a db update isn't necessary either

@landreev yes, but we need to update the reference data script. I'll steal this ticket from QA.

pdurbin commented 10 years ago

> we need to update the reference data script

Fixed in 7e9c3d1. Passing to QA.

esotiri commented 10 years ago

Pulled the latest and dropped the db; the reference data script fix worked.

Creating a dataset and adding a file (png) via SWORD worked OK.