CeON / dataverse

Open source research data repository software. ICM dataverse version.

Explore possibilities of harvesting repod, rds and mxrdr resources by latest Harvard dataverse version #1436

Open madryk opened 3 years ago

madryk commented 3 years ago

Repod -> Harvard dataverse

Format DublinCore:

Example of a dataset with all metadata filled in:

<metadata_field from repod> -> <tag in dublin core> -> <metadata_field harvested by Harvard>
title                       -> <dc:title>           -> title
doi link                    -> <dc:identifier>      -> otherId:otherIdValue
author:authorName           -> <dc:creator>         -> author:authorName
language                    -> <dc:language>        -> language
keyword:keywordValue        -> <dc:subject>         -> keyword:keywordValue
subject                     -> <dc:subject>         -> keyword:keywordValue
dsDescription:dsDescriptionValue -> <dc:description> -> dsDescription:dsDescriptionValue
RepOD (name of root dv)     -> <dc:publisher>       -> producer:producerName
nothing                     -> nothing              -> datasetContact:datasetContactEmail == N/A
productionDate              -> <dc:date>            -> productionDate
depositor                   -> <dc:contributor>     -> contributor:contributorName
contributor:contributorName -> <dc:contributor>     -> contributor:contributorName
dataSources                 -> <dc:source>          -> dataSources
???                         -> <dc:relation>         -> ???
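
The table above shows that some repod fields share a Dublin Core tag, so they cannot be told apart after harvesting. A minimal sketch of that check, with the mapping transcribed from the table; the names `REPOD_TO_DC` and `lossy_targets` are made up for illustration, and the rows without a plain field-to-field mapping (publisher, dataset contact, relation) are left out:

```python
from collections import defaultdict

# repod field -> (Dublin Core tag, field seen on the harvesting side),
# transcribed from the table above. Hypothetical name, for illustration only.
REPOD_TO_DC = {
    "title": ("dc:title", "title"),
    "doi link": ("dc:identifier", "otherId:otherIdValue"),
    "author:authorName": ("dc:creator", "author:authorName"),
    "language": ("dc:language", "language"),
    "keyword:keywordValue": ("dc:subject", "keyword:keywordValue"),
    "subject": ("dc:subject", "keyword:keywordValue"),
    "dsDescription:dsDescriptionValue": ("dc:description", "dsDescription:dsDescriptionValue"),
    "productionDate": ("dc:date", "productionDate"),
    "depositor": ("dc:contributor", "contributor:contributorName"),
    "contributor:contributorName": ("dc:contributor", "contributor:contributorName"),
    "dataSources": ("dc:source", "dataSources"),
}


def lossy_targets(mapping):
    """Return harvested fields that receive more than one repod field."""
    sources = defaultdict(list)
    for repod_field, (_dc_tag, harvested_field) in mapping.items():
        sources[harvested_field].append(repod_field)
    return {target: fields for target, fields in sources.items() if len(fields) > 1}


for target, fields in lossy_targets(REPOD_TO_DC).items():
    print(f"{target} <- {', '.join(fields)}")
```

With this mapping, subject and keyword both end up as keywords, and depositor and contributor both end up as contributors.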

madryk commented 3 years ago

Repod -> Harvard dataverse

Format DDI:

Example of dataset with all filled metadata

Dataset search card presentation

original: Zrzut ekranu z 2021-02-16 22-51-29 harvested: Zrzut ekranu z 2021-02-16 22-51-35

File search card presentation

original: Zrzut ekranu z 2021-02-16 23-01-05 harvested: Zrzut ekranu z 2021-02-16 23-00-58

Translated metadata
-- citation metadata --
--- <stdyDscr><citation><titlStmt> ---
title                               -> <titl>                    -> title
subtitle                            -> <subTitl>                 -> subtitle
alternativeTitle                    -> <altTitl>                 -> alternativeTitle
author:authorName                   -> <AuthEnty>                -> author:authorName
author:authorAffiliation            -> <AuthEnty affiliation>    -> author:authorAffiliation
nothing                             -> nothing                   -> datasetContact:datasetContactEmail == N/A
--- <stdyDscr><stdyInfo> ---
dsDescription:dsDescriptionValue    -> <abstract>                -> dsDescription:dsDescriptionValue
subject                             -> <subject><keyword>        -> keyword:keywordValue
keyword:keywordValue                -> <subject><keyword>        -> keyword:keywordValue
keyword:keywordVocab                -> <subject><keyword vocab>  -> keyword:keywordVocabulary
topicClassification:topicClassValue -> <subject><topcClas>       -> topicClassification:topicClassValue
topicClassification:topicClassVocab -> <subject><topcClas vocab> -> topicClassification:topicClassVocab
notesText                           -> <notes>                   -> notesText
--- <stdyDscr><citation><prodStmt> ---
producer:producerAbbreviation       -> <producer abbr>           -> producer:producerAbbreviation
producer:producerAffiliation        -> <producer affiliation>    -> producer:producerAffiliation
producer:producerURL                -> <producer URI>            -> producer:producerURL
producer:producerName               -> <producer>                -> producer:producerName
producer:producerLogoURL            -> <producer role>           -> producer:producerLogoURL
productionDate                      -> <prodDate>                -> productionDate
productionPlace                     -> <prodPlac>                -> productionPlace
grantNumber:grantNumberAgency       -> <grantNo agency>          -> grantNumber:grantNumberAgency
grantNumber:grantNumberValue        -> <grantNo>                 -> grantNumber:grantNumberValue
--- <stdyDscr><citation><distStmt> ---               
RepOD (name of root dv)             -> <distrbtr>                -> distributor:distributorName
system publication date             -> <distDate>                -> distributionDate
timePeriodCovered:timePeriodCoveredStart -> * - <timePrd cycle="P1" event="start">  -> timePeriodCovered:timePeriodCoveredStart
timePeriodCovered:timePeriodCoveredEnd   -> * - <timePrd cycle="P1" event="end">    -> timePeriodCovered:timePeriodCoveredEnd
dateOfCollection:dateOfCollectionStart   -> * - <collDate cycle="P1" event="start"> -> dateOfCollection:dateOfCollectionStart
dateOfCollection:dateOfCollectionEnd     -> * - <collDate cycle="P1" event="end">   -> dateOfCollection:dateOfCollectionEnd
-- geospatial metadata --
--- <stdyDscr><stdyInfo><sumDscr> ---
** - mess                           ->                           -> geographicCoverage:country
** - mess                           ->                           -> geographicCoverage:otherGeographicCoverage
geographicUnit                      -> <geogUnit>                -> geographicUnit
-- social science metadata --
--- <stdyDscr><stdyInfo><sumDscr> ---
universe                            -> <universe>                -> universe
--- <stdyDscr><method><dataColl> ---                
dataCollector                       -> <dataCollector>           -> dataCollector
collectorTraining                   -> <collectorTraining>       -> collectorTraining
frequencyOfDataCollection           -> <frequenc>                -> frequencyOfDataCollection
deviationsFromSampleDesign          -> <deviat>                  -> deviationsFromSampleDesign
dataCollectionSituation             -> <collSitu>                -> dataCollectionSituation
actionsToMinimizeLoss               -> <actMin>                  -> actionsToMinimizeLoss
controlOperations                   -> <conOps>                  -> controlOperations
cleaningOperations                  -> <cleanOps>                -> cleaningOperations
--- <stdyDscr><method><anlyInfo> ---                
responseRate                        -> <respRate>                -> responseRate
otherDataAppraisal                  -> <dataAppr>                -> otherDataAppraisal
--- <stdyDscr><method> ---                                                        
socialScienceNotes:socialScienceNotesType    -> <notes type>     -> socialScienceNotes:socialScienceNotesType
socialScienceNotes:socialScienceNotesSubject -> <notes subject>  -> socialScienceNotes:socialScienceNotesSubject
socialScienceNotes:socialScienceNotesText    -> <notes>          -> socialScienceNotes:socialScienceNotesText

* - value or date attribute
** - values are mixed up. In repod we have:

geographicCoverage: [{
  country: "Algeria",
  state: "Stan w Algierii?",
  city: "Nie znam",
  otherGeographicCoverage: "Elo inne"
}, {
  country: "Poland",
  state: "Woj. Mazowieckie",
  city: "Warszawa",
  otherGeographicCoverage: "Nie warszawa"
}]

is translated to:

geographicCoverage: [{
  otherGeographicCoverage: "Nie znam; Stan w Algierii?; Elo inne"
}, {
  country: "Algeria",
  otherGeographicCoverage: "Woj. Mazowieckie"
}, {
  country: "Poland",
  otherGeographicCoverage: "Warszawa; Nie warszawa"
}]

madryk commented 3 years ago

Repod -> Harvard dataverse

Format JSON (native):

If any of the following metadata fields is filled in, the dataset will NOT be harvested:

Example of dataset with all filled metadata

Dataset search card presentation

original: Zrzut ekranu z 2021-02-16 22-51-29 harvested: Zrzut ekranu z 2021-02-17 00-17-35

The date seems to be the date when the last datasetVersion was released. This is slightly different from what is presented in the original repository, where we show the date of the last MAJOR datasetVersion release. For example: V1.0, V1.1, V2.0 (this date), V2.1.

The year in the citation is taken from the metadata field Distribution Date.

The part "Jestem, Dystrybutorem" is taken from the metadata field Distributor - Name.
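
The difference between the date on the harvested search card and the date shown in the original repository can be illustrated with a small sketch. The release dates below are made up, and the function names are hypothetical; only the version sequence V1.0, V1.1, V2.0, V2.1 comes from the example in this comment.

```python
from datetime import date

# Hypothetical release history; the dates are invented for illustration.
releases = [
    ("1.0", date(2021, 1, 10)),
    ("1.1", date(2021, 1, 20)),
    ("2.0", date(2021, 2, 1)),
    ("2.1", date(2021, 2, 15)),
]


def version_key(version):
    """Turn "2.1" into (2, 1) so versions compare numerically."""
    major, minor = version.split(".")
    return int(major), int(minor)


def last_release_date(releases):
    """Date of the latest released version (what the harvested card seems to show)."""
    return max(releases, key=lambda r: version_key(r[0]))[1]


def last_major_release_date(releases):
    """Date of the latest MAJOR version, i.e. minor == 0 (what the original shows)."""
    majors = [r for r in releases if version_key(r[0])[1] == 0]
    return max(majors, key=lambda r: version_key(r[0]))[1]


print("harvested:", last_release_date(releases))
print("original: ", last_major_release_date(releases))
```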

File search card presentation

original: Zrzut ekranu z 2021-02-16 23-01-05 harvested: Zrzut ekranu z 2021-02-17 00-57-33

The date seems to be the date when the file was harvested. MD5 and file size info are harvested, but they are not shown.

Translated metadata

All metadata will be harvested as they are in the original dataset, except:

wfenrich commented 3 years ago

Ad. DC on Harvard:

They seem to use dcterms:relation for related datasets and dcterms:isReferencedBy for related publication (in our case only some related publications would count as referencing).

They do not seem to make any use of related materials.

Also, terms and rights need some changes on the repod side.

wfenrich commented 3 years ago

My proposal of how we can inject terms into DDI:

If the dataset IS NOT under embargo

AND all files in the dataset are under the same licence or terms, and none of them is Restricted Access:

    <notes type="DVN:TOU" level="dv">[Universal License Name]</notes>
    <notes type="DVN:TOA" level="dv"></notes>

AND files in the dataset are under different licences, but none of them is Restricted Access:

    <notes type="DVN:TOU" level="dv">Different licenses or terms for individual files.</notes>
    <notes type="DVN:TOA" level="dv"></notes>

AND files in the dataset are under different licences AND at least one of them is Restricted Access:

    <notes type="DVN:TOU" level="dv">Different licenses or terms for individual files.</notes>
    <notes type="DVN:TOA" level="dv">Access to some files in this dataset is restricted.</notes>

AND all files in the dataset are restricted access AND all of them have the same subterms: 

    <notes type="DVN:TOU" level="dv"></notes>
    <notes type="DVN:TOA" level="dv">Access to all files in this dataset is restricted. [Subterms text, for instance: For academic purposes only, not for redistribution]. </notes>

AND all files in the dataset are restricted access AND they have different subterms: 

    <notes type="DVN:TOU" level="dv"></notes>
    <notes type="DVN:TOA" level="dv">Access to all files in this dataset is restricted. Different terms for individual files. </notes>

The dataset IS under embargo:

    <notes type="DVN:TOU" level="dv"></notes>
    <notes type="DVN:TOA" level="dv">Access to all files in this dataset is embargoed. Files in this dataset will be available from [Embargo date YYYY-MM-DD].</notes>
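
The rules above could be sketched roughly as follows. This is only an illustration under several assumptions: the `FileTerms` model, the function names, and the handling of a dataset without files are all hypothetical and not part of the repod code base, and the sketch assumes that the sentence about the embargo date belongs inside the DVN:TOA note.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from xml.sax.saxutils import escape


@dataclass
class FileTerms:
    """Hypothetical per-file model, for illustration only."""
    restricted: bool
    terms: str  # licence name for open files, subterms text for restricted ones


DIFFERENT = "Different licenses or terms for individual files."


def ddi_terms(files: List[FileTerms], embargo_date: Optional[str] = None) -> Tuple[str, str]:
    """Return the (DVN:TOU, DVN:TOA) texts following the proposal above."""
    if embargo_date is not None:
        return "", ("Access to all files in this dataset is embargoed. "
                    f"Files in this dataset will be available from {embargo_date}.")
    if not files:
        # Assumption: the proposal does not say what to do without files.
        raise ValueError("the proposal does not cover datasets without files")
    restricted = [f for f in files if f.restricted]
    distinct_terms = {f.terms for f in files}
    if not restricted:
        tou = files[0].terms if len(distinct_terms) == 1 else DIFFERENT
        return tou, ""
    if len(restricted) < len(files):
        return DIFFERENT, "Access to some files in this dataset is restricted."
    if len(distinct_terms) == 1:
        return "", f"Access to all files in this dataset is restricted. {files[0].terms}"
    return "", ("Access to all files in this dataset is restricted. "
                "Different terms for individual files.")


def render_notes(tou: str, toa: str) -> str:
    """Render both notes elements, escaping XML special characters."""
    return (f'<notes type="DVN:TOU" level="dv">{escape(tou)}</notes>\n'
            f'<notes type="DVN:TOA" level="dv">{escape(toa)}</notes>')


print(render_notes(*ddi_terms([FileTerms(False, "CC0"), FileTerms(True, "Academic use only")])))
```

In this sketch a mix of open and restricted files is always treated as "different licences", since an open licence and restricted access can never be the same terms.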