eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

metadata.json format for Archivematica #11

Closed: photomedia closed this issue 3 years ago

photomedia commented 5 years ago

According to the spec, we export basic Dublin Core metadata in this file: eprintid-date/metadata/metadata.json. The file contains basic Dublin Core formatted so that Archivematica can index and process it. Can we have a link to the specification for that JSON format, including how Archivematica expects the fields to be named? An example JSON file would be helpful as well.

tw4l commented 5 years ago

Here is the relevant section from the Archivematica 1.10 documentation: https://www.archivematica.org/en/docs/archivematica-1.10/user-manual/transfer/import-metadata/

tw4l commented 5 years ago

Link to JSON specification: https://www.archivematica.org/en/docs/archivematica-1.10/user-manual/transfer/import-metadata/#adding-metadata-using-json-files

tw4l commented 5 years ago

Note that because we want to add metadata that applies to the entire transfer rather than to individual files, we would construct the JSON slightly differently from the example in the documentation linked above. Instead of "filename": "(whatever file name)", we would write "parts": "objects" (based on previous experience; I will verify this).

tw4l commented 4 years ago

Example showing how to handle repeatable fields ("dc.subject" in this case):

[
  {
    "filename": "objects/",
    "dc.subject": [
      "DNA sonification",
      "second keyword"
    ],
    "dc.title": "Carex Siderosticta Plastid - Photosystem II",
    "dc.description": "This composition is based on musical scores generated by software we developed that maps DNA sequences into musical notation. This particular example converts genes responsible for photosynthesis (photosystem II) found on the plastid of a carex siderosticta plant. We then had that score performed on two violins. We focused on the coding of the photosystem genes. However, the development of the software means that one could quickly convert any of the over 100 million individual sequences in GenBank into a musical score.\r\nThe software we developed parses FASTA nucleotide coding sequence files, and maps these into a musical composition. The algorithm maps each of the 20 amino acids onto specific pitches and each codon synonym onto duration for those pitches. The mapping is listed in Table 1. We used quartertones to be able to keep the results inside an octave. Rests at the end of a bar are added to create an 8/4 time signature – each amino acid note duration is based on which codon synonym appears in the sequence. The rests are added to avoid having to split notes across a bar. Our program writes out the results into two musical score files, one for the genes on each of the two strands of DNA. The resulting files use Lilypond format to express a musical score for the genes located on each strand of DNA. Finally, the open source Lilypond program is used to generate the PDF and MIDI files of the score and the two scores are played simultaneously.",
    "dc.date": "2016"
  }
]
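
As a rough illustration of the structure above, here is a minimal Python sketch that builds a single transfer-level entry and keeps repeatable fields as JSON arrays. The function name `build_transfer_metadata` and its `target` parameter are hypothetical, and the "filename": "objects/" key simply follows the example in this comment. The plugin itself is not written this way; this only shows the intended shape of the file.

```python
import json


def build_transfer_metadata(fields, target="objects/"):
    """Return a one-entry metadata.json structure for the given target path.

    `fields` maps Dublin Core names (e.g. "dc.subject") to a string or a
    list of strings. Multi-valued fields are kept as JSON arrays.
    """
    entry = {"filename": target}
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            values = [str(v) for v in value]
            # A single value is written as a plain string, as in the example.
            entry[name] = values[0] if len(values) == 1 else values
        else:
            entry[name] = str(value)
    return [entry]


metadata = build_transfer_metadata({
    "dc.title": "Carex Siderosticta Plastid - Photosystem II",
    "dc.subject": ["DNA sonification", "second keyword"],
    "dc.date": "2016",
})
print(json.dumps(metadata, indent=2))
```

Writing the result with `json.dumps` rather than assembling strings by hand keeps quoting and escaping of long descriptions correct.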

photomedia commented 4 years ago

I ran some tests on our development repository, and the metadata.json file doesn't look right: there are nesting issues. Here is what the file looks like for an eprint with multiple documents:

{
    "dc.rights": [
      [
        [
          [
            [
              "term_access",
              null
            ],
            null
          ],
          null
        ],
        null
      ],
      null
    ],
    "dc.relation": "https:\/\/repository...\/9550\/",
    "dc.creator": [
      [
        "Carson, Pamela",
        "Neugebauer, Tomasz"
      ],
      "Han, Bin"
    ],
    "dc.format": [
      [
        [
          [
            [
              "text",
              "image"
            ],
            "image"
          ],
          "image"
        ],
        "image"
      ],
      "image"
    ],
    "dc.title": "TEST - ARCHIVEMATICA",
    "dc.date": "2014-11-20",
    "dc.type": [
      "Article",
      "NonPeerReviewed"
    ],
    "dc.language": [
      [
        [
          [
            [
              "en",
              "en"
            ],
            "en"
          ],
          "en"
        ],
        "en"
      ],
      "en"
    ],
    "dc.identifier": [
      [
        [
          [
            [
              [
                "https:\/\/repository...docx",
                "https:\/\/repository...image1.png"
              ],
              "https:\/\/repository...image3.png"
            ],
            "https:\/\/repository...image5.png"
          ],
          "https:\/\/repository...image4.png"
        ],
        "https:\/\/repository...image2.png"
      ],
      " Neugebauer, Tomasz ORCID: https:\/\/orcid.org\/0000-0002-9743-5910 <https:\/\/orcid.org\/0000-0002-9743-5910> and Han, Bin  (2014) TEST - ARCHIVEMATICA.        (Submitted)  "
    ]
  }
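
A quick way to spot this kind of problem in an exported file is to check each field for nested lists or null entries. The sketch below is only a diagnostic aid, not part of the plugin. The function name `find_nesting_problems` is hypothetical, and the `sample` dictionary is an abbreviated excerpt of the output above (the dc.rights nesting is shortened).

```python
def find_nesting_problems(metadata):
    """Return {field: [problem, ...]} for list values that are not flat.

    Only the top level of each list is inspected, which is enough to
    flag a field as malformed.
    """
    problems = {}
    for field, value in metadata.items():
        if not isinstance(value, list):
            continue
        issues = []
        if any(isinstance(item, list) for item in value):
            issues.append("nested list")
        if any(item is None for item in value):
            issues.append("null value")
        if issues:
            problems[field] = issues
    return problems


# Abbreviated excerpt of the export shown above (assumption: shortened nesting).
sample = {
    "dc.rights": [[["term_access", None], None], None],
    "dc.creator": [["Carson, Pamela", "Neugebauer, Tomasz"], "Han, Bin"],
    "dc.type": ["Article", "NonPeerReviewed"],
    "dc.title": "TEST - ARCHIVEMATICA",
}
print(find_nesting_problems(sample))
```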

photomedia commented 3 years ago

The JSON export issue is resolved by this commit: https://github.com/eprintsug/EPrintsArchivematica/commit/1977276e1b310b95f6006851c1cbd656d69bc2a3. The null values in dc.rights were just a local bug in the DC export (this JSON draws on the DC export). The updated output for this test item looks like this:

  {
    "dc.date": [
      "2014-11-20"
    ],
    "dc.creator": [
      "Carson, Pamela",
      "Neugebauer, Tomasz",
      "Han, Bin"
    ],
    "dc.identifier": [
      "https:\/\/lib...Draft.docx",
      "https:\/\/lib...image1.png",
      "https:\/\/lib...image3.png",
      "https:\/\/lib...image5.png",
      "https:\/\/lib...image4.png",
      "https:\/\/lib...image2.png",
      "  Carson, Pamela, Neugebauer, Tomasz ORCID: https:\/\/orcid.org\/0000-0002-9743-5910 <https:\/\/orcid.org\/0000-0002-9743-5910> and Han, Bin  (2014) TEST - ARCHIVEMATICA.        (Submitted)  "
    ],
    "dc.type": [
      "Article",
      "NonPeerReviewed"
    ],
    "dc.rights": [
      "term_access",
      "cc_by",
      "cc_gnu_gpl",
      "term_access"
    ],
    "dc.language": [
      "en",
      "en",
      "en",
      "en",
      "en",
      "en"
    ],
    "dc.relation": [
      "https:\/\/lib..."
    ],
    "dc.title": [
      "TEST - ARCHIVEMATICA"
    ],
    "dc.format": [
      "text",
      "image",
      "image",
      "image",
      "image",
      "image"
    ]
  }
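
For readers who want to see the difference between the earlier nested output and the flat output above, the following Python sketch flattens arbitrarily nested lists into one list per field. It is not the code from the commit; the names `flatten` and `flatten_metadata` are hypothetical. Dropping nulls is an assumption made for the illustration only, since in this thread the nulls came from a local DC export bug rather than from the nesting itself.

```python
def flatten(value):
    """Yield the non-null leaf values of a nested list, in order."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    elif value is not None:  # assumption: nulls are simply dropped
        yield value


def flatten_metadata(metadata):
    """Return a copy of `metadata` where every field is a flat list."""
    return {field: list(flatten(value)) for field, value in metadata.items()}


# Excerpt of the nested export reported earlier in the thread.
nested = {
    "dc.creator": [["Carson, Pamela", "Neugebauer, Tomasz"], "Han, Bin"],
    "dc.format": [[[[["text", "image"], "image"], "image"], "image"], "image"],
    "dc.title": "TEST - ARCHIVEMATICA",
}
flat = flatten_metadata(nested)
print(flat)
```

Single values such as dc.title end up wrapped in a one-element list, which matches the shape of the updated output.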