DILCISBoard / E-ARK-CSIP

E-ARK Common Specification for Information Packages
http://earkcsip.dilcis.eu
Creative Commons Attribution 4.0 International

Purpose of CSIP structMap #426

Closed: koit closed 4 years ago

koit commented 5 years ago

The explanation of the purpose of structMap in mets.xsd and METSPrimer.pdf is clear and METSPrimer has some good examples of its use (see pages 62 and 65). The purpose of CSIP structMap is less clear and this makes it hard to contextualise the 33 requirements (and thus, to create a valid IP).

The intro text of 5.3.6. "Use of the METS structural map (element structMap)" states:

In CSIP the structMap describes the higher level structure of all the content in the root and may link to representations. /…/

  • The internal structure of the structural map (expressed by div elements) follows the CSIP high level physical structure as described in Section 4, therefore grouping together metadata, representations, schemas, documentation and user-defined folders into their own div elements; /…/
  • In case both root and representation METS files exist, the structural map in the root METS file
    • Reference the fileGrp which describes all files in all folders with the exception of the content of the representation folders
    • Lists all representations (as separate div elements)
    • Lists only the appropriate representation METS file using the mptr element as the content of the representation

This can be summed up as: "The Purpose of CSIP structMap is to mirror the physical folder structure of the IP and if representations are present then point to the METS.xml files that describe them." This conclusion is mirrored by the examples:

But why is it necessary? The same info can be derived by parsing the fileSec/fileGrp/file and mdRef elements of all METS.xml files in the IP. Processing speed could be the added value here, as the structure can be read quickly from structMap, compared to the effort of reverse engineering the structure from the descriptions of individual files. However, this duplicates information, which creates a risk of conflicting descriptions.
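To illustrate the "reverse engineering" route, here is a minimal sketch that derives the folder structure from the FLocat and mdRef links of a single METS file. It assumes that all xlink:href values are relative paths with forward slashes; the function name and the sample document are made up for illustration and are not taken from the specification.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def folders_from_links(mets_xml: str) -> set[str]:
    """Collect every folder implied by FLocat and mdRef hrefs (hypothetical helper)."""
    root = ET.fromstring(mets_xml)
    folders: set[str] = set()
    for tag in ("FLocat", "mdRef"):
        for element in root.iter(f"{{{METS_NS}}}{tag}"):
            href = element.get(f"{{{XLINK_NS}}}href")
            if not href:
                continue
            # Assumption: href is a relative path such as "metadata/descriptive/ead.xml".
            for parent in PurePosixPath(href).parents:
                if str(parent) != ".":
                    folders.add(str(parent))
    return folders


# Made-up sample, not a valid CSIP package description.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <dmdSec ID="dmd-1">
    <mdRef LOCTYPE="URL" MDTYPE="EAD" xlink:type="simple"
           xlink:href="metadata/descriptive/ead.xml"/>
  </dmdSec>
  <fileSec>
    <fileGrp ID="grp-1" USE="Documentation">
      <file ID="file-1">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="documentation/readme.txt"/>
      </file>
    </fileGrp>
  </fileSec>
</mets>"""

derived_folders = folders_from_links(SAMPLE_METS)
```

A full check would have to repeat this for every representation METS.xml as well, which is the effort a folder-mirroring structMap is meant to save.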

There is also a rather softly posed requirement "Reference the fileGrp which describes all files in all folders /…/" to be used in the case of representations. It seems mandatory when representations are present, so it should be made an explicit SHOULD rule. Also, if it makes sense for representations, it is equally reasonable to have it for non-rep cases, too. There should be clear instructions on where and how to place the fileGrp references.

On another thought, shouldn't the purpose of CSIP structMap be to describe the conceptual, rather than the physical structure of the package? As the folder structure is not mandatory any more, we might see folder structures like this:

SIP_001/
       |- data1/
       |- data2/
       |- data3/
       |- data4/
       |- metadata/

In such a case, the content files could be spread arbitrarily across the data folders, e.g. to make each data folder fit some size limit. Or they could be grouped by file name, as is often done in uuid-named or sequentially named file and folder structures, e.g. the actual structure of MS Outlook 2011 for Mac (the letters seem to mean Trillion, Billion, Million, Kilo, each folder containing up to 1000 items):

Messages/
        |- 0T/
             |- 0B/
                  |- 0M/
                       |- 6K/
                       |    |- x00_6021.olk14Message
                       |    |- x00_6022.olk14Message
                       |    |- x00_6023.olk14Message
                       |
                       |- 186K/
                       |      |- x00_186029.olk14Message
                       |      |- x00_186030.olk14Message
                       |      |- x00_186031.olk14Message
                       |      |- x00_186032.olk14Message
                       |
                       |- 192K/
                              |- x00_192058.olk14Message
                              |- x00_192059.olk14Message

A conceptual CSIP structMap would make a lot of sense in such cases.

Anyway, no matter what the purpose of CSIP structMap, it should be stated clearly, supported by the requirements and illustrated with intuitive examples.

I know it sounds like useless theorising, but structure-related requirements are currently not clear (we experienced this when creating minimal valid IPs, see https://github.com/DILCISBoard/eark-ip-test-corpus/pull/211 and https://github.com/DILCISBoard/eark-ip-test-corpus/pull/212), and I've got a hunch that the lack of clarity might stem from unclear purpose statements.

koit commented 5 years ago

After analysing it for a few days, the issue of structMap and fileSec labour division seems overwhelming, with too many loose ends.

A good example is referencing. mets.xsd prescribes referencing in only one direction and only at the individual file level: from structMap/div/fptr to fileSec/fileGrp/file. There is no "proper" way to reference a fileGrp (the workaround via structMap/div/@CONTENTIDS involves a type mismatch of xs:anyURI vs xs:ID).
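The one-directional fptr-to-file referencing can at least be checked mechanically. The following is a rough sketch, not a validator: it lists fptr/@FILEID values that match no file/@ID, and files that no fptr points to. The function name and the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def check_fptr_references(mets_xml: str) -> tuple[list[str], list[str]]:
    """Return (dangling FILEIDs, unreferenced file IDs); hypothetical helper."""
    root = ET.fromstring(mets_xml)
    file_ids = {
        f.get("ID") for f in root.iter(f"{{{METS_NS}}}file") if f.get("ID")
    }
    referenced = {
        p.get("FILEID") for p in root.iter(f"{{{METS_NS}}}fptr") if p.get("FILEID")
    }
    dangling = sorted(referenced - file_ids)      # fptr points at nothing
    unreferenced = sorted(file_ids - referenced)  # file not listed in structMap
    return dangling, unreferenced


# Made-up sample with one dangling and one unreferenced ID.
SAMPLE_REFS = """<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp ID="grp-1">
      <file ID="file-1"/>
      <file ID="file-2"/>
    </fileGrp>
  </fileSec>
  <structMap>
    <div LABEL="documentation">
      <fptr FILEID="file-1"/>
      <fptr FILEID="file-9"/>
    </div>
  </structMap>
</mets>"""

dangling_ids, unreferenced_ids = check_fptr_references(SAMPLE_REFS)
```

Nothing comparable exists for fileGrp, since mets.xsd offers no ID-typed attribute on div that targets a fileGrp.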

Alternatives for structMap/div: A) folders-only, i.e. no references to fileSec; B) references to fileSec/fileGrp using structMap/div/@CONTENTIDS; C) references to all individual files using structMap/div/fptr.

Each has its strengths and weaknesses. Also, A and B would greatly benefit from structuring fileGrp to mirror the folder structure. Should it be a complete tree of fileGrp elements or just a flat list where each fileGrp describes the files of one folder? The latter case could be facilitated by adding a new attribute fileGrp/@csip:folderName to indicate the folder name (this can be done as fileGrpType has the xs:anyAttribute). Another option would be to introduce reverse referencing by creating fileGrp/@csip:structMapDivID to point to a folder div in the structMap.
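As a thought experiment for the flat-list option, here is a minimal sketch that groups file paths into one fileGrp per folder and tags each group with the proposed folder-name attribute. Note that csip:folderName is only the hypothetical attribute suggested above and is not part of CSIP; the namespace URI used for the csip prefix, the ID scheme and the function name are likewise assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import PurePosixPath

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
# Assumption: namespace URI for the csip prefix; verify against the CSIP schema.
CSIP_NS = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"


def build_flat_filesec(paths: list[str]) -> ET.Element:
    """Build a fileSec with one fileGrp per folder (hypothetical layout)."""
    by_folder: dict[str, list[str]] = defaultdict(list)
    for path in sorted(paths):
        by_folder[str(PurePosixPath(path).parent)].append(path)

    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    file_no = 0
    for grp_no, (folder, files) in enumerate(sorted(by_folder.items()), start=1):
        grp = ET.SubElement(
            file_sec,
            f"{{{METS_NS}}}fileGrp",
            # csip:folderName is the proposed, non-standard attribute.
            {"ID": f"grp-{grp_no}", f"{{{CSIP_NS}}}folderName": folder},
        )
        for path in files:
            file_no += 1
            file_el = ET.SubElement(
                grp, f"{{{METS_NS}}}file", {"ID": f"file-{file_no}"}
            )
            ET.SubElement(
                file_el,
                f"{{{METS_NS}}}FLocat",
                {
                    "LOCTYPE": "URL",
                    f"{{{XLINK_NS}}}type": "simple",
                    f"{{{XLINK_NS}}}href": path,
                },
            )
    return file_sec


flat_file_sec = build_flat_filesec(["data1/a.txt", "data2/b.txt", "data1/c.txt"])
flat_groups = list(flat_file_sec)
```

With this layout a folders-only structMap (alternative A) could be matched to fileGrp elements by folder name alone, whereas a nested fileGrp tree would carry the same information through its hierarchy.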

While making the choice we also need to consider:

Version 2 schedule leaves no time to properly consider all these aspects. So I propose we fix only the obvious mistakes and otherwise leave the current solution as it is. Soon after the release of v.2.0 we should create a task force to develop a complete solution for structMap and fileSec. This should involve analysis of real life IPs from different institutions and prototyping complete IPs for different alternative solutions.

karinbredenberg commented 5 years ago

What hasn't been handled is moved to the next milestone.

carlwilson commented 4 years ago

I feel that what hasn't been handled should be pushed to the next milestone, but that it should get serious consideration then. I think a response now might be rushed, as we're likely to need some good test cases to illustrate all of the issues. In general, I'm against repetition (I tend to regard all repetition as unnecessary) as it leads to internal inconsistency, i.e. chaos.

karinbredenberg commented 4 years ago

This needs to be pushed to the next major version update. Needs more discussion and investigation to see if the concerns have already been handled and if more rewording is needed.