Purpose of CSIP structMap

The explanation of the purpose of structMap in mets.xsd and METSPrimer.pdf is clear and METSPrimer has some good examples of its use (see pages 62 and 65). The purpose of CSIP structMap is less clear and this makes it hard to contextualise the 33 requirements (and thus, to create a valid IP).

The intro text of 5.3.6. "Use of the METS structural map (element structMap)" states:

In CSIP the structMap describes the higher level structure of all the content in the root and may link to representations. /…/

The internal structure of the structural map (expressed by div elements) follows the CSIP high level physical structure as described in Section 4, therefore grouping together metadata, representations, schemas, documentation and user-defined folders into their own div elements; /…/

In case both root and representation METS files exist, the structural map in the root METS file

Reference the fileGrp which describes all files in all folders with the exception of the content of the representation folders

Lists all representations (as separate div elements)

Lists only the appropriate representation METS file using the mptr element as the content of the representation

This can be summed up as: "The Purpose of CSIP structMap is to mirror the physical folder structure of the IP and if representations are present then point to the METS.xml files that describe them." This conclusion is mirrored by the examples:

structMapExample1 in CSIP.xml is pure folder structure
structMapExample2 is folder structure + mptr to the METS.xml that describes the files in specific folders.

But why is it necessary? The same information can be derived by parsing the fileSec/fileGrp/file and mdRef elements of all METS.xml files in the IP. Processing speed could be the added value: the structure can be read quickly from structMap instead of being reverse-engineered from the descriptions of individual files. However, this duplicates information, which creates a risk of conflicting descriptions.
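
To make the "same info can be derived" point concrete, here is a minimal sketch that rebuilds the folder layout purely from FLocat and mdRef hrefs, without looking at structMap. The sample METS is a heavily trimmed, hypothetical document, and the function name `folders_from_mets` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, trimmed root METS used only for illustration.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <dmdSec ID="dmd1">
    <mdRef LOCTYPE="URL" MDTYPE="EAD" xlink:href="metadata/descriptive/ead.xml"/>
  </dmdSec>
  <fileSec>
    <fileGrp ID="grp-doc" USE="Documentation">
      <file ID="f1"><FLocat LOCTYPE="URL" xlink:href="documentation/readme.txt"/></file>
    </fileGrp>
    <fileGrp ID="grp-schemas" USE="Schemas">
      <file ID="f2"><FLocat LOCTYPE="URL" xlink:href="schemas/mets.xsd"/></file>
      <file ID="f3"><FLocat LOCTYPE="URL" xlink:href="schemas/xlink.xsd"/></file>
    </fileGrp>
  </fileSec>
</mets>"""


def folders_from_mets(mets_xml: str) -> dict[str, list[str]]:
    """Group file names by parent folder using only FLocat and mdRef hrefs."""
    root = ET.fromstring(mets_xml)
    folders: dict[str, list[str]] = {}
    for tag in ("FLocat", "mdRef"):
        for element in root.iter(METS + tag):
            href = element.get(XLINK_HREF)
            if href is None:
                continue  # nothing to locate
            path = PurePosixPath(href)
            folders.setdefault(str(path.parent), []).append(path.name)
    return folders


if __name__ == "__main__":
    for folder, names in sorted(folders_from_mets(SAMPLE_METS).items()):
        print(folder, names)
```

For a full IP this would have to be repeated for every representation METS.xml, which is where the processing cost mentioned above comes from.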

There is also a rather loosely worded requirement, "Reference the fileGrp which describes all files in all folders /…/", to be used in the case of representations. It seems mandatory when representations are present, so it should be made an explicit SHOULD rule. Also, if it makes sense for packages with representations, it is equally reasonable to have it for packages without them. There should be clear instructions on where and how to place the fileGrp references.

On second thought, shouldn't the purpose of CSIP structMap be to describe the conceptual, rather than the physical, structure of the package? As the folder structure is no longer mandatory, we might see folder structures like this:

SIP_001/
       |- data1/
       |- data2/
       |- data3/
       |- data4/
       |- metadata/

In such a case, the content files could be spread arbitrarily across the data folders, e.g. to make each data folder fit some size limit. Or they could be grouped by file name, as is often done in UUID-named or sequentially named file and folder structures. An example is the actual structure of MS Outlook 2011 for Mac (the letters seem to mean Trillion, Billion, Million, Kilo, with each folder containing up to 1000 items):

Messages/
        |- 0T/
             |- 0B/
                  |- 0M/
                       |- 6K/
                       |    |- x00_6021.olk14Message
                       |    |- x00_6022.olk14Message
                       |    |- x00_6023.olk14Message
                       |
                       |- 186K/
                       |      |- x00_186029.olk14Message
                       |      |- x00_186030.olk14Message
                       |      |- x00_186031.olk14Message
                       |      |- x00_186032.olk14Message
                       |
                       |- 192K/
                              |- x00_192058.olk14Message
                              |- x00_192059.olk14Message

A conceptual CSIP structMap would make a lot of sense in such cases.
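
As a small worked example of why such a layout carries no conceptual meaning, the Outlook folders above appear to be computable from the message number alone. The sketch below encodes that guess (T, B, M, K as trillions, billions, millions, thousands); the scheme is inferred from the listing, not from any Outlook documentation, and `shard_folder` is a hypothetical helper.

```python
def shard_folder(message_number: int) -> str:
    """Return the assumed shard folder for a message number.

    Assumption: each level holds the next three decimal digits of the
    number, i.e. at most 1000 entries per folder.
    """
    trillions = message_number // 10**12
    billions = (message_number // 10**9) % 1000
    millions = (message_number // 10**6) % 1000
    thousands = (message_number // 10**3) % 1000
    return f"{trillions}T/{billions}B/{millions}M/{thousands}K"


if __name__ == "__main__":
    for number in (6021, 186029, 192058):
        print(number, "->", shard_folder(number))
```

If the guess holds, the physical tree is just an index over a counter, so mirroring it in structMap would tell a reader nothing about mailboxes, threads or attachments.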

Anyway, no matter what the purpose of CSIP structMap, it should be stated clearly, supported by the requirements and illustrated with intuitive examples.

I know it sounds like useless theorising, but the structure-related requirements are currently not clear (we experienced this when creating minimal valid IPs, see https://github.com/DILCISBoard/eark-ip-test-corpus/pull/211 and https://github.com/DILCISBoard/eark-ip-test-corpus/pull/212), and I've got a hunch that the lack of clarity stems from unclear purpose statements.

After analysing it for a few days, the issue of the division of labour between structMap and fileSec seems overwhelming, with too many loose ends.

A good example is referencing. mets.xsd prescribes referencing in only one direction and only at the individual file level: from structMap/div/fptr to fileSec/fileGrp/file. There is no "proper" way to reference a fileGrp (the workaround via structMap/div/@CONTENTIDS involves a type mismatch: CONTENTIDS holds xs:anyURI values, whereas fileGrp/@ID is an xs:ID).

Alternatives for structMap/div: A) folders-only, i.e. no references to fileSec; B) references to fileSec/fileGrp using structMap/div/@CONTENTIDS; C) references to all individual files using structMap/div/fptr.
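
The three alternatives can be told apart mechanically, which may help when comparing real IPs. The following sketch labels each div as A, B or C and reports references that do not resolve. It assumes that, under alternative B, CONTENTIDS contains the bare fileGrp ID values (this is exactly the workaround with the type mismatch noted above, not something mets.xsd defines). The sample document and the name `classify_divs` are hypothetical.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"

# Hypothetical, trimmed METS mixing all three alternatives.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp ID="grp-doc" USE="Documentation"><file ID="f1"/></fileGrp>
    <fileGrp ID="grp-schemas" USE="Schemas"><file ID="f2"/></fileGrp>
  </fileSec>
  <structMap>
    <div LABEL="SIP_001">
      <div LABEL="metadata"/>
      <div LABEL="documentation" CONTENTIDS="grp-doc"/>
      <div LABEL="schemas"><fptr FILEID="f2"/></div>
      <div LABEL="broken" CONTENTIDS="grp-missing"/>
    </div>
  </structMap>
</mets>"""


def classify_divs(mets_xml: str) -> dict[str, tuple[str, list[str]]]:
    """Map each div LABEL to (alternative, list of unresolved references)."""
    root = ET.fromstring(mets_xml)
    group_ids = {g.get("ID") for g in root.iter(METS + "fileGrp")}
    file_ids = {f.get("ID") for f in root.iter(METS + "file")}
    report: dict[str, tuple[str, list[str]]] = {}
    for div in root.iter(METS + "div"):
        content_ids = (div.get("CONTENTIDS") or "").split()
        fptr_ids = [p.get("FILEID") for p in div.findall(METS + "fptr")]
        if fptr_ids:  # C: per-file references
            kind, dangling = "C", [i for i in fptr_ids if i not in file_ids]
        elif content_ids:  # B: fileGrp references via CONTENTIDS (assumed form)
            kind, dangling = "B", [i for i in content_ids if i not in group_ids]
        else:  # A: folder only, nothing to resolve
            kind, dangling = "A", []
        report[div.get("LABEL")] = (kind, dangling)
    return report


if __name__ == "__main__":
    for label, (kind, dangling) in classify_divs(SAMPLE).items():
        print(label, kind, dangling)
```

Note that A can never produce a dangling reference, which is its main strength, while B and C both need a consistency check of this kind.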

Each has its strengths and weaknesses. Also, A and B would greatly benefit from structuring fileGrp to mirror the folder structure. Should it be a complete tree of fileGrp elements or just a flat list where each fileGrp describes the files of one folder? The latter could be facilitated by adding a new attribute fileGrp/@csip:folderName to indicate the folder name (this is possible because fileGrpType allows xs:anyAttribute). Another option would be to introduce reverse referencing by creating fileGrp/@csip:structMapDivID to point to a folder div in the structMap.
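
A rough sketch of the flat-list variant, to show what it would look like in practice: one fileGrp per folder, each carrying the proposed csip:folderName attribute. The attribute does not exist in CSIP today; it is only the proposal from the paragraph above. The csip namespace URI used here is an assumption, and `build_flat_filesec` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
CSIP_NS = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"  # assumed namespace URI

ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("csip", CSIP_NS)


def build_flat_filesec(paths: list[str]) -> ET.Element:
    """Build a fileSec with one fileGrp per folder (flat list, no nesting)."""
    filesec = ET.Element(f"{{{METS_NS}}}fileSec")
    groups: dict[str, ET.Element] = {}
    for number, path in enumerate(sorted(paths), start=1):
        folder = str(PurePosixPath(path).parent)
        group = groups.get(folder)
        if group is None:
            group = ET.SubElement(
                filesec,
                f"{{{METS_NS}}}fileGrp",
                {
                    "ID": f"grp-{len(groups) + 1}",
                    # Proposed attribute, not part of CSIP.
                    f"{{{CSIP_NS}}}folderName": folder,
                },
            )
            groups[folder] = group
        file_el = ET.SubElement(group, f"{{{METS_NS}}}file", {"ID": f"file-{number}"})
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
        )
    return filesec


if __name__ == "__main__":
    demo = build_flat_filesec(
        ["schemas/mets.xsd", "documentation/readme.txt", "schemas/xlink.xsd"]
    )
    ET.indent(demo)  # Python 3.9+
    print(ET.tostring(demo, encoding="unicode"))
```

With this shape, a folders-only structMap (alternative A) can be matched to fileSec by comparing folder names, without any ID references.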

While making the choice we also need to consider:

The differences between root METS.xml and representation METS.xml;
Should the representations be completely independent or should they be "aware" of the package they are in (i.e. should representation METS.xml contain references to the IP);
Where to base the folder/file paths: parent folder of the IP (e.g. the path to representation METS file could be SIP_001/representations/rep1/METS.xml) or the root folder of the IP (e.g. representations/rep1/METS.xml), and what about the paths in representation METS.xml;
The differences in case of segmentation;
Performance issues of the alternatives (see #83 for a METS.xml sample with comments from @andersbonielsen);
Is it feasible to strictly prescribe the usage model of fileGrp;
How to handle flat IPs (some reviewers of v.2.0 draft pointed out that they don't use any folder structure).
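
To illustrate the path-base question from the list above, this small sketch prints the same representation METS file relative to the IP root folder and relative to its parent folder. The absolute location `/archive/SIP_001` and the helper name `mets_paths` are made up for the example.

```python
from pathlib import PurePosixPath


def mets_paths(ip_root: str, target: str) -> dict[str, str]:
    """Return the target path under both candidate base conventions."""
    root = PurePosixPath(ip_root)
    file_path = PurePosixPath(target)
    return {
        # Base = root folder of the IP
        "relative_to_ip_root": str(file_path.relative_to(root)),
        # Base = parent folder of the IP, so the IP folder name is included
        "relative_to_ip_parent": str(file_path.relative_to(root.parent)),
    }


if __name__ == "__main__":
    paths = mets_paths(
        "/archive/SIP_001",
        "/archive/SIP_001/representations/rep1/METS.xml",
    )
    for convention, value in paths.items():
        print(convention, "->", value)
```

The second convention breaks as soon as the IP folder is renamed, which is one argument for basing paths on the IP root.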

The version 2 schedule leaves no time to properly consider all these aspects. So I propose we fix only the obvious mistakes and otherwise leave the current solution as it is. Soon after the release of v.2.0 we should create a task force to develop a complete solution for structMap and fileSec. This should involve analysing real-life IPs from different institutions and prototyping complete IPs for the alternative solutions.

What hasn't been handled is moved to the next milestone.

I agree that what hasn't been handled should be pushed to the next milestone, provided it gets serious consideration then. I think a response now might be rushed, as we're likely to need some good test cases to illustrate all of the issues. In general, I'm against repetition (I tend to regard all repetition as unnecessary) as it leads to internal inconsistency, i.e. chaos.

This needs to be pushed to the next major version update. Needs more discussion and investigation to see if the concerns have already been handled and if more rewording is needed.
