Closed koit closed 4 years ago
After analysing it for a few days, the issue of structMap and fileSec labour division seems overwhelming, with too many loose ends.
A good example is referencing. mets.xsd prescribes referencing in only one direction and only at the individual file level: from structMap/div/fptr to fileSec/fileGrp/file. There is no "proper" way to reference a fileGrp (the workaround via structMap/div/@CONTENTIDS involves a type mismatch of xs:anyURI vs xs:ID).
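To make the one-directional referencing concrete, here is a minimal Python sketch that follows each structMap fptr to its file in the fileSec and reports pointers that do not resolve. The METS fragment, IDs and file names are made up for illustration; only the fptr/@FILEID to file/@ID mechanism comes from mets.xsd.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical minimal METS fragment; all IDs and paths are illustrative only.
SAMPLE = """\
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp ID="grp-docs" USE="Documentation">
      <file ID="file-1"><FLocat LOCTYPE="URL" xlink:href="documentation/readme.txt"/></file>
      <file ID="file-2"><FLocat LOCTYPE="URL" xlink:href="documentation/guide.pdf"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div LABEL="documentation">
      <fptr FILEID="file-1"/>
      <fptr FILEID="file-3"/>
    </div>
  </structMap>
</mets>
"""


def resolve_fptrs(root):
    """Split the FILEID values of all fptr elements into resolved and dangling."""
    file_ids = {f.get("ID") for f in root.iterfind(".//mets:file", NS)}
    resolved, dangling = [], []
    for fptr in root.iterfind(".//mets:fptr", NS):
        target = fptr.get("FILEID")
        (resolved if target in file_ids else dangling).append(target)
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = resolve_fptrs(root)
print("resolved:", resolved)
print("dangling:", dangling)
```

Note that nothing comparable can be written for fileGrp without the CONTENTIDS workaround, because fptr can only carry the ID of a file.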
Alternatives for structMap/div:
A) folders-only, i.e. no references to fileSec;
B) references to fileSec/fileGrp using structMap/div/@CONTENTIDS;
C) references to all individual files using structMap/div/fptr.
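As a rough illustration of how the three alternatives differ for a single folder, the following Python sketch builds one div per alternative. The function name folder_div, the IDs and the label are hypothetical, and the METS namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def folder_div(label, alternative, file_grp_id=None, file_ids=()):
    """Build one structMap div for a folder under alternative "A", "B" or "C"."""
    if alternative not in ("A", "B", "C"):
        raise ValueError(f"unknown alternative: {alternative!r}")
    div = ET.Element("div", {"LABEL": label})
    if alternative == "B":
        # Workaround from the discussion: CONTENTIDS is URI-typed, not an IDREF.
        div.set("CONTENTIDS", file_grp_id)
    elif alternative == "C":
        # One fptr per individual file, each pointing at a file/@ID.
        for file_id in file_ids:
            ET.SubElement(div, "fptr", {"FILEID": file_id})
    # Alternative A adds nothing: the div only names the folder.
    return div


for alt in "ABC":
    div = folder_div("documentation", alt, "grp-docs", ["file-1", "file-2"])
    print(alt, ET.tostring(div, encoding="unicode"))
```

The sketch shows the trade-off in size: A and B stay constant per folder, while C grows with the number of files.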
Each has its strengths and weaknesses. Also, A and B would greatly benefit from structuring fileGrp to mirror the folder structure. Should it be a complete tree of fileGrp elements or just a flat list where each fileGrp describes the files of one folder? The latter case could be facilitated by adding a new attribute fileGrp/@csip:folderName to indicate the folder name (this can be done as fileGrpType has the xs:anyAttribute). Another option would be to introduce reverse referencing by creating fileGrp/@csip:structMapDivID to point to a folder div in the structMap.
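The flat-list option could look roughly like the Python sketch below, which groups file paths by folder and emits one fileGrp per folder. The csip:folderName attribute is only the proposal made above and does not exist in the current schema; the CSIP namespace URI, the ID scheme and the sample paths are likewise assumptions.

```python
import posixpath
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
# Assumed namespace URI for the csip prefix.
CSIP = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"
# Proposed attribute from the discussion, not part of the current schema.
FOLDER_NAME = f"{{{CSIP}}}folderName"


def flat_file_sec(paths):
    """Build a fileSec with one fileGrp per folder, labelled by folder name."""
    by_folder = defaultdict(list)
    for path in paths:
        by_folder[posixpath.dirname(path)].append(path)

    file_sec = ET.Element(f"{{{METS}}}fileSec")
    counter = 0
    for folder in sorted(by_folder):
        grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", {FOLDER_NAME: folder})
        for path in by_folder[folder]:
            counter += 1
            file_el = ET.SubElement(grp, f"{{{METS}}}file", {"ID": f"file-{counter}"})
            ET.SubElement(
                file_el,
                f"{{{METS}}}FLocat",
                {"LOCTYPE": "URL", f"{{{XLINK}}}href": path},
            )
    return file_sec


# Register readable prefixes so the serialised output uses mets:, xlink: and csip:.
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("csip", CSIP)):
    ET.register_namespace(prefix, uri)

file_sec = flat_file_sec(
    ["documentation/readme.txt", "schemas/mets.xsd", "documentation/guide.pdf"]
)
print(ET.tostring(file_sec, encoding="unicode"))
```

A complete tree of fileGrp elements would need nesting logic on top of this, which is part of why the flat list is the simpler of the two options.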
While making the choice we also need to consider:
SIP_001/representations/rep1/METS.xml) or the root folder of the IP (e.g. representations/rep1/METS.xml), and what about the paths in representation METS.xml;
fileGrp;
Version 2 schedule leaves no time to properly consider all these aspects. So I propose we fix only the obvious mistakes and otherwise leave the current solution as it is. Soon after the release of v.2.0 we should create a task force to develop a complete solution for structMap and fileSec. This should involve analysis of real life IPs from different institutions and prototyping complete IPs for different alternative solutions.
What hasn't been handled is moved to the next milestone.
I feel that what hasn't been handled should be pushed to the next milestone, provided it gets serious consideration then. I think a response now might be rushed, as we're likely to need some good test cases to illustrate all of the issues. In general, I'm against repetition (I tend to regard all repetition as unnecessary) as it leads to internal inconsistency, i.e. chaos.
This needs to be pushed to the next major version update. Needs more discussion and investigation to see if the concerns have already been handled and if more rewording is needed.
The explanation of the purpose of structMap in mets.xsd and METSPrimer.pdf is clear and METSPrimer has some good examples of its use (see pages 62 and 65). The purpose of CSIP structMap is less clear and this makes it hard to contextualise the 33 requirements (and thus, to create a valid IP).
The intro text of 5.3.6. "Use of the METS structural map (element structMap)" states:
This can be summed up as: "The Purpose of CSIP structMap is to mirror the physical folder structure of the IP and if representations are present then point to the METS.xml files that describe them." This conclusion is mirrored by the examples:
But why is it necessary? The same info can be derived by parsing the fileSec/fileGrp/file and mdRef elements of all METS.xml files in the IP. Processing speed could be the added value here, as the structure can be quickly read from structMap, compared to the effort of reverse engineering the structure from the descriptions of individual files. However, there is duplication here, so a risk of conflicting descriptions.
There is also a rather softly posed requirement "Reference the fileGrp which describes all files in all folders /…/" to be used in the case of representations. It seems mandatory when representations are present, so it should be made an explicit SHOULD rule. Also, if it makes sense for representations, it is equally reasonable to have it for non-rep cases, too. There should be clear instructions on where and how to place the fileGrp references.
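The duplication risk can be shown with a small Python sketch that derives the folder set twice, once from the file paths in the fileSec and once from the div hierarchy in the structMap, and reports where they disagree. It assumes that the top-level div stands for the package root and that each nested div/@LABEL equals a folder name; both are assumptions for illustration, as are the sample fragment and its names.

```python
import posixpath
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical fragment in which the two descriptions have drifted apart.
SAMPLE = """\
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp USE="Documentation">
      <file ID="file-1"><FLocat LOCTYPE="URL" xlink:href="documentation/readme.txt"/></file>
    </fileGrp>
    <fileGrp USE="Schemas">
      <file ID="file-2"><FLocat LOCTYPE="URL" xlink:href="schemas/mets.xsd"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div LABEL="SIP_001">
      <div LABEL="documentation"/>
      <div LABEL="metadata"/>
    </div>
  </structMap>
</mets>
"""


def folders_from_file_sec(root):
    """Collect every folder (and parent folder) implied by FLocat paths."""
    folders = set()
    for flocat in root.iterfind(".//mets:FLocat", NS):
        folder = posixpath.dirname(flocat.get(XLINK_HREF, ""))
        while folder:
            folders.add(folder)
            folder = posixpath.dirname(folder)
    return folders


def folders_from_struct_map(root):
    """Collect folder paths from nested divs below the top-level (root) div."""
    folders = set()

    def walk(div, prefix):
        for child in div.findall("mets:div", NS):
            path = posixpath.join(prefix, child.get("LABEL"))
            folders.add(path)
            walk(child, path)

    for top in root.iterfind(".//mets:structMap/mets:div", NS):
        walk(top, "")
    return folders


root = ET.fromstring(SAMPLE)
from_files = folders_from_file_sec(root)
from_divs = folders_from_struct_map(root)
print("only in fileSec:", sorted(from_files - from_divs))
print("only in structMap:", sorted(from_divs - from_files))
```

A check like this only works if the purpose of the structMap is to mirror the physical structure; under a conceptual structMap the two sets would be expected to differ.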
On second thought, shouldn't the purpose of CSIP structMap be to describe the conceptual, rather than the physical structure of the package? As the folder structure is not mandatory any more, we might see folder structures like this:
In such a case, the content files could be randomly spread into the data folders, e.g. to make the data folders fit some size limit. Or grouped by file name, as is often done in uuid-named or sequentially named file and folder structures, e.g. the actual structure of MS Outlook 2011 for Mac (the letters seem to mean Trillion, Billion, Million, Kilo, each folder containing up to 1000 items):
A conceptual CSIP structMap would make a lot of sense in such cases.
Anyway, no matter what the purpose of CSIP structMap, it should be stated clearly, supported by the requirements and illustrated with intuitive examples.
I know it sounds like useless theorising, but structure-related requirements are currently not clear (we experienced this when creating minimal valid IPs, see https://github.com/DILCISBoard/eark-ip-test-corpus/pull/211 and https://github.com/DILCISBoard/eark-ip-test-corpus/pull/212), and I've got a hunch that the lack of clarity might stem from unclear purpose statements.