DILCISBoard / E-ARK-CSIP

E-ARK Common Specification for Information Packages
http://earkcsip.dilcis.eu
Creative Commons Attribution 4.0 International

CSIP1 OBJID can have illegal characters for a folder name #700

Open luis100 opened 1 year ago

luis100 commented 1 year ago

An OBJID is a string, but not every character is allowed in a folder name, for example: \/:

At the same time, an OBJID can carry meaning in the production system, where it would be good to keep the same characters.

In commons-ip, we have worked around this situation by allowing any OBJID, but then replacing any illegal characters with _.

But this conflicts with CSIP1, which states: "For the package METS document, this should be the name/ID of the package, i.e. the name of the package root folder."

We need an exception or approach for illegal folder characters, like replacing or encoding these characters.
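A minimal sketch of the replace-with-underscore workaround described above, to make the conflict with CSIP1 concrete. The exact character set used by commons-ip is not given in this thread; the set below (characters Windows rejects in file names, plus ASCII control characters) and the function name `sanitize_folder_name` are assumptions for illustration only.

```python
import re

# Assumed set of illegal characters: \ / : * ? " < > | and ASCII control characters.
# This is NOT taken from commons-ip; it is a hypothetical choice for the sketch.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_folder_name(objid: str) -> str:
    """Replace every character matched by _ILLEGAL with an underscore."""
    return _ILLEGAL.sub("_", objid)


objid = "urn:example/sip\\001"
folder = sanitize_folder_name(objid)
print(folder)           # the folder name that would be created
print(folder == objid)  # False: the root folder name no longer equals the OBJID
```

The replacement is lossy: under this sketch `a:b` and `a/b` both become `a_b`, so the original OBJID cannot be recovered from the folder name.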

luis100 commented 1 year ago

For reference, a list of characters and reserved file names that should be avoided to stay compatible across operating systems: https://stackoverflow.com/a/31976060/10386423

luis100 commented 1 year ago

Note that there are many other illegal folder names, for example a name consisting only of spaces. I recommend looking into a solution that covers all possible cases, like URL encoding the OBJID in the folder name.

luis100 commented 1 year ago

In commons-ip we have gone ahead and now use URL encoding for illegal folder characters in the top directory, but this may create SIPs that do not pass validation with other validators. Guidance in the specification is needed to ensure all validators treat these cases in a similar way.
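A rough sketch of the URL-encoding approach, using Python's standard `urllib.parse` rather than the actual commons-ip code (which is not shown in this thread). The function names are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_folder_name(objid: str) -> str:
    """Percent-encode the OBJID so it can be used as a folder name."""
    # safe="" makes quote() encode "/" as well; by default it is left untouched.
    return quote(objid, safe="")


def decode_folder_name(folder_name: str) -> str:
    """Recover the original OBJID from an encoded folder name."""
    return unquote(folder_name)


objid = "a/b:c\\d"
folder = encode_folder_name(objid)
print(folder)
print(decode_folder_name(folder) == objid)  # the mapping is reversible
```

Unlike underscore replacement, this mapping is reversible, so a validator could decode the folder name and compare it with the OBJID. It does not cover every case on its own: `quote()` never encodes letters, digits, or `_.-~`, so names such as `.`, `..`, or reserved names like `CON` pass through unchanged and would need separate handling.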

prettybits commented 1 year ago

It seems to me like this should be mostly a matter of being more careful in the wording of CSIP1, mainly the second half of its description ("i.e. the name of the package root folder."). The related structural requirement CSIPSTR2 has a cardinality of SHOULD, which would allow for the information package root folder name to differ from the package identifier. I believe the intention here is that both should be the same if possible, but file system limitations are a prominent reason why that isn't always possible.

A validator should therefore not treat this as an error, allowing in principle for differing approaches to filename sanitization where needed.

Offering guidance on this in the spec would probably be a good addition. The description for CSIP1 should be adjusted, e.g. by adding a small qualifier like "commonly" and referring to CSIPSTR2?
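The lenient validator behaviour suggested above could look roughly like the following sketch. It is not taken from any existing validator; the `Finding` type, the severity label, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    severity: str  # hypothetical label, e.g. "WARNING"
    message: str


def check_root_folder_name(objid: str, folder_name: str) -> Optional[Finding]:
    """Compare the METS OBJID with the package root folder name.

    Because CSIPSTR2 is a SHOULD requirement, a mismatch is reported as a
    warning and never as an error in this sketch.
    """
    if folder_name == objid:
        return None
    return Finding(
        severity="WARNING",
        message=(
            f"Root folder name {folder_name!r} differs from OBJID {objid!r}; "
            "this may be the result of file name sanitization."
        ),
    )


print(check_root_folder_name("a:b", "a_b"))
```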

shsdev commented 9 months ago

I discussed this with @prettybits and suggested that a character mapping could be recommended, such as the one included in the Pairtree specification, section 3 "Identifier string cleaning". Relevant parts of this character mapping approach could be recommended to avoid characters which are problematic on specific file systems, e.g., after downloading the package. But this would only be a recommendation: since CSIPSTR2 is a SHOULD requirement, it would still be allowed to use a folder name which differs from the identifier.
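For illustration, a sketch of the two-step Pairtree cleaning referred to above: first hex-encode a small set of problematic characters (and anything outside visible ASCII) as `^hh`, then swap `/`, `:` and `.` for single replacement characters. The character sets below follow the usual summary of that section and should be checked against the Pairtree specification before being relied on; the function name is hypothetical.

```python
# Step 1: characters that are hex-encoded as ^hh, in addition to any UTF-8
# byte outside the visible ASCII range 0x21-0x7e.
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Step 2: single-character substitutions applied after step 1.
_SWAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_clean(identifier: str) -> str:
    """Map an identifier to a file-system-friendly string, Pairtree style."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAP)


print(pairtree_clean("ark:/13030/xt12t3"))
```

Because step 1 encodes the characters that step 2 produces (`=`, `+`, `,`), the mapping can be reversed, which underscore replacement cannot offer.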

karinbredenberg commented 7 months ago

The suggestion is:

Board members' acknowledgment of the issue: Tick the box in front of your name to indicate that you have looked at the suggestion.

Voting (Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair)

Tick the box in front of your name to say yes to the suggestion.

karinbredenberg commented 6 months ago

7 DILCIS Board members have acknowledged the issue. 7 DILCIS Board members agree with the solution.

The PR will be part of the next release of the specifications.