Open luis100 opened 1 year ago
Reference listing the characters and reserved file names that should be avoided to keep compatibility across several operating systems: https://stackoverflow.com/a/31976060/10386423
Note that there are many other folder names that are illegal, for example a name consisting only of spaces. I recommend looking into a solution that would cover all possible cases, like URL encoding the OBJID in the folder name.
In commons-ip we have gone ahead and used URL encoding for illegal folder-name characters in the top-level directory, but this may create SIPs that would not pass validation with other validators. Guidance in the specification is needed to ensure all validators treat these cases in a similar way.
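To make the URL-encoding idea concrete, here is a minimal Python sketch. It is not the commons-ip implementation; the function names `objid_to_folder_name` and `folder_name_to_objid` are hypothetical and only illustrate that the mapping is reversible.

```python
from urllib.parse import quote, unquote


def objid_to_folder_name(objid: str) -> str:
    """Percent-encode every character outside A-Z a-z 0-9 _ . - ~ (hypothetical helper)."""
    # safe="" is needed because quote() leaves "/" unencoded by default.
    return quote(objid, safe="")


def folder_name_to_objid(folder_name: str) -> str:
    """Reverse the encoding to recover the original OBJID (hypothetical helper)."""
    return unquote(folder_name)


encoded = objid_to_folder_name("urn:uuid:1234/abc")
print(encoded)                        # urn%3Auuid%3A1234%2Fabc
print(folder_name_to_objid(encoded))  # urn:uuid:1234/abc
```

Percent-encoding alone does not cover every case raised in this thread: dots are left untouched, so names such as `.` or `..` would still need separate handling, as would reserved device names on Windows.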
It seems to me like this should be mostly a matter of being more careful in the wording of CSIP1, mainly the second half of its description ("i.e. the name of the package root folder."). The related structural requirement CSIPSTR2 has a cardinality of SHOULD, which would allow for the information package root folder name to differ from the package identifier. I believe the intention here is that both should be the same if possible, but file system limitations are a prominent reason why that isn't always the case.
A validator should therefore not treat this as an error, allowing in principle for differing approaches to filename sanitization where needed.
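As a rough illustration of that behaviour, a validator check could report a mismatch at warning level rather than failing the package. This is only a sketch in Python; `Severity` and `check_root_folder_name` are hypothetical names and do not correspond to any existing validator's API.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"


def check_root_folder_name(objid: str, root_folder_name: str) -> Severity:
    """Compare the METS OBJID with the package root folder name (hypothetical check)."""
    # CSIPSTR2 is a SHOULD requirement, so a differing name is reported
    # as a warning and never as a validation error.
    if root_folder_name == objid:
        return Severity.OK
    return Severity.WARNING


print(check_root_folder_name("pkg-001", "pkg-001"))          # Severity.OK
print(check_root_folder_name("urn:uuid:1", "urn%3Auuid%3A1"))  # Severity.WARNING
```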
Offering guidance on this in the spec would probably be a good addition. The description for CSIP1 should be adjusted, e.g. by adding a small qualifier like "commonly" and referring to CSIPSTR2?
I discussed this with @prettybits and suggested that a character mapping could be recommended, such as the one included in the Pairtree specification, section 3 "Identifier string cleaning". Relevant parts of this character mapping approach could be recommended to avoid characters which are problematic on specific file systems, e.g., after downloading the package. But this would only be a recommendation: since this is a SHOULD requirement, it would in any case be allowed to use a folder name that differs from the identifier.
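For reference, the Pairtree cleaning step can be sketched in a few lines of Python. This follows my reading of section 3 of the Pairtree specification (hex-encode a small set of characters as `^hh`, then map `/` to `=`, `:` to `+` and `.` to `,`); the function name `pairtree_clean` is hypothetical, and the exact character set should be checked against the specification before relying on it.

```python
# Characters that are hex-encoded as "^hh" in the first cleaning step
# (assumed from the Pairtree specification, section 3).
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied in the second step.
_SINGLE_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_clean(identifier: str) -> str:
    """Return a file-system-friendly form of an identifier (hypothetical helper)."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Step 1: hex-encode non-visible ASCII, non-ASCII bytes and the listed characters.
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    # Step 2: replace the remaining "/", ":" and "." with "=", "+" and ",".
    return "".join(out).translate(_SINGLE_CHAR_MAP)


print(pairtree_clean("ark:/13030/xt12t3"))  # ark+=13030=xt12t3
```

Because the first step already encodes any literal `=`, `+` and `,`, the second step stays unambiguous and the mapping can be reversed.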
The suggestion is:
Board members' acknowledgment of the issue: Tick the box in front of your name to indicate that you have looked at the suggestion.
Voting (Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair)
Tick the box in front of your name to say yes to the suggestion.
7 DILCIS Board members have acknowledged the issue. 7 DILCIS Board members agree with the solution.
The PR will be part of the next release of the specifications.
An OBJID is a string, but not all characters can be used in a folder name, for example: `\/:`
But, simultaneously, OBJID can have meaning in the production system, where it would be good to maintain the same characters.
In commons-ip, we have worked around this situation by allowing any OBJID, but then replacing any illegal characters with `_`. But this conflicts with CSIP1, where:
"For the package METS document, this should be the name/ID of the package, i.e. the name of the package root folder."
We need an exception or approach for illegal folder characters, like replacing or encoding these characters.
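The replacement workaround described above can be sketched as follows. This is not the commons-ip code; the set of illegal characters is an assumption (the characters Windows rejects in file names plus ASCII control characters), and `replace_illegal` is a hypothetical name.

```python
import re

# Assumed set of illegal characters: < > : " / \ | ? * and ASCII control characters.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def replace_illegal(objid: str, replacement: str = "_") -> str:
    """Replace every illegal character with a fixed replacement (hypothetical helper)."""
    return _ILLEGAL.sub(replacement, objid)


print(replace_illegal("urn:uuid:12\\34"))  # urn_uuid_12_34
print(replace_illegal("a:b"))              # a_b
print(replace_illegal("a/b"))              # a_b
```

The last two calls show the drawback of plain replacement: distinct OBJIDs can collapse to the same folder name, and the original OBJID cannot be recovered from the folder name. A reversible encoding avoids both problems.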