Please describe the problem you'd like to be solved
I started this discussion under Archivematica #230, but it is really a separate request. Even if filenames with diacritics and other non-ASCII characters are considered safe in the future, there may still be a need to transform filenames, to avoid issues of the type discussed by Wheeler.
Describe the solution you'd like to see implemented
Here is a simple function in Python 3, using the third-party regex library (which allows character classes in regexes to be defined by Unicode character properties). It could serve as a prototype for a replacement of the sanitize_name function. It replaces control characters and a leading dash with underscores, but keeps e.g. diacritics.
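# Third-party "regex" package (not the stdlib "re"); it supports \p{...} classes.
import regex as re

def new_change_basename(basename):
    # Match any control character (Unicode category Cc) or a dash at the start.
    NON_ALLOWED_CHARS = re.compile(r"[\p{Cc}]|^-")
    return NON_ALLOWED_CHARS.sub("_", basename)

fname = "-\tLagerlöf-dagböcker\n"
print(new_change_basename(fname))
# __Lagerlöf-dagböcker_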
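If adding the regex dependency is a concern, roughly the same behaviour might be obtained with the standard library alone. The following is only a sketch under that assumption: the name change_basename_stdlib is hypothetical and is not part of Archivematica. It uses unicodedata.category to detect control characters (category Cc) instead of the \p{Cc} class.

```python
import unicodedata


def change_basename_stdlib(basename):
    """Hypothetical stdlib-only variant of new_change_basename."""
    # Replace every character in Unicode general category Cc (control) with "_".
    cleaned = "".join(
        "_" if unicodedata.category(ch) == "Cc" else ch
        for ch in basename
    )
    # Replace a single leading dash, mirroring the "^-" alternative in the regex.
    if cleaned.startswith("-"):
        cleaned = "_" + cleaned[1:]
    return cleaned


if __name__ == "__main__":
    print(change_basename_stdlib("-\tLagerlöf-dagböcker\n"))
```

For the sample filename above this should give the same result as the regex version. Dashes elsewhere in the name and non-ASCII letters are left untouched.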
Describe alternatives you've considered
This type of transformation may not be needed at all in the future if stricter rules are enforced at the OS level, as suggested by Wheeler.
Additional context
This should probably not be implemented before Archivematica is re-implemented in Python 3.
For Artefactual use:
Before you close this issue, you must check off the following:
[ ] All pull requests related to this issue are properly linked
[ ] All pull requests related to this issue have been merged
[ ] A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
[ ] Documentation regarding this issue has been written and merged
[ ] Details about this issue have been added to the release notes