archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Archivematica doesn't preserve diacritics in filenames #1084

Open klpn opened 4 years ago

klpn commented 4 years ago

Please describe the problem you'd like to be solved

I started this discussion under Archivematica #230, but it is more of a separate request. Even if filenames with diacritics and other non-ASCII characters are considered safe in the future, some filename transformation may still be needed to avoid issues of the type discussed by Wheeler.

Describe the solution you'd like to see implemented

Here is a simple Python 3 function that might be used as a prototype for a replacement of the sanitize_name function. It uses the third-party regex library, which allows character classes in regexes to be defined by Unicode character properties. It replaces control characters and a leading dash with an underscore, but keeps e.g. diacritics.

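import regex as re  # third-party "regex" package, not the standard-library "re"

# Matches any control character (Unicode category Cc) or a dash at the very start.
NON_ALLOWED_CHARS = re.compile(r"[\p{Cc}]|^-")

def new_change_basename(basename):
    # Replace each match with an underscore; everything else is kept as-is.
    return NON_ALLOWED_CHARS.sub("_", basename)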

fname = "-\tLagerlöf-dagböcker\n"
print(new_change_basename(fname))
# __Lagerlöf-dagböcker_
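
If depending on the third-party regex package is a concern, the same rule can probably be expressed with the standard library alone. The following is only a sketch under that assumption: the name change_basename_stdlib is hypothetical and not part of Archivematica, and it relies on unicodedata.category to identify control characters (category Cc) instead of a \p{Cc} character class.

```python
import unicodedata


def change_basename_stdlib(basename):
    """Hypothetical stdlib-only variant of the prototype above."""
    # Replace every control character (Unicode general category Cc) with "_".
    cleaned = "".join(
        "_" if unicodedata.category(ch) == "Cc" else ch for ch in basename
    )
    # Replace a single leading dash, mirroring the "^-" branch of the regex.
    if cleaned.startswith("-"):
        cleaned = "_" + cleaned[1:]
    return cleaned


print(change_basename_stdlib("-\tLagerlöf-dagböcker\n"))
# __Lagerlöf-dagböcker_
```

Like the regex version, this only replaces the first dash of a name such as "--x", and it leaves diacritics and other non-ASCII letters untouched.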

Describe alternatives you've considered

This type of transformation may not be needed at all in the future if stricter rules are enforced at the OS level, as suggested by Wheeler.

Additional context

This should probably not be implemented before Archivematica is re-implemented in Python 3.



sromkey commented 4 years ago

@klpn thank you for this. I hope you don't mind that I adjusted the issue title to reflect the problem at hand.