digipres / sentinel

The Sentinel watches various data sources and updates digipres.org
Apache License 2.0

Validate file extensions using each source's 'native' encoding. #5

Closed: anjackson closed this issue 10 years ago

anjackson commented 10 years ago

The warnings from some of the sources are misleading, because they normalise the extensions before validating them, and not every source uses a compatible syntax.

e.g. FFW uses "command-line shell" (glob) syntax, where ? matches any single character, * matches an arbitrary sequence of characters, and ! and $ are just literals.

Tika, by contrast, uses ^ and $ as start and end markers.

So it would be better to perform the validation in the per-source code, and apply less stringent validation to the normalised form.

anjackson commented 10 years ago

Actually, this seems to work out okay. With a tiny bit of transformation, the different registries can all be mapped to a limited glob syntax (only ? and * are special, and not every character is allowed). A few minor tweaks to make the validation regex more permissive, and all seems well.
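
For anyone reading later, here is a rough sketch of the idea, not the code that actually went into Sentinel. The function names (`from_tika`, `from_shell`, `is_valid_glob`), the anchor-stripping rule, and the set of allowed literal characters are all assumptions made for illustration.

```python
import re

# Hypothetical permissive check for the normalised form. Only '?' and '*'
# are treated as special; the set of allowed literals is an assumption.
GLOB_RE = re.compile(r"[A-Za-z0-9?*._+!$~-]+")


def from_tika(pattern: str) -> str:
    """Assumed transformation: drop a leading '^' and a trailing '$' anchor."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$"):
        pattern = pattern[:-1]
    return pattern


def from_shell(pattern: str) -> str:
    """Shell-style patterns already use '?' and '*', so pass them through."""
    return pattern


def is_valid_glob(pattern: str) -> bool:
    """True if the whole pattern uses only the allowed glob characters."""
    return GLOB_RE.fullmatch(pattern) is not None
```

Under those assumptions, `is_valid_glob(from_tika("^*.txt$"))` passes, while the raw `"^*.txt$"` is rejected because `^` is not in the allowed set. A shell-style pattern such as `"*.ba?"` passes unchanged.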