LibraryOfCongress / bagit-spec

8 stars 7 forks source link

Only CR, LF and % are %-escaped #17

Closed stain closed 6 years ago

stain commented 6 years ago

tl;dr

Only do RFC3986-style %-escaping of CR, LF and (new) %.

Discussion point

In the current text there is a discrepancy, the ABNF proposes only a limited set of sub-delims, e.g. # ^ or æ are not allowed, with pct-encoding suggesting a generic RFC3986 URI byte encoding (presumably in UTF8?).

However the current spec only seems to permit %-encoding if there is a CR/LF in the filepath is, in which case the whole filepath is escaped. I get the feeling the intention was that only those particular characters were to be %-encoded.

We either need to allow escaping of any characters (in which case % must always be escaped, so consumers can know if a filepath string is escaped or not), or only have a limited search-replace escaping of particular control characters - which for some reason is limited to just CR LF (what about \t ?).

But even in the second case (which this PR is suggesting), a filename that on disk is called fred%0a.txt would be impossible to capture in bagit as in the manifest it would be indistinguishable from a filename fred\n.txt -- and so also here the % should be escaped as %25.

I believe we want to stay closer to the md5sum output (as referenced in the spec) than the more general URI format, so say ` should not be escaped to%20` - although that would be a more predictable URI-style escaping that actually follows RFC3986.

Also I don't think we want to suggest that all non-ASCII characters should be %-encoded, which is what RFC3986 would say - rather they should be represented directly in the declared Tag-File-Character-Encoding.

justinlittman commented 6 years ago

Concur.

stain commented 6 years ago

Good, that makes some things easier.

What about the ABNF for manifests and fetch.txt, it is still quite restrictive on the filename:

filename              = "data/" 
                        1*( unreserved / pct-encoded / sub-delims )
unreserved            = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims            = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
                        "*" / "+" / "," / ";" / "=" / "/"   

I would think given Tag-File-Character-Encoding then any character from that (except CR, LF and %) would be allowed. Something similar to non-reserved later?

non-reserved  = VCHAR / WSP 
                ; any valid character for the specific encoding 
                ; except those that match "ending"

I think addressing that goes for a different PR though once this one is clear.

johnscancella commented 6 years ago

@stain once you fix the merge conflict I will merge this. Thanks!

johnscancella commented 6 years ago

fixed merge conflict and merged manually