LibraryOfCongress / bagit-spec

8 stars 7 forks source link

Some feedback on the bagit1.0 branch #3

Closed jrwdunham closed 6 years ago

jrwdunham commented 6 years ago

This issue contains feedback on the bagit1.0 branch, in particular on the ABNF grammars in section 7.

First, a small grammatical issue in the prose: Future version of BagIt may disallow should maybe be Future versions of BagIt may disallow or A future version of BagIt may disallow at https://github.com/LibraryOfCongress/bagit-spec/blob/bagit1.0/bagit.xml#L719.

ABNF Grammar feedback

The following commentary on the ABNF section of the BagIt1.0 spec, is based on an attempt to create parsers from the supplied ABNF grammars using the Instaparse library. The (still in-development) baglidate repository can be used to test proposed ABNF grammars and determine if they can be used to parse inputs that are expected to be valid.

7.1. Bag Declaration: bagit.txt

7.2. Payload Manifest: manifest-algorithm.txt

Here is the original grammar:

payload-manifest = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = "x" 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "F"
filename         = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                 )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "="
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF

A provisional amended grammar is given below. (Note that it does not allow for Unicode characters in filenames, unlike the provisional amended grammar given for 7.3 below.) This grammar addresses the issues described above:

payload-manifest = 1*p-m-line
p-m-line         = checksum 1*WSP filename ending
checksum         = 1*hex-val
hex-val          = 1*case-hexdig
                 [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ]
case-hexdig      = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" /
                 "d" / "E" / "e" / "F" / "f"
HEXDIG           = DIGIT / "A" / "B" / "C" / "D" / "F"
filename         = (
                    "data"
                    "/"
                    1*( unreserved / pct-encoded / sub-delims )
                   )
unreserved       = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims       = "!" / "$" / "&" / "'" / "(" / ")" / "*"
                 / "+" / "," / ";" / "=" / "/"
pct-encoded      = "%" HEXDIG HEXDIG
ending           = CR / LF / CRLF

7.3. Bag Metadata: bag-info.txt

Here is the original grammar:

metadata      = key ":" WSP value ending
key           = 1*alpha-numeric
value         = 1*(alpha-numeric ending) *(WSP 1*alpha-numeric ending)
alpha-numeric = ALPHA / DIGIT
ending        = CR / LF / CRLF

A provisional amended grammar is given below. Putting aside the fact that it would need revision because it uses regexes outside of the ABNF spec, it does seem to solve the issues described above:

metadata-set            = 1*metadata
metadata                = key ":" WSP value
key                     = UNICODE-VCHAR *( UNICODE-VCHAR / WSP )
value                   = value-line *( 1*WSP value-line )
value-line              = UNICODE-VCHAR *( UNICODE-VCHAR / WSP ) ending
ending                  = CR / LF / CRLF
UNICODE-VCHAR           = ( #'\p{Punct}' / #'\p{L}' / #'\d' )

7.4. Fetch File: fetch.txt

Here is the original grammar:

fetch           = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = DIGIT
filename        = (
                  "data"
                  "/"
                  *( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF

Here is a provisional revised grammar that addresses the issues listed above:

fetch-file      = 1*fetch-line
fetch-line      = url 1*WSP length 1*WSP filename line-terminator
url             = <absolute-URI, see [RFC3986], Section 4.3>
length          = 1*DIGIT / "-"
filename        = (
                  "data"
                  "/"
                  1*( unreserved / pct-encoded / sub-delims )
                )
line-terminator = CR / LF / CRLF
johnscancella commented 6 years ago

Thanks for the feedback, I will try and review this on monday. Have a great weekend!

johnscancella commented 6 years ago

@jrwdunham Thanks so much for the feedback, it was very helpful! Below are some comments to go along with my updates

7.2

7.3

johnscancella commented 6 years ago

@jrwdunham if you have more feedback on the changes please feel free to reopen this issue. Thanks!