LibraryOfCongress / bagit-spec


changed the bagit.txt example to make it more clear that you can choo… #14

johnscancella closed 6 years ago

johnscancella commented 6 years ago

…se which encoding the other tag files are in. As per comment https://github.com/jkunze/bagitspec/pull/19#issuecomment-383022387

johnscancella commented 6 years ago

@stain Hopefully this makes it clearer that you are only specifying the encoding of the other tag files.

But on second reading, do I understand that you still want to allow any encoding (without saying where that encoding name is defined), and that the bagit.txt file itself is the only one that must be UTF-8? (Why not ASCII?)

I would not mind voting for fixed UTF-8 for BagIt 1.0. This has become the norm for most formats, like XML and JSON. If we allow other character encodings, we must say which registry we refer to; otherwise arbitrary encoding strings like "code page 865" would be allowed.

We don't use ASCII for historical reasons. The goal of BagIt 1.0 was just to get the specification ready for official RFC status. We are planning a 2.0 release with many breaking changes, one of which is defaulting to UTF-8 for all tag files.

stain commented 6 years ago

This was actually new to me; I misunderstood the previous text. This PR makes it clearer. Should we say "used by the remaining tag files" to make it more obvious that this does not apply to bagit.txt itself? (which would have been tricky)

The UTF-8 encoding of bagit.txt is effectively ASCII (as BOMs are not allowed) unless you want to support ENCODING values that use non-ASCII characters, or non-numeric version numbers. But I'm OK with keeping it UTF-8 for consistency.
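To illustrate the point about bagit.txt being effectively ASCII, here is a minimal Python sketch. The sample file content (including the version number) and the helper names `same_bytes_in_ascii_and_utf8` and `starts_with_bom` are assumptions made for illustration, not text from the spec.

```python
import codecs

# Illustrative bagit.txt content; the version number is only an example.
BAGIT_TXT = "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"


def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character is outside ASCII, so the encodings differ.
        return False


def starts_with_bom(data: bytes) -> bool:
    """Return True if the raw bytes begin with a UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    print(same_bytes_in_ascii_and_utf8(BAGIT_TXT))
    print(same_bytes_in_ascii_and_utf8("Tag-File-Character-Encoding: Stian☃\n"))
    print(starts_with_bom(BAGIT_TXT.encode("utf-8")))
```

As long as the ENCODING value and version number stay within ASCII and no BOM is written, a reader that assumes ASCII and one that assumes UTF-8 see the same bytes.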

The values for ENCODING must come from a known registry and not be free-form -- we don't want:

Tag-File-Character-Encoding: Stian☃

I don't see a reason not to re-use the RFC 2978 registry. So let's just cite "A character set name registered according to [RFC2978]".
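A rough Python sketch of what "not free-form" could mean for a validator. Everything here is an assumption for illustration: the character class is my reading of the charset-name syntax in RFC 2978 and should be checked against the RFC, and `ILLUSTRATIVE_REGISTRY` is a tiny hypothetical stand-in for the real IANA registry, which a real tool would need to load in full.

```python
import re

# Assumed charset-name syntax: 1 to 40 characters from a restricted
# printable-ASCII set. Verify against RFC 2978 before relying on it.
_CHARSET_SYNTAX = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]{1,40}")

# Hypothetical subset standing in for the full IANA character set registry.
ILLUSTRATIVE_REGISTRY = frozenset({"UTF-8", "US-ASCII", "ISO-8859-1"})


def has_charset_syntax(name: str) -> bool:
    """Return True if name looks like a syntactically valid charset name."""
    return _CHARSET_SYNTAX.fullmatch(name) is not None


def is_acceptable_encoding(name: str) -> bool:
    """Return True if name is well-formed and in the stand-in registry."""
    # Names are compared case-insensitively here, which is an assumption.
    return has_charset_syntax(name) and name.upper() in ILLUSTRATIVE_REGISTRY


if __name__ == "__main__":
    for candidate in ("UTF-8", "Stian☃", "code page 865"):
        print(candidate, is_acceptable_encoding(candidate))
```

With a check like this, both the snowman value and a free-form string such as "code page 865" are rejected on syntax alone, before the registry is even consulted.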

acdha commented 6 years ago

+1 on referencing RFC 2978.

I'm also thinking that it would be good to recommend UTF-8 since it's 2018 and using anything else is almost always either legacy compatibility or a mistake. Perhaps something along these lines?

ENCODING &should; be UTF-8 but for backwards compatibility it &may; be any other encoding registered in RFC2978
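A reader could apply the wording above roughly as in the following Python sketch. The function names and the three result labels are hypothetical, and the sketch deliberately does not check registry membership; it only separates the recommended case from the backwards-compatible one.

```python
from typing import Optional


def declared_tag_encoding(bagit_txt: str) -> Optional[str]:
    """Return the Tag-File-Character-Encoding value, or None if absent."""
    for line in bagit_txt.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Tag-File-Character-Encoding":
            return value.strip()
    return None


def classify_encoding(name: Optional[str]) -> str:
    """Classify a declared encoding under the proposed SHOULD/MAY wording."""
    if name is None:
        return "missing"
    # Case-insensitive comparison is an assumption of this sketch.
    if name.upper() == "UTF-8":
        return "recommended"
    # Any other value would still need to be checked against the registry.
    return "allowed-for-compatibility"


if __name__ == "__main__":
    sample = "BagIt-Version: 1.0\nTag-File-Character-Encoding: ISO-8859-1\n"
    print(classify_encoding(declared_tag_encoding(sample)))
```

A tool might treat "allowed-for-compatibility" as a warning rather than an error, which matches the intent of recommending UTF-8 without breaking existing bags.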

justinlittman commented 6 years ago

Concur with @acdha's recommendations.

johnscancella commented 6 years ago

superseded by #27