Base45's % symbol leads to automated decoding problems

vitorpamplona commented 3 years ago

Take this example: https://github.com/ehn-digital-green-development/hcert-testdata/blob/main/testdata/test1.png

On an Android device, open the Camera app, scan the QR and copy the text. You will get this:

HC1:NCF390:70T9WWWGERKB 4+23GO01TL2X9$S35H99CK4705XK2F3 MGNUF4F3QRCS%VLJC+6BY50.FK8ZKO/EZKEZ96446C56..DX%DZJCH/DYTCEC8PSE9EN.OEWV5K/EM-DY69TVD/.DI3DF.DVKE50EN7L+EDTW5S3E:.DW.CJWEDS8FIA71ANNAV+A3+9269+S9.+9I3DFWE2+8NB8-M86C9:R7*0A.+9GVC*JCNF6F463W5KF6VF6KECTHG4KCD3DX47B46IL6646H*6Z/E5JD�IA74R6646307Q$D.UDRYA 96NF6L/5SW6Y57B$D% D3IA4W5646946846.96XJC$+D3KC.SCXJCCWENF6OF63W5$Q6OF6WJCT3EJ+9%JC+QE1R5ZED+EDKWE3EFX3E$34Z$EXVD-NCiAECAWE1Q582BUVDGECDZCCECRTCUOA04E4WEOPCM8F6�.DA%EOPC1G72A6J+9XG7%UDL57Z1BCgIBMIAO7BLB8RUDG09/IBC48G+8D3TDY8.U5-$RK-B0-DRC99NTLZ3H LPN7D%F-*19*H9ZRT$JMXPHQ9W66$+M.2KEMAX2NR1FO9WQ/A0ZC9$27+KVBA0HT..BK3

It should have been this:

HC1:NCF390:70T9WWWGERKB 4+23GO01TL2X9$S35H99CK4705XK2F3 MGNUF4F3QRCS%VLJC+6BY50.FK8ZKO/EZKEZ96446C56..DX%DZJCH/DYTCEC8PSE9EN.OEWV5K/EM-DY69TVD/.DI3DF.DVKE50EN7L+EDTW5S3E:.DW.CJWEDS8FIA71ANNAV+A3+9269+S9.+9I3DFWE2+8NB8-M86C9:R7*0A.+9GVC*JCNF6F463W5KF6VF6KECTHG4KCD3DX47B46IL6646H*6Z/E5JD%96IA74R6646307Q$D.UDRYA 96NF6L/5SW6Y57B$D% D3IA4W5646946846.96XJC$+D3KC.SCXJCCWENF6OF63W5$Q6OF6WJCT3EJ+9%JC+QE1R5ZED+EDKWE3EFX3E$34Z$EXVD-NC%69AECAWE1Q582BUVDGECDZCCECRTCUOA04E4WEOPCM8F6%E3.DA%EOPC1G72A6J+9XG7%UDL57Z1BC%67IBMIAO7BLB8RUDG09/IBC48G+8D3TDY8.U5-$RK-B0-DRC99NTLZ3H LPN7D%F-*19*H9ZRT$JMXPHQ9W66$+M.2KEMAX2NR1FO9WQ/A0ZC9$27+KVBA0HT..BK3

Google's Camera automatically Percent Decodes the URI on copy, which means any random two-byte hex sequence that has % is incorrectly decoded to a UTF character.

In the example, %96 -> �, %69 -> i, %67 -> g and %E3 -> �

If you try Base45 Decode, because � and � are not ASCII, it doesn't work.

I am not sure what to do about it, but I bet these random encodings can crash many systems out there.

jschlyter commented 3 years ago

Thank you for the report @vitorpamplona. I'm afraid there's not much we can do about this issue. Limiting the encoding to both URL-safe and QR45-safe characters would have put us in a very limited position trying to encode binary data.

The HCERT QR-encoding was not designed to be read by the generic QR reader designed to process URLs, but to be read by dedicated validation apps and fixed gate readers. The addition of the colon in the HC1 premable was made to prohibit automatic processing when a generic QR reader is used anyway, not to enable it.

vitorpamplona commented 3 years ago

I am not sure I understand why "Limiting the encoding to both URL-safe and QR45-safe characters would have put us in a very limited position trying to encode binary data". I use Base32 and Base36 for all the other credentials out there, and it is URI safe and QR45 safe, even for binary data. It does not limit anything.

BTW, URL and URI are very different standards. QRs are never made to process URL, but URI. And EVERY QR is made to process URIs, including those at the gate readers (yes, I have shipped QRs for those in the past).

I also do not understand this "The addition of the colon in the HC1 preamble was made to prohibit automatic processing when a generic QR reader is used anyway, not to enable it." because it definitely does not prohibit anything. I don't know where the misconception relies on.

vitorpamplona commented 3 years ago

My suggestion to remove inconsistency in production is to either:

Percent Encode the Base45 content before assembling the URI as required by the URI specification
Using a different base (Base32, Base36) that does not interfere with reserved characters in the URI specification.

jschlyter commented 3 years ago

I am not sure I understand why "Limiting the encoding to both URL-safe and QR45-safe characters would have put us in a very limited position trying to encode binary data". I use Base32 and Base36 for all the other credentials out there, and it is URI safe and QR45 safe, even for binary data. It does not limit anything.

Both Base32 and Base36 where evaluated during the design phase as the QR payload would, given the test vectors available, to be too large. The goal was to encode the data for optical encoding as efficiently as possible. The payload has never been designed to be URI compatible.

BTW, URL and URI are very different standards. QRs are never made to process URL, but URI. And EVERY QR is made to process URIs, including those at the gate readers (yes, I have shipped QRs for those in the past).

Yes, my bad - I should have written URIs.

I also do not understand this "The addition of the colon in the HC1 preamble was made to prohibit automatic processing when a generic QR reader is used anyway, not to enable it." because it definitely does not prohibit anything. I don't know where the misconception relies on.

The initial preamble was "HC1" with the trailing colon. Without the colon, a phone that scans the HCERT will usually try to feed the entire QR payload to Google and friends, resulting in data leakage. The colon prohibits just that, nothing more.

vitorpamplona commented 3 years ago

Both Base32 and Base36 were evaluated during the design phase as the QR payload would, given the test vectors available, be too large.

I just tested your signing and verifying procedures with the example payload and the difference between Base32 and Base45 is 6%.

6% is not going to make or break your QRs. And if you are that worried about 6%, then you should be using something better than CBOR/ZLIB to compress the payload. You are leaving 30-40% of potential savings by choosing CBOR/ZLIB.

If you want to check for yourself, I have the 3 implementations online: https://github.pathcheck.org/eu.dgc.html

jschlyter commented 3 years ago

There are many transport encodings that may work and depending on your requirements. As stated, the HCERT was not designed to be transported as a URI and the specified transport for prefix "HC1" is Base45 - that will most likely not change now that the specification has been approved.

There's nothing prohibiting us defining a new prefix, e.g., "HC2", in the future and use a different encoding based on other requirements (such as being URI safe).

vitorpamplona commented 3 years ago

So, it looks like a URI, but it is not designed to be a URI in an ecosystem everyone expects and processes payloads as URI.

Looks like a recipe for disaster.

Seriously, you can just Percent Encode the Base45 before making the URI.

That does not change the approved specification, in the same way, adding the colon didn't change the approved language.

To the best of my knowledge, this specification work is still in very, very early stages and most definitely, no one is using it yet. I would suggest using this time as an opportunity to fix the problems you find before it gets real messy to do anything.

If you think you can't change your spec now, just imagine how you will be feeling after hundreds of millions of QRs out there.

Or don't imagine, just ask DIVOC how they feel about their 100m unreadable QRs out there.

dirkx commented 3 years ago

Sorry Victor - we cannot just escape encode de Base45.

As that means introducing the '%' character. Which is not part of the Mode 0010 of the QR code its charset.

So the introduction of just that single % means we need to switch to the full 8-bit raw mode. And of those 8 bits we then only use on average just 5.5 bit. Quandering 2.5 bit. And thus making de barcode 1/3 larger. We do not want that.

Dispensing with base45 and going 8 bit is also not an option - as then we'd break existing readers which cannot handle raw 8 bit.

in the end of the day - the payload of a QR code is a base45 string. With 11 bits for every 2 char sfrom that charset. That is simply how QR codes work. And that is its safe set (just like how in email we often go to 7-bit safe still).

If you want to turn it into a URI - you'll need to, for example; remove the HC1: decode the base45; rewrite that as a base64 url safe payload and prefix it with something else (e.g. data:).

Defining such as a payload if of course useful. And would be just like we have different transfer encodings. The current one is Base45 for a QR. There may be an 8bit-safe version, a 7bit-safe version, one suitable for URIs, etc. But we've not yet done that.

And no - we cannot easily prefix it with something that is not ':' separated of the main body. As then we'd risk this ending up in the Siri's, Google and other log-files of this world. So breaking URIs is in fact a feature here. It keeps medical PII away from the grubby underbelly of the internet.

vitorpamplona commented 3 years ago

That's wrong. % IS part of the Mode 0010 of the QR code its charset: 0–9, A–Z (upper-case only), space, $, %, *, +, -, ., /, :

vitorpamplona commented 3 years ago

If you want to turn it into a URI - you'll need to, for example; remove the HC1: decode the base45; rewrite that as a base64 url safe payload and prefix it with something else (e.g. data:).

This doesn't make any sense. Nothing requires you to use Base64...

vitorpamplona commented 3 years ago

we cannot easily prefix it with something that is not ':'

I did not suggest that. I said the direct opposite. Not sure where this is coming from.

vitorpamplona commented 3 years ago

Here's the final spec I am suggesting:

"HC1:"+PercentEncode(base45(zlib(cose(cbor(cwt(payload))))))

That solves everything. This is not rocket science.

jschlyter commented 3 years ago

The HCERT was not designed to be an URI. It may look like a URI, but it's not. It is designed to be read by dedicated HCERT validator apps and legacy QR readers. Yes, it will fail if someone tries to read it with a generic QR reader that assumes that the input is a URI. So will a boarding card with QR, or any other QR that does not contain a URI, and will most likely result in a Google search including the entire payload.

I am interested in the use-case though. What is expected to happen when a generic person scans the HCERT per your proposal?

dirkx commented 3 years ago

That solves everything. This is not rocket science.

That makes a URI. We do not want this to be a URI.

vitorpamplona commented 3 years ago

It is designed to be read by dedicated HCERT validator apps and legacy QR readers

I find it interesting that you think you can control user behavior. :)

So will a boarding card with QR

No. If you read a boarding pass QR, your phone will read it. It doesn't matter which QR scanner you use. None of them create random URI patterns that interfere with QR implementations.

What is expected to happen when a generic person scans the HCERT per your proposal?

The HCERT is just one of the thousands of different Verifiable Credential QR codes out there. Users won't even know that this QR is an HCERT certificate until they scan it. Why do you think they know which app to use?

What should happen? They should be able to read it. Directly into the Apple and Google camera if they wish to do it.

This is just good user experience design.

You don't get to decide what users will do.

vitorpamplona commented 3 years ago

That makes a URI. We do not want this to be a URI.

Then don't use a QR. Because any a:b.c/something will be read as a URI. It doesn't matter what you want.

vitorpamplona commented 3 years ago

That makes a URI. We do not want this to be a URI.

Actually, this is a great point in favor of Base32. If you don't want QR readers to read as URIs, then don't allow your spec to randomly create URIs. Very simple.

jschlyter commented 3 years ago

Can we agree on the following?

there exists QR codes with non-URI payloads.
just because something looks like a URI, e.g. some characters ":" some characters, doesn't make it a URI.
if a generic smartphone with the default QR reader (e.g. camera) scans a QR code that it can't understand and believes it is not a URI, it will Google it (usually asking the user first). This results in data leakage.

I understand you are suggesting we should have made it a URI, and given that your proposal makes sense.

vitorpamplona commented 3 years ago

Agree.
Agree.
Agree.

However:

non-URI but URI-looking payloads (half-backed implementations, which is what you have) DO NOT EXIST and are not expected by QR Scanners. No one is crazy enough to have ever proposed this idea.
.... It doesn't make it a URI, BUT it also doesn't say it is definitely NOT a URI. Ambiguity is bad. You are just creating problems for Scanners and verifiers.
HC1 is just a little hack you found that currently works. It doesn't have support from a spec. It's not a guarantee that it will not leak data. Google will Google.

And on 2: if a payload looks like a URI or includes a URI within, because QR scanners are so URI heavy, all of them will assume it is, will try to read as a URI and will make their decoding decisions based on the URI spec. Many of them will expect it to be a URI by default and some will not even have a treatment for non-URI payloads.

My point is: why are you making things harder for QR scanners and for yourselves?

You can easily go with what everybody else does and things will work. As opposed to going with these non-URI URI-looking payload that just confuses everyone and will certainly randomly crash systems in production based on how the Payload ends up looking by chance.

jschlyter commented 3 years ago

What happens when the generic person scans your percent-encoded HCERT QR (or any other URI-compliant HCERT QR)? Is "hc1:" registered as a URL handler on the phone by an already installed app? If such an app is not installed, what happens?

vitorpamplona commented 3 years ago

PercentEncoding does not change how the URL handler registration operates. Whatever happens with today's implementation will happen with the Percent Encoded one.

jschlyter commented 3 years ago

Let me rephrase; What do you believe should happen when a generic person scans a HCERT QR? Is "hc1:" registered as a URL handler on the phone by an already installed app? If such an app is not installed, what happens?

vitorpamplona commented 3 years ago

Well, it's hard to answer the question without a why. The why defines the experience, not the what.

But on average, people scan QR codes to see its content or to do something with it.

In that case:

The default camera app should read the QR, understand that this is an HCERT, unpack it, verify it, and show all the fields in a nicely designed popup over the camera preview, without going online.
After the user reviews the information, the camera app should suggest a few actions to run within the default camera app (such as a rule engine that generates a new certificate) or to send the content to the most appropriate app.

Actions can vary, such as printing the QR code again (in text or in symbol), sharing it with family, backing up, etc.

The most appropriate app can be the user's preferred Verifiable Credential wallet (which can also be Google and Apple's wallet), the user's health records app (to share with their doctors), a traveling app that is requesting proof of vaccination, a storage center where the user keeps all his records, my employer's app, so that I can get a $200 bonus for being vaccinated, a venue app so that I can get access to the game, etc.

If the user is not the holder, but a bouncer in a club, for instance, the default camera app should simply run the business' rule engine and return an anonymized signed certificate from Apple or Google that shows this person passed the health screening of the business (in whatever policies and screening procedures such business have). The derived certificate is saved with the business records.

And then, of course, if the user already has the HCERT app, the default camera app should have an option to send it to this app. I am just not aware of what the HCERT app does or why would anyone download it, so it's hard to map this use case.

All of this should happen fully offline.

As you can probably see. Most actions with the QR are done with the default camera app. Ideally, users never need to download anything else in order to do their most common tasks.

jschlyter commented 3 years ago

Although I agree with your vision, you are describing things that does not exist today as the default camera app cannot parse the HCERT contents and likely never will as the the validation process is way more complicated than you want your generic camera app to understand. If the HCERT had contained a URI and if the user had a verifier app that registered the HC1 URI scheme installed, such an app could have opened the HCERT, verified the signature against the trusted list and displayed the contents once validated against the current ruleset.

Your ideas are welcome and perhaps a future scheme (e.g., HC2) will include be encoded as a URI based on experiences from HC1. That will however not change for HC1, so the HCERT QR codes must - at this time - be scanned by the verifier app directly.

vitorpamplona commented 3 years ago

I don't think you are understanding me.

I didn't post this because of the URI schema grab to auto open the app. That would be super nice and Google and Apple are implementing it (I am the one pushing them), but this is not the discussion.

I am saying HC1 spec is wrong and needs fixing. This post is about a bug, not a wish.

But, not only that. HC1 will crash production QR readers in the field because users don't know which QR is which, will read the HCERT QR on everything, will make that "everything" crash because of the non-URI, but URI looking encoding and there's nothing you can do to block that behavior from your users.

This is not about you. It's about everybody else. You are just making everybody else's work so much harder...

jschlyter commented 3 years ago

We do understand you and we're currently faced with two options:

Do nothing and hope that generic QR code readers are robust enough to cope with (for them) bad input. We have seen none that can't handle the HC1 HCERTs, but that doesn't mean there aren't any.
Change the EU standard for Electronic Health Certificates with two months until full deployment. Such as change would most likely make us switch to Base32.

We're still discussing, but any changes at this point is non-trivial.

vitorpamplona commented 3 years ago

Thank you for considering the change.

And apologies for the aggressive tone. I do think this is important.

Also, I just recommended changing the Date and DateTime schema encodings to save up to 10% of space in the Test QR. You can use that to offset the size loss of 6% if you shift to Base32.

jschlyter commented 3 years ago

Btw, why would an app register a HC1 URI scheme if they know this is a problem? It's not like there is anything to parse unless you know how to decode the HCERT. Nothing in the HCERT is human readable without a proper parser with validation.

vitorpamplona commented 3 years ago

It's very easy to decode HCERT. I coded our HCERT verifier in 4 hours without knowing anything about CBOR and COSE (I have used LD-Profs in the past, but not COSE-based certs).

If it is that easy and enhances user experience, then the question becomes, why not?

The only issue I see is that COSE's Verify function is super slow (up to 1 second on a mobile phone) for reasons I still don't know. Any other signature verification usually runs on the 10-20 milliseconds.

I think the problem of passing Apple's and Google's Store reviews to justify the HC1 grab is much more difficult (weeks of back and forth) than the implementation of the decoder.

vitorpamplona commented 3 years ago

You might have seen it already, but Apple's implementation is out: https://developer.apple.com/videos/play/wwdc2021/10089/

Importing "SHC:" URIs into the Health App from the main Camera App

chrysn commented 3 years ago

Apart from the topic of whether it's intended as a URL, Carsten Bormann made a good observation on the CBOR/COSE list: Base41 would be, in terms of encoding efficiency when combined with QR codes, just as efficient as base45. (There is 30% loss we suffer from QR readers not being fully binary safe; using base41..base44 shifts these 30% around to using less of the alphabet and instead having fewer invalid trigraphs composed of alphabet letters).

Removing the space and % characters would solve this issue and #72; removing $ and * would also remove other characters that are occasionally magic. The result would probably even pass for a URI in generic parsers (at least with the hc1: prefix that ensures it's of the URI form), because +-. are gen-delim / unreserved, and while / and : do have special meanings, if / stays the last character (where trigraphs starting with // are already >65535) there will never be something starting with //, so the big syntactic show-stopper of running into anything that reads like an invalid authority (non-numeric port, invalid IPv4 address) won't happen. (But I'd guess QR readers would let even that through). The remaining oddity from a URI parser's view would be the potential occurrence of a // later on, but AFAIR that already falls into the scheme based normalization rules.

(Not that I'm saying that these things should be URIs, but given the unfortunate fact that QR codes are untagged in practice, at least it can now look like one).

[edit: Even a single initial slash won't occur as 404141 > 65535 -- so when parsed as a URI all something:base41 would read in ABNF as scheme ":" path-rootless. Then, all should indeed be fine according to the ABNF, as there is a leading pchar for the segment-nz and then just slashes and segment-nzs. Replacing the slash with * or $, which are both sub-delims and thus pchar, might make things even easier.]

vitorpamplona commented 3 years ago

Good point!

Additionally, some URI processors delete the last / when copying. That would break the Zlib decompression when the last char is /.

Another option is to do Base36 in uppercase. It's a more stable base than Base41 and it removes even further conflicts with reserved characters of the URI standard. Base36 has evolved well in the last months and it's even used in the Multibase standard some crypto projects use: https://github.com/multiformats/multibase

I have had so many problems with the implementations of Base45 that I don't even consider it a viable option anymore. :(

chrysn commented 3 years ago

Additionally, some URI processors delete the last / when copying.

That's terrible behavior on their side, as that even alters the meaning of HTTP URIs.

Anyway, leaving one of $ in and * and removing / instead would alleviate that.

Base36 has evolved well in the last months and it's even used in the Multibase standard some crypto projects use: https://github.com/multiformats/multibase

The multibase specification is developed at the same venue as base45, in independent individual drafts, so maybe the authors would care to sync up?

vitorpamplona commented 3 years ago

Additionally, some URI processors delete the last / when copying. That's terrible behavior on their side, as that even alters the meaning of HTTP URIs. Anyway, leaving one of $ in and * and removing / instead would alleviate that.

I agree. Don't even get me started on that side... :)

Multibase would include Base45. But they will continue being different things. Multibase is probably just waiting for Base45 to mature.

jschlyter commented 3 years ago

As I believe everyone is aware of, multiple reasons and constraints where behind choosing base45. Once some of the constraints where changed the rollout process had gone too far to motivate a change and it was decided that we stay with the current encoding for HC1. Keep in mind that we're in full production on July 1st this year.

Would we change it if would do it again? Most likely yes. Personally, I would have look at base32 given that the payloads are far smaller than we initially designed for, but YMMV.

Are there great benefits for EU citizens to change it now and delay the entire rollout of creating QR codes and distributing validation apps to border control and other checkpoints? No, it works good enough as the codes are not designed to be scanned by generic phone apps, but designed to be scanned by dedicated verification apps.

I would claim that change at this point (or even a month ago) would be disruptive and put the entire roll out at risk. With that said, I will close this issue. If you want to propose a new encoding method, e.g. HC2, please open a new issue.

chrysn commented 3 years ago

I would claim that change at this point (or even a month ago) would be disruptive and put the entire roll out at risk.

Understandable.

Please add a note on this to the base45 draft when crunch time is over. (Even if it doesn't make it through the full RFC process, that document will be found by people who put stuff into QR codes, and just a note that "this is not followed on because X and Y caused trouble (but it is used as-is in Z for roll-out reasons)" will help the next people).

rasa commented 2 years ago

Forgive my asking so late in the game, and I am not being critical, just curious, but why was 45 characters chosen to encode 2 bytes as three 8-bit characters, when 41 characters would have been sufficient (41**3>65536)? Then you could have used a URL-safe (and filename safe) encode string of, say, 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ!.-_~.

jschlyter commented 2 years ago

@rasa Because of different requirements and use cases at the time of design, Base45 was choose. As you can read above this may no longer be valid reasons, but there's nothing we can do about it now.

dirkx commented 2 years ago

@Rasa -- Aye - fair point and good point. Something not considered.

Mostly as we initially in NL were using a true Base45 (much like the https://tools.ietf.org/id/draft-msporny-base58-01.html) which uses the combinatoric space to the last bit. (https://github.com/ehn-dcc-development/base45-java/tree/base45-tight, https://github.com/ehn-dcc-development/base45-ansi-C/tree/experiment/efficient-packing) - and only later simplified this to what SE was using.

Good thing for a future version (either go 41 chars -or- use a real base45.

jpph commented 2 years ago

As a side note, the qrcode need to be handled by a certified country specific app. For example in France, verifier can be fined if they access information like the vaccin date, type etc... The app just show that the pass is OK. So this can not be handled like a uri.

chrysn commented 2 years ago

Whether or not data should be read is unrelated to how it is encoded; if anything, a standards-based encoding can do a better job at pointing the reader towards the applicable rules compared to data just sitting there. (So, say, had this all been dqr:hc1:${base41encoded} and the qr structure been universal, apps would still see the hc1 and decide what and whether at all to process the data. Any policy enforcement deemed necessary can then be done more reliably done in the reader).

But as has been pointed out, the ship has ... shipped for this application, and let's all meet here again when there are pointers to either new applications that need something like this, or to are concrete proposals how the next application could do it.

ehn-dcc-development / eu-dcc-hcert-spec

Base45's % symbol leads to automated decoding problems #64