Byron / google-apis-rs

A binding and CLI generator for all Google APIs
http://byron.github.io/google-apis-rs

Attempt to add 'standard' base64 bytes support #446

Closed emarcotte closed 8 months ago

emarcotte commented 11 months ago

https://github.com/Byron/google-apis-rs/issues/442, https://github.com/Byron/google-apis-rs/issues/445, and https://github.com/Byron/google-apis-rs/issues/444 all flag cases where some APIs respond with bytes values that are not valid base64 under the "URL safe" flavor of configuration.

This adds a "standard" wrapper alongside the URL-safe one, with the intention of finding a way to flag which structures should use which configuration.

I'm still trying to figure out how to properly test this and I have no clue how we could mark some as URL safe and some as standard. Open to any tips!

emarcotte commented 11 months ago

I ran the regenerate process and pushed everything to https://github.com/emarcotte/google-apis-rs/tree/regenerate, then pointed a contrived program to it which was able to successfully read a secret that would not decode as url-safe.

Byron commented 11 months ago

Fantastic! I think the solution could be (instead of relying on per-API settings that would need to be created by hand, in the absence of a way to infer this) to have a wrapper that tries both encodings and fails only once both attempts fail.

That way we only have to choose the most common encoding to try first in order to avoid too many wasted cycles (even though decoding would probably detect early that the encoding doesn't pan out, so failure is cheap anyway).

What do you think?
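The fallback idea could be sketched roughly like this (Python for illustration, since the stdlib `base64` module makes the behavior easy to demonstrate; `decode_with_fallback` is a hypothetical name, not anything in the generated crates):

```python
import base64
import binascii
import re

def decode_with_fallback(value: str) -> bytes:
    """Try the URL-safe alphabet first (what the discovery docs promise),
    then fall back to the standard alphabet; fail only if both fail."""
    # URL-safe alphabet: A-Z a-z 0-9 '-' '_', optionally '='-padded.
    if re.fullmatch(r"[A-Za-z0-9_-]*={0,2}", value):
        return base64.urlsafe_b64decode(value)
    # Standard alphabet: A-Z a-z 0-9 '+' '/'.
    if re.fullmatch(r"[A-Za-z0-9+/]*={0,2}", value):
        return base64.b64decode(value)
    raise binascii.Error(f"not valid base64 in either alphabet: {value!r}")
```

A value containing `-` or `_` takes the first branch, one containing `+` or `/` the second, and anything else (values valid in both alphabets) decodes identically either way.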

emarcotte commented 11 months ago

I'll give it a shot sometime this week. I'm not an expert in base64 so my only question would be if there's any possible scenario where we do an incorrect, but non-erroring parse.

BTW; any hints on that clippy lint?

Byron commented 11 months ago

Thanks for your help!

I'll give it a shot sometime this week. I'm not an expert in base64 so my only question would be if there's any possible scenario where we do an incorrect, but non-erroring parse.

Me neither, but the question here is whether decoding could succeed under one alphabet yet produce garbage. And I'd think the answer to this is strictly: no. Of course, this might just be wishful thinking, but I am inclined to believe it's true and not worry about it.
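That intuition can actually be checked: the two alphabets share their first 62 symbols and differ only in the last two (`+/` vs `-_`), so any string that is valid under both alphabets necessarily decodes to the same bytes. A quick check in Python:

```python
import base64
import string

std_alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
url_alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

# The alphabets agree on the first 62 symbols; only indices 62 and 63 differ,
# so a string valid under both cannot decode to different bytes.
assert std_alphabet[:62] == url_alphabet[:62]

# Example: a value avoiding '+', '/', '-', '_' decodes identically under both.
assert base64.b64decode("ksgKMQ==") == base64.urlsafe_b64decode("ksgKMQ==")
```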

BTW; any hints on that clippy lint?

Each time the Rust version increments, clippy learns new tricks, which can lead to a failing CI once the runner gets the upgrade. You can probably reproduce it easily locally in the crate itself (it's not generated) and fix it as suggested by clippy itself.

The non-lint failure is more of a problem though, as it seems python related. (Oh, I have so learned my lesson when it comes to python: do not use it for anything that has to last.)

Thanks again.

emarcotte commented 11 months ago

On the "fetching" side I think a fallback approach can work, but I am not so confident about the sending side. If, say, PubSub message payloads need to be URL-safe, then we'd be posting invalid data, which is a bit less ideal than burning cycles on decoding.

Would you expect that we only do the fallback on the reading side?

BTW, I did finally realize Google's standard APIs CLAIM to use the URL-safe variant, but clearly that isn't followed very precisely (or the docs are bad).

Byron commented 11 months ago

On the "fetching" side I think a fallback approach can work, but I am not so confident about the sending side.

Right, I agree! The fallback would only work for the reading side. Getting it to work for sending would mean there is a mechanism for the user to configure this, so the user would have to figure it out anyway. And if that's the case, whatever the user configures might as well be used for the receiving end and avoid having to guess anything.

The question really is if the user has a way to know. Is there something in the API description that indicates what is what? I am pretty sure that the generators that Google does provide for other programming languages have figured this out, so maybe one could learn from there?

emarcotte commented 10 months ago

Haven't forgotten this, just been swamped. I've been trying to research how the Go and Python clients work, and mostly just getting more confused about how this fixed anything and why the original code didn't work...

Byron commented 10 months ago

just getting more confused about how this fixed anything and why the original code didn't work...

I think this is related to some APIs acting differently, but I also never validated this myself. To me it seems the fix broke more than it fixed, and maybe this tells us we should revert first and then try to fix the original issue another way.

If you agree, please feel free to open a revert-PR for that.

andrewbaxter commented 8 months ago

Sorry, just dropping in. I assume the issue is random apis use random different base64 standards, and it's not part of the metadata? It could be like 50/50 urlsafe vs standard?

What about one of the following:

  1. A base64 decoder that tries both standard and urlsafe? Or tries to normalize first (swapping underscores or whatever out)? Is this just decoding? Maybe gcp is permissive in which b64 it accepts so the requests don't matter, just the responses?
  2. Maintain a separate list of per-api metadata including which base64 style is used. Making MRs for individual apis which have issues as I (we?) encounter them doesn't seem too hard. The initial metadata could be 100% base64.
  3. Don't decode, or provide separate methods that don't decode, so the user can choose to deal with it themselves.
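Option 1's "normalize first" variant is especially simple because the two alphabets differ only in two characters; a sketch in Python (`decode_normalized` is a made-up name for illustration):

```python
import base64

def decode_normalized(value: str) -> bytes:
    """Map the URL-safe characters onto the standard alphabet, then
    decode once. Since the alphabets differ only in '+/' vs '-_',
    the translation cannot change the decoded bytes."""
    return base64.b64decode(value.replace("-", "+").replace("_", "/"))
```

This accepts either flavor (and even mixed input) in one pass, at the cost of not rejecting values that strictly violate the advertised URL-safe encoding.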

andrewbaxter commented 8 months ago

Uhh, so AFAICT the golang library uses the standard go json package, which always uses standard (not web-safe) base64 encoding and decoding. This directly contradicts their own guidance, but well...

Byron commented 8 months ago

Maybe it can just be switched back to that then. From my experience, after it changed to what it is now the number of issues multiplied - so what was a fix for few turned out to be a problem for many.

Maybe you can find the commit and revert it, which should then be standard base64 as well.

Alternatively, figure out a way to let the user handle encoding and decoding to a greater extent.

By the way, I am very sorry you all have to suffer through this and my hands-off approach to the project. Too many people use it and it's in a barely-community-maintained state - right now it's not good enough. I simply can't understand why Google doesn't provide their own bindings for Rust.

andrewbaxter commented 8 months ago

Looking at python, there's two bytes handlers registered here: https://github.com/googleapis/proto-plus-python/blob/161f89f5b7a6f493602ed8890ee9a11ebbc77836/proto/marshal/marshal.py (I think this is the library python uses)

The first one comes from googles protobuf library, and I think the "bytes" type has id 18. The 2nd one is defined locally, with id 12. I think maybe the python methods might be automatically generated based on protobuf specs?

The first one uses standard for encode, and urlsafe for decode (?): https://github.com/protocolbuffers/protobuf/blob/a7b0421c78412baa48704d727601a17ac0f451d1/python/google/protobuf/json_format.py AFAICT ... that's using id type 12 though. Maybe I read the 18 thing wrong.

The 2nd one uses urlsafe.

andrewbaxter commented 8 months ago

One wonders why google does a lot of things

andrewbaxter commented 8 months ago

And this is way better still than writing it from scratch, especially if you don't mind reviewing and merging!

andrewbaxter commented 8 months ago

I'm going to try some read/write api with a couple libraries and see what goes over the wire.

andrewbaxter commented 8 months ago

So doing a file upload with checksum=crc32c

Golang sends

"crc32c":"ksgKMQ=="

receives:

"Etag":[]string{"CNGDo77Aw4MDEAE="},
...
"md5Hash":"4vxxTEcn7pOV8yTNLn8zHw==","crc32c":"ksgKMQ==","etag":"CMTc+/Gzw4MDEAE=",

Python sends

"crc32c": "ksgKMQ=="

receives

header: ETag: CPic/oa8w4MDEAE=

but no body because apparently in python it's impossible to log http response bodies.

There is a lot of explicit encoding/decoding though in both, stuff that looks like it wasn't automatically generated (from protobuf or otherwise). The checksum is like that, so it might not be a good example.
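Incidentally, the etag observed above already narrows things down: it contains `+` and `/`, which exist only in the standard alphabet, so at least that response cannot be strict URL-safe base64. A quick check in Python:

```python
import base64
import re

etag = "CMTc+/Gzw4MDEAE="  # value received in the Go upload test above

# '+' and '/' do not occur in the URL-safe alphabet...
assert re.fullmatch(r"[A-Za-z0-9_-]*={0,2}", etag) is None

# ...but the value decodes cleanly under the standard alphabet
# (16 chars with one '=' of padding -> 11 bytes).
assert len(base64.b64decode(etag)) == 11
```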

andrewbaxter commented 8 months ago

As someone who seems deeply familiar with implementing GCP api libraries, if you don't mind a totally out of left-field ping @codyoss - would you be able to lend advice on what base64 encoding is canonical here? (or maybe confirm that it's a case-by-case basis thing)

Otherwise I think:

  1. Revert (or switch to standard). I'll try to look up that commit
  2. Add a mechanism for pinpoint overrides

Would be the safest...

andrewbaxter commented 8 months ago

Was this the commit? https://github.com/Byron/google-apis-rs/pull/379 - If I'm reading that right, before it was just String etc., so neither urlsafe nor standard base64.

In that case I'm not sure overrides make sense. I kind of do feel reverting is safest and easiest here, and having String is still a far cry better than having to set up http requests manually. If there were a clear path to decoding that would be awesome though.

Byron commented 8 months ago

Thanks a lot for your help, I will look at the PR in a bit!

  1. Add a mechanism for pinpoint overrides

Such a mechanism exists in principle, as each API can have overrides to any of its properties, but the implementation would have to be adjusted to use that to drive the decoding or encoding of values.

With #463 that can probably be postponed indefinitely though.

andrewbaxter commented 8 months ago

FWIW I made #463. Having base64/datetime automatically parsed would be awesome, but I don't have any great ideas. Maybe if someone else wants to tackle it, it could be un-reverted.

codyoss commented 8 months ago

@andrewbaxter can you provide me with a little more context on what you are encoding/decoding? Which type to use depends on which part of a req/resp you are dealing with.

Generally speaking:

  1. message bodies: standard RFC 4648
  2. query params: URL encoding defined in the above RFC
  3. authy JWT stuff: unpadded URL encoding

(all off the top of my head)
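In Python terms, and purely for illustration (which encoder a generated client should call per field is exactly what's under discussion here), that guidance maps roughly to:

```python
import base64

payload = b"\x92\xc8\n1"  # the bytes behind "ksgKMQ==" from the tests above

# Message bodies: standard RFC 4648, '='-padded.
assert base64.b64encode(payload) == b"ksgKMQ=="

# Query parameters: the URL-safe alphabet from the same RFC.
url_safe = base64.urlsafe_b64encode(payload)

# JWT segments: URL-safe with the '=' padding stripped.
unpadded = base64.urlsafe_b64encode(payload).rstrip(b"=")
assert not unpadded.endswith(b"=")
```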

andrewbaxter commented 8 months ago

Oh yeah! Thanks for taking the time to respond!

I think it's just the JSON request/response bodies from the APIs so it wouldn't include query parameters or auth. The generation comes from, for example: https://texttospeech.googleapis.com/$discovery/rest?version=v1beta1 .

https://developers.google.com/discovery/v1/type-format says

A padded, base64-encoded string of bytes, encoded with a URL and filename safe alphabet (sometimes referred to as "web-safe" or "base64url"). Defined by RFC4648.

So the assumption was that this should be web safe base64 everywhere, but people encountered various APIs which are returning standard base64 strings. AFAICT the golang implementation seems to contradict that and is nearly all standard base64, but there are a few things that appear to use urlsafe, and python appears to use urlsafe if I read the code correctly...

I was wondering if you could give a hint on which one is correct, or if you had hints on the correct way to deal with base64 in json in an API implementation.

So you think standard encoding would be the safest to use here? Do you know if there are a lot of apis that deviate from that (i.e. few enough to be fixable by hand)?

codyoss commented 8 months ago

AFAICT the golang implementation seems to contradict that and is nearly all standard base64, but there are a few things that appear to use urlsafe, and python appears to use urlsafe if I read the code correctly...

You are correct for our Go libraries. I can't speak for the python ones -- I assume some libraries have either fallback behaviour or just handle bytes in different ways at a language level.

A good test case for handling bytes in a body is a pubsub message receive RPC.

I think it's just the JSON request/response bodies from the APIs so it wouldn't include query parameters

Some fields laid out in discovery are query params and not a part of a body

So you think standard encoding would be the safest to use here?

Yes

andrewbaxter commented 8 months ago

Perfect, thanks @codyoss ! In that case @Byron would you mind if we do this?

I just generated the urlsafe code from this branch locally and I'll do some testing. I'm not sure if there are tests that need to be fixed, it looks like the jobs expired - are you able to rerun the tests? Or I could make a new MR (unless @emarcotte wanted to continue this).

Byron commented 8 months ago

Actually, it doesn't even let me re-run the tests, so I think it will be easiest if you build on this PR and make it work as much as possible.

Thanks so much for pushing this fix to completion.

emarcotte commented 8 months ago

Hey guys, sorry I've been absent for months, I've been completely slammed by Real Life (TM). I had intended to just go back to the byte value but never found time :(

Edit: bytes, not string, brain is fried :(

andrewbaxter commented 8 months ago

No worries! Cool, I'll take it over then if that's alright.

andrewbaxter commented 8 months ago

I made #464 with a new build, no new changes

Byron commented 8 months ago

Great! After all, the PR could be merged, and apparently we all lost track here in this PR somehow. Thanks @andrewbaxter for guiding us to completion.

Before re-releasing all crates, I think it's worth taking another look and making small last-minute changes in the next couple of days. Then, when deemed ready, I can cut the release.

Thanks everyone.

andrewbaxter commented 8 months ago

Awesome, thank you too for help with the merge! Sounds good to me.