OAI / oascomply

URI/IRI library #7

Closed by handrews 1 year ago

handrews commented 1 year ago

URI and IRI support in Python is a dumpster fire of sadness. It is "good enough" for most day-to-day purposes, but not for a project that specifically focuses on correctness and compliance. Considering some of the more commonly used libraries:

rfc3986 2.0.0 also emits deprecation warnings when you use certain non-deprecated functions. I have contributed a fix for this, but it's unclear when they will get around to publishing a new release.

For IRIs, all of the above means that non-URI Unicode characters are percent-encoded, rendering the IRI unreadable as an IRI. It's not entirely clear if there's any need for true IRI support for OAS 3.x compliance, but presumably it will be a concern for 4.0. The options for supporting unencoded IRIs are very sketchy:

None of this even gets into scheme-specific parsing, or media-type-specific fragment parsing.
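To make the IRI readability problem above concrete, here is a minimal sketch using only the standard library's `urllib.parse` as a stand-in (it is not one of the libraries discussed here, and the helper name `to_uri_path` is made up for illustration). It shows how a readable IRI path turns into percent-encoded UTF-8 bytes once it is forced into URI form:

```python
from urllib.parse import quote, unquote


def to_uri_path(iri_path: str) -> str:
    """Hypothetical helper: percent-encode every non-ASCII character.

    quote() encodes each character as its UTF-8 bytes, so the result is a
    valid URI path but no longer readable as an IRI.
    """
    return quote(iri_path, safe="/")


readable = "/caf\u00e9"          # "/café" as an IRI path
encoded = to_uri_path(readable)  # "/caf%C3%A9"

# The round trip is lossless, but only the decoded form is human-readable.
assert unquote(encoded) == readable
```

The encoding itself is not wrong; the problem is that a library which always does this cannot hand back the original unencoded IRI.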

To the best of my knowledge, the only code that handles everything correctly is the abnf package, which includes ABNF support for RFC 3986 and 3987, but produces a huge parse tree as a result and is probably significantly slower (although I haven't measured it). Its error reporting is also incomprehensible.

To add further confusion:

So at minimum there are two URI-ish classes floating around the code, one of which wraps a third, none of which come from the standard library, and all of which have different interfaces.

┻━┻︵ \(`Д´)/ ︵ ┻━┻

Clearly the only answer is to write another library. More seriously, wrap this dumpster fire in a facade so that we can be less fiddly but still change implementations until finding the right correctness/convenience/performance balance. We are unlikely to know what constitutes "reasonable" performance until the system is near-complete and can process enormous specs like the GitHub API. A facade will make it feasible to quickly measure different alternative implementations.
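A rough sketch of what such a facade could look like, assuming a deliberately tiny interface. Every name here (`UriParts`, `Uri`, `set_backend`, `stdlib_backend`) is hypothetical and not taken from the oascomply code, and the `urllib.parse` backend is only a non-validating placeholder for whichever library ends up behind the facade:

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UriParts:
    """Hypothetical, implementation-neutral parse result."""
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


# A backend is any callable that maps a string to UriParts.
Backend = Callable[[str], UriParts]


def stdlib_backend(value: str) -> UriParts:
    """Placeholder backend using urllib.parse; it does not validate."""
    parts = urlsplit(value)
    return UriParts(
        scheme=parts.scheme or None,
        authority=parts.netloc or None,
        path=parts.path,
        query=parts.query or None,
        fragment=parts.fragment or None,
    )


_backend: Backend = stdlib_backend


def set_backend(backend: Backend) -> None:
    """Swap the implementation behind the facade, e.g. to benchmark it."""
    global _backend
    _backend = backend


class Uri:
    """Facade class: calling code depends on this, never on a backend."""

    def __init__(self, value: str) -> None:
        self._value = value
        self._parts = _backend(value)

    @property
    def scheme(self) -> Optional[str]:
        return self._parts.scheme

    @property
    def authority(self) -> Optional[str]:
        return self._parts.authority

    @property
    def path(self) -> str:
        return self._parts.path

    @property
    def query(self) -> Optional[str]:
        return self._parts.query

    @property
    def fragment(self) -> Optional[str]:
        return self._parts.fragment

    def __str__(self) -> str:
        return self._value
```

Because callers only ever touch `Uri`, trying a different parser means writing one more backend function and calling `set_backend`, which also makes it straightforward to time the alternatives against the same large spec.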

Requirements:

handrews commented 1 year ago

Looking more closely at rfc3987, the function I tried defaults to non-validating parsing, but it's trivial to enable validation, which seems to be appropriately strict. It seems like the best candidate for validation, but it does not offer any sort of convenience class packaging up relevant functionality.

handrews commented 1 year ago

New classes wrapping rfc3987 with a more rfc3986-ish interface were included in PR #16.