Very long identifiers (data URIs)

garemoko commented 5 years ago

We've recently come across an edge case where a system has used a data URI representing as an activity id. This appears to have occurred not by deliberate design, but rather as a consequence of translating activity stream data into xAPI, perhaps without realizing that data might use data URIs.

I'm still gathering the details, but it appears like Watershed's LRS handled these very long URIs fine, but there were some problems to resolve in the reporting, which was not expecting IRIs of this length.

Previously we have said that we don't want to limit individual properties, but LRSs could limit overall statement length to a reasonable size. I wonder if we should re-visit that in the case of identifiers and also if data URIs should be specifically excluded for use as an IRI.

DataBeeGood commented 4 years ago

@garemoko
I concur. From a design perspective for interoperability and usability, it is important to consider the limitations of any technology or software which is implementing xAPI. Can't have servers choking... Some notes from a casual browsing on JSON size, strings and IRI specifications in general...

IRI - was an IETF project which folded in 2014... IRI Status from W3C Absolute resolution for IRI is required by RDF systems and some confusion on JSON-LD since 2011. xAPI would need to specify its own more modern constraints... It is actually a tough, old problem...
Several places mention Data URI as not being handled well in query strings and conversion can be very sticky between URI and IRI - something about a comma too...
IBM, AWS, Microsoft Azure, ,NET ,Apache, SPARQL all have character or size limits for parsing JSON and strings in general. 252 characters UTF-8 seems to be the most conservative. Some browsers and servers are only concerned with server memory limits and total serialized string length

thomasturrell commented 3 years ago

I believe that there should be a reasonable limit on the size of an activity id.

It should be big enough to be unique but small enough not to cause performance issues (or deliberate DOS attacks). Arguably it should be human readable.

Something downstream of the LRS (browser, load balancer, reverse proxy, web server, framework, etc) is going to limit the size of the request so the LRS does not need to accept unlimited sized statements.

I would argue that if a system is saving very large statements then it is most likely a mistake or misunderstanding.

brianjmiller commented 3 years ago

The problem with this concept is determining what a "reasonable limit" is. Your reasonable limit may not be remotely close (either too large or too small) to someone else's. It may also be a reasonable limit in 2019, or 2020, but might not in 2035, 2050, etc. The specification leaves this concept loose intentionally to allow the LRS to determine what it considers to be a "reasonable limit" by stating:

None of these requirements contradict the idea that the LRS is also allowed to be configurable to reject requests and respond or behave differently on the basis of conditions that are out of scope this specification.

...

Another condition is where the request sent is beyond the size limits set by the LRS. It would be unreasonable to expect the LRS to always accept requests of any size. The LRS can choose any size limit it sees fit, but needs to be configurable so as not to apply size limits during conformance testing. Of course, some size limits will still exist during conformance testing due to limitations of hardware, etc. but it is expected that these limits are sufficiently high so as not to affect the running of tests.

3.2

Placing an arbitrary, required limit on the size of a string, statements, or requests is short sighted. Very early on in the spec development (circa 0.9/0.95) there was an intentional use of data URIs in statement bodies. They were used to capture the oft requested "certificate" use case, granted it was part of the reason why the inclusion of attachments was added in the 1.0.0 release, but it isn't for the specification to decide what the data use cases will be. Limits of this nature were specifically left out of the specification because of the lessons learned from years being beholden to what the SCORM specification had required, for instance limits on suspend data size, or identifier lengths, that continually led to ugly workarounds.

As far as DOS attacks, the spec covers that later in the same section as above by stating:

The LRS can also reject requests or revoke credentials in case of suspected malicious intent, for example an unexpected large number of requests made in a short period of time. It is expected that that limits will be sufficiently high such that the rate of requests made during conformance testing will not trigger any rate limits.

This is specifically talking about number of requests but I think based on the rest of the language of the specification it is reasonable for the LRS to apply limits to the size of requests or data therein.

It is much harder to anticipate what "reasonable" will be within the confines of the specification, than to leave it up to the implementations to impose the limits they need to to remain functional and expect those limits to be reasonably handled upstream by the LRPs.

The statement:

Something downstream of the LRS (browser, load balancer, reverse proxy, web server, framework, etc) is going to limit the size of the request so the LRS does not need to accept unlimited sized statements.

Is actually as good (or a better) argument for the lack of need for this requirement to be in the specification. You've already stated that something in front of the LRS is going to limit size or prevent attacks, so therefore there is no reason to arbitrarily do so in the spec.

I would argue that if a system is saving very large statements then it is most likely a mistake or misunderstanding.

While it might be a reasonable argument, and frequently the case, that is easily solved by testing by one or the other of the parties in the transaction and isn't something that needs to be remedied by the spec itself. OTOH, adding an arbitrary limit for the small fraction of cases where it isn't a mistake (for some definition of "very large") means there is no remedy that can be negotiated between the two parties and a workaround has to be used.

thomasturrell commented 3 years ago

Thank you @brianjmiller, I agree with everything you have said.

adlnet / xAPI-Spec

Very long identifiers (data URIs) #1088