adlnet / xAPI-Spec

The xAPI Specification describes communication about learner activity and experiences between technologies.
https://adlnet.gov/projects/xapi/

X-Experience-API-Consistent-Through is latest timestamp, not just any timestamp #958

Closed garemoko closed 8 years ago

garemoko commented 8 years ago

fixes #956

fugu13 commented 8 years ago

I'm moving the conversation to here. This PR is not correct, because that exact time is virtually impossible to ascertain (whereas a realistic, pessimistic time is much more possible to ascertain).

As we've discussed previously, time is a much more difficult concept with computers than it first appears. A few illustrative cases might help explain the difficulties with the consistent-through header.

1) Imagine an LRS that, as Wax does, uses a queue between intake of statements and writing them to the database. This gives us huge elasticity and very high uptime, since we can accept statements even if our core database is down. We set stored when we receive the statements, but don't insert them into the database until later, and not in a guaranteed order. We can't just use the max of the stored times in the database, because statements with earlier stored times might still arrive. We set consistent-through based on a pessimistic estimate derived from historical queue delays.

2) Imagine an LRS that writes data immediately to its database, but uses Elasticsearch. Elasticsearch can go into split brain (and does, to varying degrees, quite frequently in any long-lived cluster). Not only can it spontaneously lose data in split brain (well documented), but incoming data can go to just one of the 'halves' of the cluster, meaning queries to either half could be missing recent data. Since detecting split brain is very hard (there's a long period where Elasticsearch doesn't notice), the only reasonable approach is to set consistent-through based on a delay likely to exceed any undetected split-brain period, then bump that up based on the details once split brain is detected.

3) Imagine the same LRS using Elasticsearch, but you've somehow guaranteed split brain doesn't cause that issue (this is impossible without requiring a read from every server and going down when any is unavailable -- which will lead to a lot of erroring queries). Elasticsearch is eventually consistent. There's a window after every single write before it is available in results, and due to clock variation between servers, a write can become available on one server before it is available on others. You must subtract from the current time to compute consistent-through.

4) Imagine another LRS using a strictly serializable (this is much stronger than mere serializability) database. None of them do (partly because very few strictly serializable databases exist). Transactions are not instantaneous. Imagine a read transaction spanning times 4 to 23, and a write transaction spanning times 1 to 7. The read transaction won't see the write, which might have a stored time set as early as 1, or as late as 7. The max stored time the read will return comes from 'previously' committed statements, which, if those stored times are being set by the database (if the web server sets them you get none of these guarantees), is 4 or less. If the read gets 3, and the write transaction is assigned a stored time of 1, you've just violated the rules for x-experience-api-consistent-through. Since situations like that will arise all the time, the only okay value for consistent-through is a delay based on pessimistic estimates of how long that overlap can be. I emphasize again that extremely few databases provide this consistency level, so for most databases and LRS setups, for a wide variety of reasons, you'll need an even larger and more pessimistic estimate (for instance, MongoDB makes a lot of claims about not having certain of these problems, but when tested it turns out they aren't correct -- you'll need a large delay on consistent-through).
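All four scenarios point at the same implementation pattern: derive consistent-through from the current time minus a pessimistic delay, never from the max stored time in the database. A minimal sketch in Python (the function name and delay figures are illustrative assumptions, not taken from any real LRS):

```python
from datetime import datetime, timedelta, timezone

def consistent_through(base_delay_seconds=30, extra_penalty_seconds=0):
    """Pessimistic X-Experience-API-Consistent-Through value.

    base_delay_seconds should exceed the worst historically observed
    queue/replication/visibility lag, plus the recommended buffer of
    several seconds. extra_penalty_seconds can be bumped up when a
    degraded condition (e.g. detected split brain) warrants it.
    """
    now = datetime.now(timezone.utc)
    delay = timedelta(seconds=base_delay_seconds + extra_penalty_seconds)
    return (now - delay).isoformat()
```

The important property is that the value is always in the past by a margin the LRS can defend, and only ever moves further into the past when conditions degrade.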

I'm happy to go through more scenarios as desired to illustrate how anyone attempting to compute the value of this header from an exact response from the database is doing it wrong, as is anyone assuming they can just give the current time.

If anyone has a particular setup they'd like me to go through and help understand sources of inconsistency, I'd be happy to.

garemoko commented 8 years ago

The definition in this PR is:

...a value of the latest Timestamp for which all Statements that have or will have a "stored" property before that time are known with reasonable certainty to be available for retrieval. This time SHOULD take into account any temporary condition, such as excessive load, which might cause a delay in Statements becoming available for retrieval.

So we're already allowing "any temporary condition" which I think covers the four cases you outlined.

What this PR is clarifying is that if there are a range of timestamps between

  1. the latest point where even taking into account "any temporary condition", all statements will be available for retrieval and;
  2. the earliest point after the stored property of the last statement stored prior to point 1; then the X-Experience-API-Consistent-Through should be point 1, not point 2.

I.e., if an LRS has received no statements for 100 years, the X-Experience-API-Consistent-Through header for a request sent now should be "recently", not "100 years ago".

With the current wording (before this PR), "100 years ago" would technically satisfy the requirement as "all Statements that have or will have a "stored" property before that time are known with reasonable certainty to be available for retrieval". In fact, "1000 years ago" would also satisfy the requirement as there are no statements with a stored property before 1000 years ago that are not available for retrieval.

This PR adds the single word "latest" to make clear that the LRS is trying to maximise the value of X-Experience-API-Consistent-Through up to the limit of 'reasonable certainty' of consistency. Nothing changes here relating to how realistic the time is.

Let me know if you disagree with any of the specifics of what I'm trying to achieve with this PR. Alternative wordings are also welcome.

fugu13 commented 8 years ago

No -- only the second case is temporary; the others are ongoing facts of how the LRS works. And even the second is undetectable (at least for a time), so it must be accounted for by the LRS by default rather than by detecting the temporary condition -- that is, even if the condition is temporary when it occurs, the accounting for it can't be.

As for the example you gave, there's no negative impact of returning 100 years ago rather than anything between then and now. So long as it is some point the LRS can guarantee it is consistent through, everything will work fine -- now, if the LRS never updates the time then syncing against it will be difficult, but there's not much to be done about that.

The language you're proposing to introduce adds considerably more confusion over what to return given how drastically impossible it is to return a guaranteed "latest" time -- I would far rather an LRS add an additional several seconds of buffer, as any time it thinks is the latest possible is almost certainly wrong.

If we were to add language, we could add a recommendation: "This time SHOULD be fairly recent, even if there are no recently received statements" (I capitalized SHOULD, but it may as well be lowercase; it is really more advice than anything else).

garemoko commented 8 years ago

Ok, that works. I'll go for "It is expected that" in place of SHOULD.

andyjohnson commented 8 years ago

+1 , I think this is good based on the conversation, but I'd like @fugu13 's blessing.

stevenvergenz commented 8 years ago

Based on this discussion, the Experience API is not intended to be a real-time system. Is that correct? It seems to me like there are a ton of desirable use cases that require relatively low latency.

Perhaps we should consider some means of at least detecting statements in limbo, i.e. those that have been received but not processed/stored. Not in this PR, nor in this patch probably, but at some point this needs to be addressed.

garemoko commented 8 years ago

If delays whilst servers receive and store data counts as "not real time", can anything be considered to be "real time"?

stevenvergenz commented 8 years ago

Well, it's not real time if timing is not considered during processing. As written, there is no upper or lower bound on the amount of time it takes to process and store a statement, so regardless of any arbitrary threshold for "real-time" that I could give (I'd say 15 seconds like HTTP), an LRS isn't guaranteed to meet it.

As I've defined it, being "fast" isn't a requirement for being real-time though. Just provide an ETA or something.

brianjmiller commented 8 years ago

The distinction between "is not intended to" and "can't" is the point. Just because the spec allows differentiation in implementation doesn't mean it is a fault in the spec. If your use case demands near real time result handling, then choose (or build) the LRS that satisfies your use case. Forcing this type of requirement on the implementation has largely been outside the spec's scope since its inception, and it was intended that way.

fugu13 commented 8 years ago

What @brianjmiller said. I'll also note that "potentially have to wait seconds to guarantee statement visibility to all readers" is very very different from "typically wait seconds to guarantee statement visibility to all readers".

It is extremely, extremely hard to get past the first state -- it is even possible with extremely few databases, and the ones it is possible with mostly aren't used that way, since it decreases throughput and increases the likelihood of rejected requests and/or outages. But reaching the second state is much more attainable (even if it matters less often than it comes up).

fugu13 commented 8 years ago

Also @stevenvergenz x-experience-api-consistent-through is an ETA in the way you seem to be using it, AFAICT

DavidTPate commented 8 years ago

Yeah, it definitely appears to be an ETA as opposed to more of an expectation. When dealing with continually fluctuating volumes of xAPI statements, it's very tough to determine with any form of accuracy when something would definitely be available.

Our clients typically have launch dates for their training, where we will get hit with tens of thousands of users over the course of the 30 minutes or so after they release their training. During this time there's scaling that goes on with our front-end Web servers (handling the incoming statements), scaling of our queueing system, scaling of the consumers of our queueing system, and scaling of our data store, all of which contribute to a high level of complexity in providing some form of ETA.

During the times we are under a small load, we could likely state that statements should be retrievable within a second, but under a variable load it's near impossible, as you have to take into account both the volume received before the statement and the volume likely still coming into the LRS.

DavidTPate commented 8 years ago

So what is the purpose of the X-Experience-API-Consistent-Through header? I'm asking because I wonder if there is a different way that we could solve the problem that this is attempting to address.

To me, it seems that the purpose of the X-Experience-API-Consistent-Through header is to provide Statement Producers who are also Statement Consumers the ability to know when they would be able to retrieve a statement they sent to the LRS previously. What is the use case around them doing this?

fugu13 commented 8 years ago

x-experience-api-consistent-through is there for replication. The idea is, so long as the replication process always makes each Statements GET request with a 'since' parameter at or before the most recently received x-experience-api-consistent-through value, the replication process will not miss any statements as it goes. This is why I strongly, strongly recommend every LRS include several seconds of buffer above and beyond whatever they think their longest possible time will be -- because in practice such estimates are almost always wrong.
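That replication pattern can be sketched as follows (the `fetch_statements` and `copy_to_target` helpers are assumptions standing in for real HTTP and storage code; timestamps are compared as ISO-8601 UTC strings):

```python
def replicate(fetch_statements, copy_to_target, initial_since):
    """Replication loop driven by X-Experience-API-Consistent-Through.

    fetch_statements(since) is an assumed helper that issues
    GET /statements?since=<since> and returns a pair
    (statements, consistent_through), where consistent_through is the
    X-Experience-API-Consistent-Through response header value.
    """
    since = initial_since
    seen = []
    while True:
        statements, consistent_through = fetch_statements(since)
        # Statements near the boundary may be returned more than once;
        # a real replicator should deduplicate by statement id.
        copy_to_target(statements)
        seen.extend(s["id"] for s in statements)
        if consistent_through <= since:  # no further progress guaranteed
            break
        # Never advance the cursor past what the LRS guarantees.
        since = consistent_through
    return seen
```

Advancing `since` only as far as the header guarantees is what makes late-arriving stored times safe: a statement whose stored time predates the cursor is promised to already be visible.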

stevenvergenz commented 8 years ago

Then if there's no way to anticipate programmatically when a statement will be available, for testing purposes should we simply ask the user how long the test suite should wait for a statement to be stored?

fugu13 commented 8 years ago

What x-experience-api-consistent-through does is provide that amount of time. Examine the value of the Date header in the response to the statement send. Then periodically run the test query. When x-experience-api-consistent-through exceeds the send Date, the statement should be present. A workable estimate of how long that will take is any recent difference between the Date and x-experience-api-consistent-through headers on the same response.

Since statements will often be available much sooner, the test query should be run periodically with back-off, starting shortly after the statement is sent. Input from the test runner isn't necessary.

stevenvergenz commented 8 years ago

Actually, that's not true. The Consistent-Through header is in relation to a statement's "stored" time, which is also indeterminate. Knowing the server time, or the time the statement was received, is not indicative of when it will be stored.

fugu13 commented 8 years ago

That's true -- you can't use it as an exact value like I wrote, though since x-experience-api-consistent-through is already an estimate, it's a very reasonable starting place. But I still wouldn't ask the test runner. Exponential back-off with a maximum based on x-experience-api-consistent-through (twice would be plenty to convince me of test failure, but with back-off the exact maximum picked shouldn't matter too much, so perhaps 10x) is the way to go.
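A sketch of that back-off scheme (the helper name and timing constants are illustrative assumptions, not from the spec or any test suite):

```python
import time

def wait_for_statement(check_present, estimate_seconds, max_factor=10):
    """Poll for a sent statement with exponential back-off.

    check_present() is an assumed helper that runs the test query and
    returns True once the statement is visible. estimate_seconds is a
    starting estimate, e.g. a recent (consistent-through minus Date)
    difference observed on a response. Gives up after roughly
    max_factor * estimate_seconds, per the 10x suggestion above.
    """
    deadline = time.monotonic() + max_factor * estimate_seconds
    delay = estimate_seconds / 8  # start polling soon after the send
    while time.monotonic() < deadline:
        if check_present():
            return True
        time.sleep(delay)
        delay = min(delay * 2, estimate_seconds)  # exponential back-off
    return False
```

Because statements are often visible almost immediately, the early short polls usually succeed, and the generous deadline only matters for the failure case.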