Deleting voided and voiding statements

garemoko commented 7 years ago

(Somewhat related to https://github.com/adlnet/xAPI-Spec/issues/995)

In practice, it's possible that an LRS could find itself with a very large number of voided and voiding statements to the extent that it creates a lot of noise in the system and can have a tangible impact on performance.

I recall a conversation on one of the spec calls where we said that an LRS would be able to delete voided and voiding statements after some time, but I don't think that made this into the spec.

In real-world scenarios where only one LRS is involved, a pragmatic LRS vendor can simply remove the statements and hope the 'xAPI police' never find out with no negative consequence.

In scenarios where two LRSs are involved that are both sending statements to one another, you might initially think that the voiding statements, if deleted by LRS A would simply be recreated when sent back by LRS B. However, if the LRS are pulling statements by since property and at least 1 cycle has completed before LRS A deletes the statements after having sent them to LRS B, then the voiding statements would never come back.

So in summary: it's OK for an LRS to delete voided and voiding statements so long as it waits sufficiently long enough for the voiding statements to propagate throughout the ecosystem it is a part of.

Do we all agree with this? Should it be documented in the spec?

DavidTPate commented 7 years ago

I don't agree with this at all as it no longer ensures that the LRS has a complete set of statements. It makes assumptions about the context of the statements that since they are voided they do not matter (nor do their voiding statements). Additionally it violates the current specification implications:

The certainty that an LRS has an accurate and complete collection of data is guaranteed by the fact that Statements cannot be logically changed or deleted.

If instead we wanted to make it so that we treat Voiding statements another way or don't generate such noise then we should look at how we currently do things. I could see some other ways of us handling things:

Similar to the LRS not returning voided statements by default we could also have it not return voiding statements by default. (easily identifiable by the voided verb).
Instead of having to send xAPI statements to void statements we could have a route to do so. Something like DELETE /statements/{statementId} or something else that doesn't require xAPI statements to void xAPI statements.

aaronesilvers commented 7 years ago

@garemoko I have to agree with @DavidTPate on this, in terms of it violating the current specification. This might be something we want to explore for a 2.0, as I see this (on the face of it, at least) as a serious challenge to the immutability of statements.

I am sensitive to the pragmatic needs of vendors, data scientists and others in reducing the noise in large data sets. If we were to agree, in practice, to adopt practices such as what's suggested by @DavidTPate, that is something we can explore in terms of certification requirements

From a spec conformance perspective,

"it's OK for an LRS to delete voided and voiding statements so long as it waits sufficiently long enough for the voiding statements to propagate throughout the ecosystem it is a part of..."

My questions: How would we test for that? What constitutes a "sufficiently long" time?

andyjohnson commented 7 years ago

I think it is good to have the discussion about reasonable lifecycle expectations of Statements, but agreed that it is a 2.0 or even potentially a separate spec altogether.

On Wed, Oct 5, 2016 at 9:46 AM, Aaron E. Silvers notifications@github.com wrote:

@garemoko https://github.com/garemoko I have to agree with @DavidTPate https://github.com/DavidTPate on this, in terms of it violating the current specification. This might be something we want to explore for a 2.0, as I see this (on the face of it, at least) as a serious challenge to the immutability of statements.

I am sensitive to the pragmatic needs of vendors, data scientists and others in reducing the noise in large data sets. If we were to agree, in practice, to adopt practices such as what's suggested by @DavidTPate https://github.com/DavidTPate, that is something we can explore in terms of certification requirements

From a spec conformance perspective,

"it's OK for an LRS to delete voided and voiding statements so long as it waits sufficiently long enough for the voiding statements to propagate throughout the ecosystem it is a part of..."

My questions: How would we test for that? What constitutes a "sufficiently long" time?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/adlnet/xAPI-Spec/issues/1003#issuecomment-251679124, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQ364Wa3bmsL1Nn96v_oA5UiKCw8Av3ks5qw6o8gaJpZM4KOl54 .

Andy Johnson ADL Technical Team 608-318-0049

garemoko commented 7 years ago

Noise is one aspect but there's also the potential performance drain and real financial costs relating to having a large number of statements half of which contain errors and the other half of which are simply to note that the first half contain errors. 'Large' could be extremely large in some cases, for any definition of extremely large you might have.

Sufficiently long would be determined by the system, but in an ecosystem of two LRSs it would be until LRS A had queried statements from LRS B beyond the stored time of when the voiding statements it send to LRS B would have been stored (based on the consistent through time). That's testable, although this would be a MAY requirement, which we haven't normally tested.

In fact, right now I'm not sure we could test that an LRS isn't already doing this since the conformance test suite runs over a non-infinite time period, there's nothing to say the LRS didn't delete the statements after the test completed.

DavidTPate commented 7 years ago

You're making an assumption here that statements are only voided when they "contain errors" but that isn't the only use of a voiding a statement.

garemoko commented 7 years ago

It's certainly one use - why does it matter that there could be other uses? (And why would you void something that's not wrong?)

DavidTPate commented 7 years ago

Yeah, it's definitely one use but if this is a truly interoperable system then each of the systems will work similarly (hence the specification). Have some LRS's which remove Voided messages after lets say 1 year and others that don't (or do other periods) just serves to fragment the interoperability in my opinion as it would not longer be predictable.

Statements aren't just voided because they "contain errors", we even state that in the spec currently:

However, not all Statements are perpetually valid once they have been issued. Mistakes or other factors could dictate that a previously made Statement is marked as invalid.

In our case we have some uses where the statements are voided because they no longer apply (but are factually correct) along with some other uses based upon state. The thing is that the specification is very open here and it doesn't just apply to those that "contain errors".

brianjmiller commented 7 years ago

But it does say they are "invalid" so I wonder how you distinguish "contain errors" and "invalid". I'd question your use of voiding in that case, it sounds like you are using it for state maintenance which is definitely a design smell if not just wrong. Voiding should not be employed to undo an action of another statement, it is an undo of a statement itself, effectively saying it should not be (or have been) considered in the stream of statements at all. Undoing an action of another statement should be a new statement that reverses the original's position, it isn't to take it out of the stream.

I agree with @garemoko's comments here. And in a pragmatic world where the LRS is itself a closed system or used in a closed system, then who cares.

garemoko commented 7 years ago

Let's dig into that scenario of LRS A deletes voiding and voided statements (I'm going to call these collectively 'void' statements) after 1 year and LRS B deletes void statements after 6 months.

It does feel in general terms that there could be an interoperability glitch here, but I can't think of any specific problems in the real world. Can anybody else think of any specific problems here?

DavidTPate commented 7 years ago

With the current specification and retention of statements you could fully replay the entire history of the xAPI statements since none should be removed from the LRS. So yes, as long as you are keeping the systems in sync in near real-time (or with some amount of delay) things will be fine.

But what if you add LRS C to that and you need to catch it up to the current state of the others. We have a number of cases for that currently where we are dynamically adding and removing LRS from our chain for a few different scenarios and also synchronizing systems to catch up with the history.

brianjmiller commented 7 years ago

Adding LRS C doesn't matter because it will either ignore voiding statements for statements it doesn't have, or it won't have known about statements it never saw (and shouldn't because they shouldn't have been seen).

RoboSparrow commented 7 years ago

What about Spam or DoS attacks (leaking auth token or a public LRS)?
You are adding a zillion voiding statements on top of a zillion malicious statements.

adlnet / xAPI-Spec

Deleting voided and voiding statements #1003