Specify the audit log CSV format

lognaturel commented 5 years ago

This spec intends to provide sufficient information for downstream tools to process client audit logs and for additional clients to create compatible logs.

I considered linking to https://tools.ietf.org/html/rfc4180 but it prescribes CRLF linebreaks and is more verbose than I think we need. I also ended up omitting "each line should contain the same number of fields throughout the file" because it seems easy enough for downstream tools to end a row on linebreak. Should it be included?

I verified this by running bundle exec jekyll serve and looking at the generated site in Chrome.

Perhaps I should consolidate the templates for sub specs now that we have two. Or maybe we wait for one more? 😬

CC @grzesiek2010

MartijnR commented 5 years ago

Thank you!

Perhaps I should consolidate the templates for sub specs now that we have two. Or maybe we wait for one more? 😬

Unless, you are really eager, I the 3rd sub-spec would be a good time! :)

lognaturel commented 5 years ago

I came to an unfortunate realization at https://github.com/opendatakit/collect/issues/3126. The way the XPath paths are written out by Collect is idiosyncratic: group names are always followed by a position predicate ([1]) except in the case of terminal groups with the field-list appearance. That is, /audit-group-multiplicity/g1[1]/g2[1]/g3[1]/q3 instead of the more expected /audit-group-multiplicity/g1/g2/g3/q3 for a question in three levels of groups.

Those paths are not incorrect so it doesn't matter for analysis tools that actually speak XPath. We do know that folks are writing their own analysis tools or doing quick visual checks, though, and for them it's important that the exact node names be consistent, I think. And ideally, that would include consistency between clients.

I see a couple of options:

document that position predicates are always included for container form control (groups and repeats) except for terminal groups with the field-list appearance
change what Collect does to not have those position predicates. This would break some things but it's certainly easier to do now than later.

@MartijnR, do you have a strong preference one way or the other? A different option I'm missing?

MartijnR commented 5 years ago

except for terminal groups with the field-list appearance

That sounds intriguing and surprising. Maybe I don't understand.

But in any case, my preference would be to not use position predicates except for repeated groups. For repeated groups it might be helpful to always include positions (even if only 1 instance)

MartijnR commented 5 years ago

Actually, this brings up another point that we had to resolve in OpenClinica's fork (though not related to logs for them). You can remove a repeat instance (e.g. the second, and then later create a new second instance, or the already existing third instance becomes the second). To deal with this (in audit log data) we added an immutable ordinal attribute to repeats, as well as a last-used-ordinal (to keep track of this in case the last repeat instance is removed). Both are in http://enketo.org/xforms namespace.

lognaturel commented 5 years ago

That sounds intriguing and surprising. Maybe I don't understand.

Indeed. Imagine that you have a method that outputs paths with position predicates for every node and then that you have a client that does navigation with page flips. It looks like the position predicates were stripped off nodes that correspond to pages -- questions or field lists. I'm kicking myself for somehow not noticing.

my preference would be to not use position predicates except for repeated groups.

Yes, I agree.

For repeated groups it might be helpful to always include positions.

Agreed.

You can remove a repeat instance (e.g. the second, and then later create a new second instance, or the already existing third instance becomes the second).

I think we had considered the shifting repeat positions not worth a special case but it's true that it's now much more difficult to do things like calculate how much time was spent in a specific repeat instance or track answer changes in a specific repeat instance.

To deal with this (in audit log data) we added an ordinal attribute to repeats.

Can you say a bit more about this? Do you mean Enketo core keeps counters of repeat instances over the lifetime of a record and includes an ID from that count in the log?

MartijnR commented 5 years ago

It's like this:

<repeat enk:last-used-ordinal="4" enk:ordinal="1">
   ...
</repeat>
<repeat enk:ordinal="3">
   ...
</repeat>

The ordinals are 1-based and are stored in the model/submission, so loading a submitted record for editing maintains the state.

In the above example 4 <repeat> elements were created and the 2nd and 4th were deleted. A new repeat would get ordinal 5.

P.S. OC has a higher-than-normal need for this, because each new value or new repeat is immediately submitted to the server as a fragment (and not the final complete record). I wonder if that is also meant to be trackable in the audit logs (all the intermediate states before submission).

lognaturel commented 5 years ago

For repeated groups it might be helpful to always include positions (even if only 1 instance)

@MartijnR and I discussed this briefly in Slack and agreed to always include positions for repeats. This avoids the case where the first repeat instance is initially logged as /data/repeat but then if edits are made to it after more repeat instances are added, it's logged as /data/repeat[1].

Filed #248

getodk / xforms-spec

Specify the audit log CSV format #234