Use content-based UUIDs for all publicly-addressable Surveyor entities

hannahwhy commented 11 years ago

Currently, Surveyor uses v4 UUIDs for entity API IDs. This makes it impossible to load the same survey data into two different databases and maintain referential integrity across Surveyor entities and other entities that may reference Surveyor entities by their API ID.

This is a major difficulty for NCS Navigator Cases' merge process: with the current scheme, we cannot take data generated by a field client and load it into a testing database without also replicating an entire Surveyor loadout.

However, there is no reason why Surveyor must use v4 UUIDs, because any Surveyor object can be entirely identified by its content. Two answers are equivalent if their type and text are equivalent, two questions are equivalent if their list of answers, text, and dependencies are equivalent, and so forth.

Define a scheme for fingerprinting and switch API ID generation to use v5 UUIDs.

yoon commented 11 years ago

Surveyor's various identifiers: reference_identifier, data_export_identifier, common_identifier, and common_namespace are intended to help maintain references across different databases. Are they not going to be sufficient in this case?

Also, my understanding of the field client workflow is that the deployment of the survey to the field client is an important part of the entire process. I would contend that two answers are not equivalent if their type and text are the same... "Yes" to "Are you a smoker?" is different than "Yes" to "Do you enjoy hiking?" and those questions are different based on the context of the other questions in the section, the survey itself, etc.

That said, I think there could be a case here for allowing easier customization of the UUID generation scheme.

hannahwhy commented 11 years ago

Surveyor's various identifiers: reference_identifier, data_export_identifier, common_identifier, and common_namespace are intended to help maintain references across different databases. Are they not going to be sufficient in this case?

They aren't: responses reference answers, response sets, and questions by UUID.

If it were not for this issue, it would be possible to take Field data containing entirely new entities and apply it to any Cases instance on the same MDES version.

. I would contend that two answers are not equivalent if their type and text are the same... "Yes" to "Are you a smoker?" is different than "Yes" to "Do you enjoy hiking?" and those questions are different based on the context of the other questions in the section, the survey itself, etc.

The answer with type T and text "Yes" considered in the context of the question does indeed have a different meaning. However, the answer (T, "Yes") on its own is equivalent to any other answer (T, "Yes"). This is definitely a very restricted definition, but it is the appropriate definition for UUID generation.

Also, the (answer, question) context is handled by the response: two responses R1 = (A1, Q1) and R2 = (A1, Q2) are clearly not equivalent. R1 and R2 therefore should (and, in the current implementation, do) have different UUIDs.

hannahwhy commented 11 years ago

They aren't: responses reference answers, response sets, and questions by UUID.

That is, responses coming back from Surveyor iOS.

rsutphin commented 11 years ago

Surveyor's various identifiers: reference_identifier, data_export_identifier, common_identifier, and common_namespace are intended to help maintain references across different databases. Are they not going to be sufficient in this case?

I believe the issue @yipdw is describing is one of testing/debugging simplicity. He's trying to make the following possible:

Rails application deployment A sends a JSON-serialized survey to iOS application A
iOS application A generates a response set and sends it back to Rails application deployment A which logs it and then attempts to process it.
Something goes wrong.
A developer takes the logged response set and tries to load it into Rails application deployment B, which already has the appropriate surveys loaded.

Currently this will fail because the UUIDs for the survey, questions, etc., in the response set JSON are unique to Rails application deployment A, even though A and B loaded the same versions of the same surveys.

I would contend that two answers are not equivalent if their type and text are the same... "Yes" to "Are you a smoker?" is different than "Yes" to "Do you enjoy hiking?" and those questions are different based on the context of the other questions in the section, the survey itself, etc.

I agree with this — Surveyor's concept of an answer is scoped to a question. The model API IDs need to be able to resolve to the appropriate Surveyor answer instance. That would just mean including that scoping information when generating the identifiers.

hannahwhy commented 11 years ago

I agree with this — Surveyor's concept of an answer is scoped to a question. The model API IDs need to be able to resolve to the appropriate Surveyor answer instance. That would just mean including that scoping information when generating the identifiers.

So, I'm not sure why that is critical.

If you have two responses R1 = ("Yes", "Are you a smoker?") and R2 = ("Yes", "Do you enjoy hiking?"), then the different meanings of "Yes" are evident by examining the response. Both "Yes"es, however, are an answer in the affirmative and can therefore have the same UUID.

My motivation for considering answers on this level is that it makes answers act as the root for a content-based UUID generation scheme, with UUIDs of children influencing the UUIDs of their parents.

Does Surveyor go the other way?

rsutphin commented 11 years ago

Both "Yes"es, however, are an answer in the affirmative and can therefore have the same UUID.

Perhaps conceptually, but that's not how surveyor works. The point of the API IDs is to resolve model instances that exist in the system receiving the API call, correct? There are separate Answer instances for each question, even if their type, text, and all other identifiers are the same. So to me that means that each one needs a separate API ID.

Or are you suggesting that answer resolution should be scoped to the containing question? That could work, but it seems more complicated than the alternative.

If that is what you are suggesting, you would need to base UUID generation on more attributes than type and text. You'd need to include all the assigned identifiers also (reference_identifier, data_export_identifier, etc.); maybe others.

yoon commented 11 years ago

Currently this will fail because the UUIDs for the survey, questions, etc., in the response set JSON are unique to Rails application deployment A, even though A and B loaded the same versions of the same surveys.

Perhaps for testing purposes, we could specify the api_ids in the Survey DSL itself. That way, you'd have consistent api_ids between installation A and B.

yoon commented 11 years ago

I believe the issue @yipdw is describing is one of testing/debugging simplicity. He's trying to make the following possible:

Again, I think this makes a good case for customizing the UUID generation scheme for this specific case, rather than changing it for everyone.

hannahwhy commented 11 years ago

Perhaps for testing purposes, we could specify the api_ids in the Survey DSL itself. That way, you'd have consistent api_ids between installation A and B.

I do not want to -- and indeed cannot -- use special surveys for testing. The surveys must be identical to those used in the production environment. Otherwise, we are just guessing at the cause of a merge failure.

Additionally, specifying an ID for every addressable entity is overkill when content addressing is possible.

hannahwhy commented 11 years ago

If that is what you are suggesting, you would need to base UUID generation on more attributes than type and text. You'd need to include all the assigned identifiers also (reference_identifier, data_export_identifier, etc.); maybe others.

Yes, whatever is necessary to differentiate one answer from another. If answers are scoped to questions, then they should include their question. Something like

answer.api_id = UUID(question.api_id || answer.text || answer.identifiers)  # || is read as "concatenate"

would be sufficient.

hannahwhy commented 11 years ago

Here's an example of what I mean:

https://github.com/yipdw/imperator/blob/master/lib/imperator/ast.rb#L8-L26

The highlighted code generates UUIDs for nodes in an AST generated from a survey program; generation is performed on node content. (The code does this top-down, but that's an implementation detail.) The general idea can be applied to the existing Surveyor codebase.

rsutphin commented 11 years ago

If answers are scoped to questions, then they should include their question.

It occurs to me that scoping to questions is not sufficient either. In any given Surveyor-using system, you may have several versions of one survey. It's likely two versions of the same survey would have many individual questions and answers that were content-identical, but which trace back to different Survey instances. Either the UUID generation or the resolution process would have to take this into account.

If you are doing bottom-up UUID generation, you could consider tracing all the way up to the Survey. The Survey instance content itself (its title, access code, description, identifiers, etc.) is very likely to be the same between survey versions, so the survey's UUIDs would be the same across versions. You could take version into account when generating the survey's UUID to address this, but then you'd be no better off than with random UUIDs: the survey version is an incrementing integer that is deployment-dependent.

So I don't think that bottom-up UUID generation could distinguish the same question in two different versions of the same survey. I believe that top-down generation would have the same problem, plus it would need special resolution logic in order to use the correct scoping for content-identical elements.

Does anyone see a way around this that doesn't involve a radical complexification of the Surveyor data model?

jefflunt commented 11 years ago

The "content based UUID generation" description, as I read it, sounds identical in its purpose as a hash function. Wouldn't comparing the output of a hashing function for similar entities across apps and systems be more to the point than altering how UUIDs are generated? Throw in what attributes you care about for the purpose of finding cross-system equality, hash them, and if the hash value is the same then as far as you define equality by any set of attributes (so long as it's the same set of attributes everywhere) then the entities are equal.

Aren't UUIDs intended to be a unique name/identifier for something without necessarily having anything to do with the thing they are naming? I know that previous versions of the UUID spec made use of some seeded values in the creation of UUIDs, but the goal always seemed to be to generate labels/IDs/names only, while the relationship between earlier version UUIDs to their content was a side effect of the implementation, not necessarily a requirement to get a unique UUID.

I guess what I'm getting at is that by basing the UUID on the content it is no longer, by definition, a UUID - it is a hashing function. On Dec 12, 2012 4:59 PM, notifications@github.com wrote:

If that is what you are suggesting, you would need to base UUID generation on more attributes than type and text. You'd need to include all the assigned identifiers also (reference_identifier, data_export_identifier, etc.); maybe others.

Yes, whatever is necessary to differentiate one answer from another. If answers are scoped to questions, then they should include their question. Something like

answer.api_id = UUID(question.uuid || answer.text || answer.identifiers)

would be sufficient.

— Reply to this email directly or view it on GitHubhttps://github.com/NUBIC/surveyor/issues/386#issuecomment-11314105.

hannahwhy commented 11 years ago

The "content based UUID generation" description, as I read it, sounds identical in its purpose as a hash function.

I propose using UUIDs based on hash functions, yes. That's the v5 bit: v5 UUIDs are UUIDs generated by calculating a SHA1 of a given string. (More detail: a v5 UUID is constructed by taking the SHA-1 of a name and a UUID namespace, which is represented as a UUID. See RFC 4122, section 4.3. v3 UUIDs follow this same scheme but use MD5 as the hash function.)

Aren't UUIDs intended to be a unique name/identifier for something without necessarily having anything to do with the thing they are naming?

No. Name-based UUIDs are designed for purposes like this. Additionally, v1 UUIDs can still be used in some cases where you don't care about exposing machine details.

Wouldn't comparing the output of a hashing function for similar entities across apps and systems be more to the point than altering how UUIDs are generated?

No. The problem is that ID references made in data sent by Field to a particular Cases deployment cannot be reused in a different deployment, despite the fact that identical surveys (and code lists, etc) have been loaded in both deployments. Adding an additional hash-and-compare step complicates testing and adds complexity that is not needed in the application code.

hannahwhy commented 11 years ago

Does anyone see a way around this that doesn't involve a radical complexification of the Surveyor data model?

What problems exist with considering two surveys with identical content (questions, answers, identifiers, dependencies) to be the same version of the same survey?

rsutphin commented 11 years ago

The "content based UUID generation" description, as I read it, sounds identical in its purpose as a hash function.

It would in fact use a hash function. Check out v3 or v5.

Throw in what attributes you care about for the purpose of finding cross-system equality, hash them, and if the hash value is the same then as far as you define equality by any set of attributes (so long as it's the same set of attributes everywhere) then the entities are equal.

The problem is that the goal is not to check for cross-system equality between survey structures. The goal is to take responses (e.g.) and resolve their references to the semantically appropriate survey structures regardless of deployment details.

Aren't UUIDs intended to be a unique name/identifier for something without necessarily having anything to do with the thing they are naming?

They can be, but it's not required. As the wikipedia article says, "the intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination." Defining a way to generate an identifier from content is one way to have an identifier without central coordination.

rsutphin commented 11 years ago

What problems exist with considering two surveys with identical content (questions, answers, identifiers, dependencies) to be the same version of the same survey?

I was unclear, I think. The problem would be that you might have two surveys which were different at some points, but identical at others. The identical parts would get identical UUIDs and so you would not be able to distinguish them while resolving. However, you would ultimately need to distinguish them so that you were associating all of the responses with the corresponding sections/questions/answers in the correct single survey.

The parts that would cause problems are different whether you are doing bottom-up or top-down UUID generation, but I think that both modes would have a species of this problem.

hannahwhy commented 11 years ago

I was unclear, I think. The problem would be that you might have two surveys which were different at some points, but identical at others. The identical parts would get identical UUIDs and so you would not be able to distinguish them while resolving. However, you would ultimately need to distinguish them so that you were associating all of the responses with the corresponding sections/questions/answers in the correct single survey.

Currently, yes: if this was applied as-is in a top-down fashion, you'd end up with lots of answers having the same UUID (because you might have lots of "Yes"es). Those will have to be deduplicated at load time.

I don't think that adds a lot of complexity to Surveyor's data model, though. It does mean that the parser/loader has more work to do, but I don't think it's very complex work.

If every node in a survey AST had a UUID, the parser/loader can generate a table of UUID to entity (doesn't have to be separated by type -- the UUID will let us get away with a flat namespace), and then just save every entity by iterating over the table. If there are relationships that must be maintained during the save phase (i.e. entity A must be saved before B), that makes things a bit harder, but the table idea still works for generating one copy of each unique entity and not saving entities that already exist.

hannahwhy commented 11 years ago

doesn't have to be separated by type -- the UUID will let us get away with a flat namespace

Actually, it probably makes sense to have separate tables per entity. The collision probability is a concern (albeit a very small one), but the more important reason is that Surveyor's API IDs are already scoped by object type and the loader should do the same thing.

hannahwhy commented 11 years ago

I think there's a bit of confusion about "top-down" and "bottom-up". (I've used the terms in different ways, at least.) By "top-down", I think the definition being used here is source to sink in the Surveyor object graph: "a survey's UUID is determined by the survey's content concatenated with the content of all of its directly and indirectly associated entities". A question's UUID is determined by the question's content concatenated with the content of its answers. (And probably dependencies.) And so forth.

Bottom-up would go from sink to source.

yoon commented 11 years ago

Currently, you can override Surveyor::Common.generate_api_id to customize generation with your own scheme. Are you looking for something easier than that?

hannahwhy commented 11 years ago

Overriding Surveyor::Common.generate_api_id is insufficient:

That method has no knowledge of the object being identified.
That method cannot be executed at instance scope (it's on the module's metaclass), so you can't use self.

A content-based identification scheme requires access to said content.

rsutphin commented 11 years ago

Currently, you can override Surveyor::Common.generate_api_id to customize generation with your own scheme. Are you looking for something easier than that?

Using content-based UUIDs will require changes to assumptions about the cardinality of relationships between surveyor structures. Just changing the way the UUIDs are generated will not work.

yoon commented 11 years ago

A content-based identification scheme requires access to said content.

Makes sense. So, can a case be made for this to be changed for everyone using Surveyor?

Achieving this outside of Surveyor is currently possible. It would involve customizing all of the model initializers (actually, the default_args methods)

hannahwhy commented 11 years ago

Makes sense. So, can a case be made for this to be changed for everyone using Surveyor?

The case for changing it in Surveyor (and, therefore, for everyone using Surveyor) is that the API ID for a Surveyor entity should be opaque to all users of Surveyor's API. The only code that needs to know details of the ID are the generator and lookup code, both of which belong to Surveyor.

Additionally, this change is fully backwards-compatible. v5 UUIDs are guaranteed to not clash with v4 UUIDs (the UUID version nibble is 5 instead of 4), and the cardinality changes -- if enforced at survey load time -- do not affect previously-loaded survey data in any way.

This may negatively impact survey load time: generating a hash of content may take longer than generating a random number. However, UUID generation does not require object persistence (it just requires attributes associations to be in place), so I can say that this can be done without incurring a database penalty.

(Indeed, it could make survey loading faster. Deduplication at load time implies (for example) that there's only one Answer entity that means "Yes", which means that you have to save less stuff.)

Achieving this outside of Surveyor is currently possible. It would involve customizing all of the model initializers (actually, the default_args methods)

To do this, you'd end up not using Surveyor::Common.generate_api_id at all: you'd have to implement API ID generation code for each addressable entity. You'd also have to customize the survey loader to perform deduplication or find another way to implement the cardinality changes mentioned by @rsutphin.

It may be possible, but it requires changes to Surveyor that are deep enough that I don't think it'd be maintainable.

We could fork Surveyor to do this for NCS. (IIRC, there was at one point an ncs branch.) However, again, I don't think that overhead is worth it.

hannahwhy commented 11 years ago

Additionally, this change is fully backwards-compatible

Well, actually, no. User code that assumes that an Answer uniquely identifies a Question will have to change to consider the Response. Similar changes apply to other Surveyor entities.

rsutphin commented 11 years ago

Looking through the Survey structure models, I sketched out this tree-ish thing:

survey
|
+- survey_section
   |
   +- question
      |
      +- answer (answer belongs_to question, question belongs_to correct_answer)
      |  |
      |  +- dependency_condition
      |  |
      |  +- validation
      |     |
      |     +- validation_condition
      |
      +- dependency
      |  |
      |  +- dependency_condition
      |
      +- dependency_condition (belongs_to both question and dependent_question)

question_group
|
+- question
|
+- dependency

I was doing this to try to audit Surveyor's own code to see if there were any places where it used reverse references (e.g., from an Answer to a Question). I found some more serious problems.

There are repeated nodes because the structure is not a tree. It's also not a DAG — if we take the direction of edges as being from the has_ relationship (where present) to the belongs_to, there is at least one cycle:

          +-<- correct_answer -<-+
          |                      |
question -+                      +- answer
          |                      |
          +--->--- answers -->---+

This would have to be accounted for somehow in the content-based UUID generation. Perhaps it doesn't matter where the relationship lives? Question's content is still fully described by its own attributes and its children.

The relationships between DependencyCondition and Question are also problematic. It's not a simple container relationship:

question --> dependency --> dependency_condition -+
 | |                                              |
 | +--------<------- question ----------<---------+
 |                                                |
 +-------<----- dependent_question -------<-------+

This one I don't have an idea how to handle. Question's content is fully described by its own attributes and its children, but DependencyCondition's isn't.

NUBIC / surveyor

Use content-based UUIDs for all publicly-addressable Surveyor entities #386