OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data
http://atlas-demo.ohdsi.org/
Apache License 2.0
266 stars 136 forks source link

Add UUID to Cohort Definitions #496

Closed jduke99 closed 1 year ago

jduke99 commented 6 years ago

Desired behavior

When a cohort is created in Atlas, it should be assigned a GUID. This GUID should be included in the cohort JSON. The GUID should be written to the COHORT_DEFINITION table.

Every time the cohort is Saved, a new GUID should be created and the JSON and COHORT_DEFINITION table should be updated.

Rationale

For network activities, it would be extremely helpful to have a code that is consistent across every site for every cohort. We cannot use cohort_id for obvious reasons. This would essentially create a unique versioned code for every cohort and every version of that cohort.

This will be hugely helpful for many reasons. Please let me know if any objections, otherwise we will code this a send in a pull request.

gklebanov commented 6 years ago

Hi Jon. That is a good and useful change, indeed. But that would require a change to the COHORT_DEFINITION table. Do you envision a new column/field in the COHORT_DEFINITION table (http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:cohort_definition) and JSON? How are we going to handle upgrades and legacy data? Just some questions for all to think about

From a general idea, we have discussed this adding this field into cohort definition in a broader model perspective at the Architecture WG, would be great to bring you into that discussion @fdefalco

cgreich commented 6 years ago

Friends: And how are we making sure the GUIDs created by different Atlantes (this is the correct plural) are not clashing?

jmbanda commented 6 years ago

How about either using a site and dataset specific prefix? Or using a said site and dataset ID as part of the hashing function to generate the GUID?

This ID can be generated when a site registers a dataset on the data network page.

On Oct 27, 2017 10:29 AM, "Christian Reich" notifications@github.com wrote:

Friends: And how are we making sure the GUIDs created by different Atlantes (this is the correct plural) are not clashing?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Atlas/issues/496#issuecomment-340034712, or mute the thread https://github.com/notifications/unsubscribe-auth/AJ8OiCQYbMCrt-EiWMpX17fPRqcBp4g0ks5swhLygaJpZM4QJP0n .

cgreich commented 6 years ago

Oh! Definitely the latter. The former - that's what brought us the NDC. :) It's chaos.

A hashing function: We don't need site-id for that, as long as the hash created from, say, the abstract is long enough that duplication risk is negligible.

chrisknoll commented 6 years ago

Could you go a bit deeper into how this GUID will be used? We've had many discussions over many years about a global numbering scheme, but nothing stuck because of the 'devil in the details'. For example:

Your note on versioned cohorts also raises question (which versions of cohorts was also discussed, but remains unsolved): With versions of cohorts, when you store a result, do you attach a cohort_id + version_id to each record in the results schema? So all T-C-O pairs consist of t_id, t_ver, c_id, c_ver, o_id, o_ver?

I think a hashing function makes sense in the context of verifying the 'integrity' of a cohort definition. For software downloads, you are usually provided with an MD5 hash to verify the binary that you've downloaded was not altered. Likewise for cohort definitions, when using a 'global' cohort definition, if it has an MD5 hash associated with it, you can verify the integrity of the cohort definition with an expected hash. but I don't see it useful for identification.

-Chris

fdefalco commented 6 years ago

I would like to suggest that before we suggest modifying the structure of the cohort definition that we publish a specification for the complete cohort definition document. We have committed to doing so for all our data objects including concept sets and cohort definitions. With a published specification we can have the discussion on what we might want to change about it, including unique identifiers.

Additionally, we've been discussing specification repositories as native json documents, potentially in a document store like mongodb. I think we should try and align on the specifications and specification store before altering the specification itself.

I'm supportive of continuing the conversation here and would also suggest we use this topic for this coming week's architecture deep dive on our call if people would be available to discuss.

jduke99 commented 6 years ago

Glad I struck up a good conversation!

I am not surprised to hear that wiser minds than mine have already been thinking about it. The OHDSI work for FDA has pressed the issue a bit as we need a perfect level of confidence in matching cohort definitions no matter where they live. (And the FDA timeline is quite short.)

1) I agree with @fdefalco that we need to have full specification for the cohort definition document.

2) Yes this would involve an additional field in both the table and JSON. But I believe it could be done so as to backwards compatible.

3) There will be no collisions. This is the point of a GUID. The likelihood of a collision is 1 in 2.71 quintillion. If the OHDSI community produced a billion new cohort versions every second for 85 years it would reach a 50% likelihood of a collision. We are just not that productive. :)

4) @chrisknoll, GUIDs are typically stored as CHAR in the database since they are of fixed length.

5) Local changes directly to the database we can't prevent, but it's a weird thing to do. Anything done through Atlas should create a new GUID with every save.

6) I am not proposing this as a versioning system. This is not designed to persist every save as its own retrievable cohort definition for recovery etc. This is would be just the usual cohort definition process, only with a new GUID each time. Why? Because when I export and share the cohort definition with another site, I want that GUID to get written to their database when it is saved. So that if my NLP application needs to find that cohort later, I can find it by GUID and not worry about local cohort definition ids. I also have confidence that the version they ran is exactly the one I sent. (Assuming no nefarious business going on.)

There are of course plenty of weird edge case issues that we could turn up. This could also be done with a hash of the JSON or whatever. But GUIDs are a very simple solution to this particular problem.

I hope we can work on this particular issue and versioning separately, even though they have some ostensible connection. (My apologies for referencing versioning in a confusing way in my original post.)

cgreich commented 6 years ago

Friends:

Sounds like we are arriving (before your meeting @fdefalco) at a hashing algorithm of the JSON definitions, allowing to recreate the ID even when it was lost and making sure its integrity. And we could use the cohort_id field rather than creating a new one, if instead of the hex notation we do decimal (or 128 bit binary): All we have to do is to make it bigint. That would make it backwards compatible.

gklebanov commented 6 years ago

Jon, I agree on your GUID usage - associating a cohort definition with a GUID, thus making every design unique.

And generating GUID for each save would definitely help us to avoid clashes. But an interesting approach would be to consider is to generate a new GUID as a hash generated based on file content, thus making it unique for every design rather transaction. This will also prevent people from changing design as it would change GUID immediately.

This, of course, would surface a problem of multiple identical cohort designs by one or various people.

Btw, we could also seed GUID with unique namespace (organizational URL) if we want to bring organization spin into this.

cgreich commented 6 years ago

@gklebanov:

Why is it important to know which organization created a cohort design? A design is a design. It belongs to the people. :)

gklebanov commented 6 years ago

@cgreich - I am not sure if I agree with re-using cohort_definition_id to store shareable GUID is a good idea.

a) there is a difference between primary key and unique identifier in respect that PKs are typically immutable and UIDs can change

b) primary keys are typically immutable as changing it would invalidate references to it across tables

c) should not be exposing internal system primary keys to the world as this could open this system up (vulnerability)

cgreich commented 6 years ago

@gklebanov:

Well, plus 128 bit = 16 Bytes, while bigint is only 8 Bytes. So, maybe I withdraw the idea of using the cohort_id and propose adding a GUID field. That could use the ugly hex notation.

jmbanda commented 6 years ago

While we are it, we should add some extra meta data to make the cohort definitions FAIR compatible.

On Oct 28, 2017 7:59 AM, "Christian Reich" notifications@github.com wrote:

@gklebanov https://github.com/gklebanov:

Well, plus 128 bit = 16 Bytes, while bigint is only 8 Bytes. So, maybe I withdraw the idea of using the cohort_id and propose adding a GUID field. That could use the ugly hex notation.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Atlas/issues/496#issuecomment-340197202, or mute the thread https://github.com/notifications/unsubscribe-auth/AJ8OiL6O4EvGBp9Phb5EMtNjX5S4U_R0ks5sw0FfgaJpZM4QJP0n .

fdefalco commented 6 years ago

For those who might not be familiar, like me:

http://datafairport.org/fair-principles-living-document-menu

chrisknoll commented 6 years ago

Hello, @jduke99 , and everyone else involved,

While building the current cohort definition specification is a worthy cause, but think we're doing important work on defining the functionality that we'd like to have supported. Following the "form follows function' principle... https://en.wikipedia.org/wiki/Form_follows_function ...we should lay out how the elements of the cohort definition should function and then produces the specification to specify the functionality. I'd like to summarize the current function of the existing cohort definition structure just to give us a baseline of what functionality we currently support (note: this isn't a replacement for official documentation, just some background information to guide the discussion):

Cohort definitions support the following features:

1. Local identification

Current feature specification assumes a local id generation mechanism. These are not intended to be shared across different WebAPI instances.

2. Name and description

Current feature specification provides for storage of a name and description of the cohort.

3. Created/Modified info

Current feature specification provides storage of the username/timestamp of the identity who created/modified the definition

4. Cohort Expression Type

This is an important one: cohort definitions can be materialized by many different mechanisms: CIRCE expressions (what is used in Atlas' cohort builder), custom SQL, or external source. Cohort definition field: https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortdefinition/CohortDefinition.java#L70 Expression Types: https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortdefinition/ExpressionType.java The idea here is that WebAPI cohort definitions could be materialized by different mechanisms, so we incorporated that feature into the current cohort definition table, but at this point the only expression type that you can use in WebAPI is the 'simple_expression' type (which is CIRCE).

5. Cohort Details

Since the cohort definition could be materialized in a variety of ways (theoretically), the details section of a cohort definition just contains a very large character blob to store the expression definition. There's no rigid specification about what can be in this field, except if the cohort_definition_type is 'SIMPLE_EXPRESSION' then it is assumed to be a CIRCE JSON object. However, if we wanted to allow custom sql based definitions, the current definition spec handles this, we just would have to implement an 'custom sql executor' for use in the cohortGenerationTasklet. Other ideas for a cohort expression could be a batch of R code that could be executed to materialize the cohort. The idea of this function was to allow flexibility.

So, that's the baseline of current functionality.

For the new request from @jduke99 , it looks like the functionality we need is to be able to create a globally unique identifier to allow a cohort definition to be uniquely identified across all WebAPI instances (as opposed to the current local identification structure we have today). Some notes:

If that covers the specific feature request of @jduke99 , then I think the work is localized to just altering the cohort_definition table. The only point I would push back to @jduke99 on is including the global ID into the JSON object. The issue here that (in the case of CIRCE, but would also apply to custom SQL definitions) is that there's no notion of self-identity in these JSON objects. They only contain the information that the SqlBuilder needs to generate the cohort. The requirement for a name and an identifier is at the WebAPI's cohort-definition context. In Atlas, the 'export' tab isn't exporting the entire definition, it's exporting the expression. If we wanted to alter the export functionality to be more 'definition-centric' instead of 'expression-centric', we can make that change, and the exported JSON would look something like this:

id: 12345
globalId: "be10cab0-0a60-4295-9985-c9e4f60b6f7a,
expressionType: "SIMPLE_EXPRESSION",
expression: // Encoded CIRCE JSON expression

But the actual structure of a CIRCE expression shouldn't depend on the global identification of WebAPI cohort definitions.

So, with @jduke99 new feature, I could imagine a CIRCE enhancement which you could create a "Cohort Criteria" that accepts a cohort global identifier, and part of the SQL builder would use a 'cohort global id resolver' to resolve the global ID into the local ID, and make the appropriate queries of the result schema's cohort table. So this is another use-case in support of leveraging a global identifier.

The additional idea of the 'FAIR' principle is a good idea, but also requires some additional infrastructure to support the other elements such as a searchable index, and I think generating new GUIDs on each save may violate the 'persistent identifier' principle under section F1. IE: if someone is referencing a GUID that gets updated with a new definition, there's no longer anything found at the old GUID (it's been erased). So a question I have is: do we try to make this feature request FAIR or do we live with an idea of a globally unique identifier and add FAIR principles at a later time (which may be tougher to implement in a backwards-compatable way later...)

jduke99 commented 6 years ago

Thank you all for the thoughts on this. I have come around to agree with many (all?) above that a hash might be as good or better than a GUID in the specific context that I have laid out. I am really interested in the integrity of the definition rather than it being the exact one that I created. So if someone goes in and mucks around then unmucks it, I guess no need to throw the baby (cohort) out with the bathwater (changed GUID but same actual content). I'm sure someday we'll have "locked" cohorts, but for now I'm comfortable with a hash as a solution if that is preferable to others.

ps. agree with @chrisknoll that the JSON does not need to contain the ID. Just wanted to make sure that when I send a cohort to someone it is findable by another process. A hash actually probably does this better than a GUID.

fdefalco commented 6 years ago

What if we were to use the url where the cohort was instantiated as part of the cohort definition specification instead of a guid. The combination of identifier, hostname and domain name would make every identifier unique. In terms of an example:

{
  "id": 922826,
  "url": "http://api.ohdsi.org/WebAPI/cohortdefinition/922826",
  "expressionType": "SIMPLE_EXPRESSION",
  "expression": "expression content"
}

I like the idea of adding a SHA1 hash as well.

chrisknoll commented 6 years ago

Because the url might not be reachable when you import the definition into another system, ie if we export a definition for Jon the URL we export will have epi.xxx.xx which is't reachable. If you don't intend to use the URL as a means to refer to the original definition, but only to use as some source identifier, you still run into issues where at time t0 you export the cohort definition with the URL above, and then you make a change and export it at t1, the same URL will be pointing to different cohort definitions. The nice thing about the GUID is that it's changed when you save it, so if you export it with the guid, if someone else exports the cohort definition for someone else, the guids should match if the definitoin was not changed. In fact, I don't think we should put the local cohort identifier in the export at all, that number doesn't mean anything when it's being imported into another source.

@jduke99, what if instead of generating a GUID on every save, what if instead there was an action in the UI to 'lock' the cohort definition, at which point a GUID will be generated and attached to the cohort definition. At this point, you can no longer make changes to the cohort definitoin, but you can copy it (and the copy will not have a GUID generated for it, and it will be 'unlocked'). When someone imports the cohort definition, the imported definition will only have the GUID:

name: "Treatment resistant depression"
globalId: "be10cab0-0a60-4295-9985-c9e4f60b6f7a,
expressionType: "SIMPLE_EXPRESSION",
expression: // Encoded CIRCE JSON expression

On import, a local ID will be created (by nature of import = new cohort definition) but the import will have the globalID assigned to it, so it will be imported as a new cohort definition, but immediately locked from changes.

Thoughts?

-Chris

gklebanov commented 6 years ago

I think that GUID should always be included into JSON and cohort payload when shared externally (and local Ids should not be there).

I think using URI could be another way of achieving unique key. it does not matter that the URL is not reachable, this is just to create a UID. But in this case you would lose the ability to detect code changes since that UID is static and will never change if the content is changing.

Since GUIDs are not primary keys but simply unique identifiers, it is expected that they can and most likely will change if content of the cohort is being used as hash to generate it. And it is OK. I would rather have GUID as a mandatory field. I do not think we should bind ATLAS UI as a mechanism that triggers generation of that GUID.

I am not sure what "locked" would achieve in ATLAS from code sharing perspective since not all sites use ATLAS/WebAPI to execute cohorts. ARACHNE is the platform that was created with a purpose of conducting federated studies and facilitate code sharing and execution. As the matter of fact, there is already a capability to lock code in ARACHNE when cohort is being executed across sites and it tracks code changes between runs by using hash on code content to make sure there was no code tweaking by anyone.

Saying all that, I think it would be great to have some sort of a "draft/published" flag in ATLAS/WebAPI to show that design is final and ready and to possibly prevent changes after they are published (shared).

fdefalco commented 6 years ago

I don't think having a reachable endpoint is necessary to be able to uniquely identify cohort definitions. Using a URL, instead of a GUID, would have the added benefit of providing lineage of the definition origin.

The underlying cause of the issue you describe about the definition at t0 and t1 being different is a lack of versioning, which I believe should also be addressed.

This definition could refer to the "current" version of this cohort:

{
  "name": "chronic depression",
  "url": "http://api.ohdsi.org/WebAPI/cohortdefinition/922826",
  "expressionType": "SIMPLE_EXPRESSION",
  "expression": "expression content"
}

whereas this definition could refer specifically to version 3 of the definition:

{
  "name":"chronic depression",
  "url": "http://api.ohdsi.org/WebAPI/cohortdefinition/922826/3",
  "expressionType": "SIMPLE_EXPRESSION",
  "expression": "expression content"
}

and in the case of sharing a cohort definition across the community the explicit version url could be used to be able to identify the specific version that was shared regardless of future changes to the cohort definition without a lock/publish feature requirement.

gklebanov commented 6 years ago

@fdefalco - yes, Frank - I think it could be another good alternative approach. I like that you can see where it came from etc.. But wouldn't we need to then add a version field to capture cohort revision. And there is nothing wrong with doing that - actually, I think this is another gap that we could address at some point.

fdefalco commented 6 years ago

Yes - and we don't have versioning of any elements currently so this would be another new requirement to implement and not the easiest change either.

chrisknoll commented 6 years ago

Hey, Guys, Let's not get too excited about the statement about 'reachable': the example provided was reachable so I took it at face value. I did say that even if we didn't care about reachability, we still have the issue of someone changing the definition but the URL not being different, which @fdefalco updated his proposal to try to account for that, but also introduces sweeping changes in how cohorts are referenced by version. A point I made initially about how T-C-O pairs would be addressed as T,T-ver, C, C-ver, O, O-ver, and how it's a complicated change to refer to cohorts by id+ver.

@gklebanov : I think you misunderstood what I was describing:

Since GUIDs are not primary keys but simply unique identifiers, it is expected that they can and most likely will change if content of the cohort is being used as hash to generate it. And it is OK. I would rather have GUID as a mandatory field. I do not think we should bind ATLAS UI as a mechanism that triggers generation of that GUID.

I didn't mean that atlas would manage this generation, I meant that WebAPI would have a new function to 'lock' or 'publish' that Atlas could call (ie: WebAPI/cohortdefinition/#/publish) which would create the GUID server side, and on the WebAPI route for saving (ie: PUT: WebAPI/cohortdefinition/#) saves would be rejected with error if the GUID was present.

If ARACHNE supports maintaining cohort definitions in its own system, similar checks should be made to ensure that a published cohort definition isn't modified. I think the focus of this specific discussion is the implications for Atlas and WebAPI, but we should keep it flexible so that external parties such as Arachne can adapt to it.

The proposal of GUID would be that the GUID is the 'external' identifier for the cohort definition and wouldn't require any new infrastructure to support functions like versioning...it would be a new field in the definition and logic would be added on the server to handle cases of updating, exporting and importing. We don't have to go into detail about what a GUID is, and the constraint is that people use a robust GUID generator if they are doing their own publishing (such as outside of WebAPI).

The proposal of revisions is a bit more extensive: url schemes need to be defined (such as how a domain should be allocated to a given party, letting people make their own URL identifiers runs the risk of collisions). multiple fields might be needed (a revisionID, isPublished, etc). If we want to tackle versioning so that you can recall the different revisions of a cohort definition, then that would be additional complexity.

anthonysena commented 6 years ago

Sorry I'm late to the party. My vote is to add an additional column for storing the UUID for a cohort in the cohort_definition table and constrain it to be unique. We can allow NULLs in this column for backwards compatibility. When a cohort definition is saved, we should populate this column with a new UUID each time to ensure uniqueness per @jduke99's original requirements.

Creating a HASH off of the cohort definition is also possible but feels more complex than generating a UUID which can be done in Java and R.

The UUID should then be exposed through the cohort definition JSON so that it uniquely identifies that cohort and makes it transportable. Also, it allows for other analyses to reference the population by UUID instead of the local identifier.

Also, I'd recommend that we refer to this a UUID instead of a GUID since the term GUID refers to the Microsoft implementation of the UUID standard.

jduke99 commented 6 years ago

Thank you to @anthonysena for de-Microsoft'ing my suggestion. Indeed it's a UUID.

This is a moving target, but here's my 2 additional cents:

1) I understand @fdefalco's logic for using the URL, and it could work with the versioned URL ("http://api.ohdsi.org/WebAPI/cohortdefinition/922826/3") model. But that would force us to solve versioning perhaps sooner than we want to. I think the non-versioned URL idea may have been retracted. But if not, I would say it is not a good idea and will expand only if it's still on the table.

2) I like @chrisknoll 's idea of locking a cohort definition, fixing the UUID, and when unlocked it gets a new UUID.

3) If no locking and we create a new UUID with each save, we should think about the initial import and save process. Would not want to create a new UUID then since it will immediately differ from the imported definition.

4) I like having having the UUID in the cohort definition JSON.

Happy Halloween!!!

cgreich commented 6 years ago

Friends:

Versioning, helping UIs organize this - all this should not be part of the CDM. We can't assume Atlas/Arachne and their choices are the only ones places where this happens. The OMOP CDM is generic and public. If you want versions - do that off of the UUID in your respective system.

Which also means that the UUID or URI should not contain information about any concrete instances. It should be an ID of a definition.

Which is why I like the idea that the UUID should be derived by a predefined algorithm from the JSON/SQL/blob.

Which also means that I agree with @chrisknoll that the UUID should not be part of the definition. Otherwise, you change the definition but forget to change the UUID and create a conflict.

But: Whatever you come up with: Please turn it into a proposal for the OMOP CDM WG.

fdefalco commented 6 years ago

Ah, but @cgreich, we don't store cohort definitions in the OMOP CDM.

cgreich commented 6 years ago

@fdefalco: Right. And now you guys want to store a UUID that identifies a definition in there, correct? In a way, it is new territory. Because so far, OMOP CDM was a closed thing. No references to something outside. But with the Cohorts this makes sense to me.

fdefalco commented 6 years ago

No, I do not believe we would put a UUID into any of the OMOP CDM tables.

cgreich commented 6 years ago

@fdefalco:

Any more insight why not, or you just don't "believe"? :)

fdefalco commented 6 years ago

@cgreich and I discussed this a bit offline but for the sake of the discussion - today there are tables named cohort and cohort_definition in the CDM schema, however they are not used directly by the WebAPI as the WebAPI saves cohort definitions to the OHDSI schema and cohorts to the Results schema. There is a proposal to formally remove the tables from the CDM schema however that has not yet been accepted. This doesn't change the conversation about having a UUID to identify cohort definitions but if we were to leave these tables in the CDM schema then we would have to consider the impact to that schema as well.

jduke99 commented 6 years ago

+1 for removing cohort / cohort def from CDM schema

cgreich commented 6 years ago

Friends:

Discussion died out, but it is a good one. Do you want to bring it into the CDM WG and make an official proposal, including the GUID thing?

jduke99 commented 6 years ago

Sorry to let this one fall off. @cgreich / @fdefalco @anthonysena @chrisknoll Do you guys think this is more of a CDM WG thing or an Architecture WG thing? Seems more Architecture to me. Thoughts?

ps Congrats on 2.3.0!

gklebanov commented 6 years ago

@jduke99 - I agree with you. As we are planning to define a a standard payload for study design and data exchanges, it would definitely fit into that area. UUID would then simply be a reflection of that standard into OMOP CDM. Never mind UUID is not related to data but rather to design artifacts

chrisknoll commented 6 years ago

I don't think this is either CDM or architecture: my thoughts on this was that it would amount to a new field in the CohortDefinition, which has some rules about when it gets generated (such as introducing the notion of 'locking' the cohort which generates a UUID when locked, and removes the UUID when unlocked). Another consideration is the current export is focused on the inner expression which wouldn't contain this UUID, so we might want a higher-level export of the definition which includes the UUID for the embedded expression.

But, the progress on this stopped when it was suggested that there needed to be some sort of specification around cohort definitions and concept sets. If we don't need to wait for that, we should be able to move ahead. Otherwise, we're waiting until that gets resolved.

pavgra commented 6 years ago

@jduke99, @chrisknoll, @fdefalco,

We would like to work on the feature. Summarizing all discussions above, seems like final reqs are following:

  1. Generate UUID on entity create
  2. Single interceptor for UUID generation for all entities (via Hibernate events). At this point we'll apply it for Cohort Definitions only, but later, if it becomes required for another entities, there will be no need in code duplication.
  3. Add new field in Export JSON for Cohort definition

But, anyway, @cgreich, mentioned reasonable point, that e.g. if we want to use the UUID in some other app, which imports entities from multiple Atlases, the field cannot be trusted.

gklebanov commented 6 years ago

@pavgra , please keep in mind that the whole point of UUID is to have a unique identifier not only for cohorts in ATLAS/WebAPI, but also - in the future - for cohorts originating from any other system. It is just a first step in the process of defining a common cohort definition exchange standard. And eventually to extend to other entities, if decided.

But I feel that there are multiple somewhat colliding ideas here - having an unique immutable identifier (that is not changing with content changes) vs. having a way to identify a cohort instance (e.g. it would be changing every time content changes). Personally, I was looking for #1 (immutable), where I feel that versioning could be achieved by another mechanism or a field.

Let's document our understanding of requirements (e.g. what we are trying to achieve) and a resulting solution (how it will be implemented) and propose it to the group first, probably in the interactive call?

cgreich commented 6 years ago

@gklebanov:

I think the issue is not versioning, but unique identity. Remember: People can exchange them any mechanism, and if you got the same cohort/protocol from multiple directions you don't want to duplicate them.

gklebanov commented 6 years ago

@cgreich

I think the issue is not versioning, but unique identity

I agree 100%. e.g. to have UUID to locate that cohort anywhere regardless of a version.

If it exist, then use another mechanism (last edited date time stamp, updated by) to assess the version and update it if required.

chrisknoll commented 6 years ago

@pavgra, I made a pretty specific description of the use of UUID above, but your summary doesn't follow that:

Generate a UUID on entity create

The way I described it: a UUID would be generated when you lock the content. There's no point in having a UUID that returns one form of a cohort in one env and another form of the cohort in another environment. The idea is: if you create a new cohort, you can lock it and generate a UUID on it so that you can share it. When someone imports the cohort definition, the UUID gets brought in and the cohort becomes locked. If you want to change the cohort, you unlock it, and that clears the UUID. If you re-lock it, a new UUID is made. I probably should make a activity diagram to describe the process, but creating a UUID to uniquely identify a cohort doesn't really help if the cohort can change while maintaining the same UUID.

Single interceptor for UUID generation for all entities (via Hibernate events). At this point we'll apply it for Cohort Definitions only, but later, if it becomes required for another entities, there will be no need in code duplication.

More like a generic 'locking' mechanism for all entities if you agree to my former statements.

Add new field in Export JSON for Cohort definition

Yes, this we agree on.

But beyond that, if we have this 'locking' approach, we need to modify some of the UI behavior to not allow saves when a resource is locked. Backend REST endpoints should also deny saving if the resource has the UUID set.

pavgra commented 6 years ago

@chrisknoll, but the thing you are talking about sounds more like hash(ing function), isn't it?

@gklebanov, @chrisknoll, if I am going to use the UUIDs received from sources not managed by me (e.g. Atlas at client side), what is guarantee that the resource is trusted? That UUID generation at their side works correctly, etc?

gklebanov commented 6 years ago

@pavgra, @chrisknoll - this is why I mentioned that there are two potentially conflicting requirements and philosophies here, that we might need to break up and handle separately:

1) Create UUID the moment cohort created, it is independent on content. It is immutable, e.g. does not change, if cohort content is changing. It is a field that allows us to uniquely identify cohort across any site. It is possible to have two cohorts with the same content created by different authors, the purpose here is not to prevent that but to uniquely identify and track a shared artifacts. There would be different fields to handle versions and thus allow us to identify when that is the same original cohort but a different version and potentially different content (could be a simple hash-based field on that same cohort, just like we do it today in a similar use case).

2) Create UUID at the moment it is shared (e.g. published). it is dependent on content (generated by hashing) and thus mutable. It can used to prevent duplicates. But it cannot be used to track shared cohors as it would change with every content change.

Personally, I care about tackling use case #1 first, then #2.

Another factor we need to consider - UUID logic/implementation should not be tool specific and other cohort vendors (and we are not one of them) should be able to implement it. Again, this should be a first step in enabling us to create a standard around cohort exchanges, regardless what tool it would be coming from.

I propose Pavel and I will try to document the requirements and logic first (using information from this thread) and share it to get a consensus on what exactly we are trying to implement here - before we dive into technical solution for that?

gklebanov commented 6 years ago

and btw, apologies - to clarify - I am looking at the original @jduke99 submission - his use of UUID was more around the use case #2 as well. My comments on UUID and the use of it, in this and other threads, was targeting the use case #1. If we decide to tackle both, so be it - we will do that, but let's just document the requirements, have a clear proposal and then how we will do that and make sure all parties in this thread are on the same page.

chrisknoll commented 6 years ago

@chrisknoll, but the thing you are talking about sounds more like hash(ing function), isn't it?

It is not. I'm not proposing that these values get generated by content. They are randomly generated as @jduke99 described, and are virtually guaranteed to be unique when you generate. I'm suggesting that when you generate the Identifier, the content should become immutable so that anyone importing that definition will have the UUID taged to it, and you can use that UUID to reference cohorts in an analytical study. I've already thought about how this might work in CIRCE. But none of this works if we have a cohort with a UUID abc-def in one env and it's not the same definition abc-def in another environment. Yes, people can be naughty and change the raw data under the UUID, but I'm not suggesting we make everythign tamper proof, I'm just suggesting a way to share cohorts and reference them in analysis queries.

If you create identifiers by content then you run the risk of 2 separate definitions having the same identifying value, and that would be a conflict, so we don't want that.

@gklebanov, @chrisknoll, if I am going to use the UUIDs received from sources not managed by me (e.g. Atlas at client side), what is guarantee that the resource is trusted? That UUID generation at their side works correctly, etc?

I'm not sure what is here to trust: when you import the definition with the specified UUID on it, you're just locking that cohort definition to that UUID. If you find out it's an error or was tampered with, you just delete your local entity and re-import the validated one. Validation of entity sharing I think is another (but important) issue to tackle.

cgreich commented 6 years ago

@chrisknoll:

Why wouldn't you create the UUID from the content? That way you handle the uniqueness requirement, the locking requirement and the trust requirement. Hash the damn thing. If we want to make sure trivial changes (adding a whitespace) doesn't immediately create a new UUID we could have the hashing mechanism use only non-whitespace, or something like that.

chrisknoll commented 6 years ago

Because it wouldn't be a Universally Unique IDentifier if two different things resolved to the same identifiers. My understanding about idea of introducing a UUID was that it would be used to uniquely identify something. If that's not what we're doing here, I'm in the wrong thread.

Maybe I'm missing something..perhaps an analogy: If two people are named 'John Smith' are they the same person? If two cohort definitions hash to the same hash (which is very different from a UUID) are they the same cohort definitions?

cgreich commented 6 years ago

Ah! I see what you mean. I thought there are hashing algorithms that make sure that no two people are called John Smith. Or the likelihood of that is less than 1/number of all molecules in the universe or something.

anthonysena commented 1 year ago

Closing this issue due to age. For reference, when cohorts are referenced in network studies the circe-be JSON is included to translate it to SQL for execution which makes it fully specified. Sharing cohorts in a phenotype library is an on-going effort within the Phenotype Working Group