OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data
http://atlas-demo.ohdsi.org/
Apache License 2.0

Cohort references in Estimation #261

Closed. anthonysena closed this issue 7 years ago.

anthonysena commented 8 years ago

If a cohort is deleted from the system and it is referenced in an estimation project, the estimation project will fail to load.

pbr6cornell commented 8 years ago

To reframe this slightly:

Let's consider that, instead of holding an external reference to a cohort id, selecting a cohort embeds the cohort and its associated concept sets into the 'estimation' project, so that it's fully self-contained (akin to how concept sets get embedded into cohort definitions). Then, when we write the R code, instead of referencing the cohort id, we can instantiate the SQL for the cohort definitions, place the results in a temp table, and reference them in the rest of the package, like we've done in prior studies. This will make the package transportable across instances.
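For illustration, a minimal sketch of what the generated study package could do, using SqlRender and DatabaseConnector as other OHDSI study packages have; the package name, SQL file path, and SQL parameter names below are assumptions:

```r
# Sketch only: assumes the package ships the generated cohort definition SQL at
# inst/sql/sql_server/target_cohort.sql and that the SQL expects the parameter
# names shown here; the package name "MyEstimationStudy" is hypothetical.
library(SqlRender)
library(DatabaseConnector)

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "ohdsi",
                                             password = "secret")
conn <- connect(connectionDetails)

# Read the embedded cohort definition SQL and fill in its parameters.
cohortSql <- readSql(system.file("sql/sql_server/target_cohort.sql",
                                 package = "MyEstimationStudy"))
cohortSql <- render(cohortSql,
                    cdm_database_schema = "cdm",
                    target_database_schema = "scratch",
                    target_cohort_table = "study_cohort",
                    target_cohort_id = 1)
cohortSql <- translate(cohortSql, targetDialect = "postgresql")

# Instantiate the cohort into a study-specific table rather than relying on
# a cohort id that only exists in one Atlas/WebAPI instance.
executeSql(conn, cohortSql)
disconnect(conn)
```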


chrisknoll commented 8 years ago

I think this is a slippery slope: if we do this for estimation, will we do this for IR as well? Imagine a 12x18 T-to-O IR calculation project; I'd have to manage importing 30 cohorts? Sounds painful. And if I was using those same cohorts as inputs to my estimation analysis, I'd have separate copies of these cohorts, making it a bit harder to keep them in sync.

Your reference to concept sets being embedded into cohorts is analogous to what you're proposing here, but I think there are important differences in the implications.

The CohortMethod tools allow you to provide a 'cohort table' specifically for this purpose; it just so happens that everything kicked off via the WebAPI assumes the cohort table is {results_schema}.cohort. It seems that CohortMethod was built to support this kind of use case.

If we need transportability, we can provide an export facility that wraps dependent cohorts and settings into a type of package file, and then we can separately have an import tool that imports this package into individual cohort definitions and analysis projects. That's what I can think of off the top of my head, but I'm sure a little more planning could make it more robust.

-Chris

anthonysena commented 8 years ago

So this bug came about because I had deleted a cohort that was referenced in an existing estimation project, and the code does not handle this properly. More specifically, the cca table in the OHDSI schema had a cohort ID reference that would not resolve, and that broke the estimation component. I still like the idea of referencing the cohort by ID in the cca table so that changes to the cohort can be dynamically loaded into the estimation project.

That said, in Estimation -> Utilities -> Export, I'm writing out the full cohort definition as part of the JSON export; this way we port the cohort definition along with the estimation specification. However, when importing the estimation, I'm not attempting to create a new cohort definition based on the specification, which limits the ability to port the definition from one environment to another (new bug/issue).

Furthermore, the R code that is produced currently allows you to specify the table(s) where your exposure and outcome cohorts reside, along with their identifiers, the intent being that people may have their cohorts stored in tables other than {results_schema}.cohort. It should be possible to generate the SQL that is used to build the cohort and embed it as part of the R package (perhaps externalized in a separate companion function) so that someone could decide where to put their cohort if they don't want to use the default {results_schema}.cohort table.
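For example, here is a sketch of how the generated R code could point CohortMethod at a cohort table other than the default; the argument names follow recent CohortMethod releases, and the schema names, table name, cohort ids, and connection details are placeholders:

```r
# Sketch: pull estimation data when the exposure and outcome cohorts live in
# scratch.study_cohort rather than the default {results_schema}.cohort table.
library(CohortMethod)

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql", server = "localhost/ohdsi", user = "ohdsi", password = "secret")

cmData <- getDbCohortMethodData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  targetId = 1,                      # exposure (target) cohort id
  comparatorId = 2,                  # comparator cohort id
  outcomeIds = c(3),                 # outcome cohort id(s)
  exposureDatabaseSchema = "scratch",
  exposureTable = "study_cohort",
  outcomeDatabaseSchema = "scratch",
  outcomeTable = "study_cohort",
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings())
```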

chrisknoll commented 8 years ago

OK, to prevent the improper deletion of a cohort, we could set up a FK relationship in the database that fails the deletion of a cohort if it's found as the cohort_id in the estimation table. Then we can catch the error and report a message to the user (when they try to delete the cohort) that it's in use, and maybe even go a step further by giving a report of where the cohort is in use (Estimation A, B, C... IR Analysis D, E, F), etc.
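A rough sketch of that idea, with the DDL written as SQL and run through DatabaseConnector purely for illustration (in WebAPI this would presumably live in a schema migration); the constraint name and the cca/cohort_definition column names are assumptions:

```r
# Sketch: refuse to delete a cohort definition that an estimation (cca) row
# references, then catch the failure and surface a friendly message.
# Table/column names (ohdsi.cca.treatment_id, ohdsi.cohort_definition.id) are assumed.
library(DatabaseConnector)

conn <- connect(createConnectionDetails(dbms = "postgresql",
                                        server = "localhost/ohdsi",
                                        user = "ohdsi",
                                        password = "secret"))

executeSql(conn, "
  ALTER TABLE ohdsi.cca
    ADD CONSTRAINT fk_cca_treatment_cohort
    FOREIGN KEY (treatment_id) REFERENCES ohdsi.cohort_definition (id);
")

# Deleting an in-use cohort now fails at the database level; report it to the user.
tryCatch(
  executeSql(conn, "DELETE FROM ohdsi.cohort_definition WHERE id = 123;"),
  error = function(e) {
    message("Cohort 123 is referenced by one or more estimation analyses and cannot be deleted.")
  }
)
disconnect(conn)
```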

anthonysena commented 8 years ago

I like that approach - it would be good to ensure that someone understands the impact of the deletion and to also give us some referential integrity between these entities.

fdefalco commented 8 years ago

At the risk of opening a can of worms, I would suggest we might want to avoid hard deletes and overwrites of all OHDSI assets. This would mean only 'soft deletes', where things are marked as 'archived', and versioning of our assets such as concept sets, cohort definitions, and estimation projects, instead of overwriting their state in the OHDSI repository.

Being certain that all research is reproducible, because we never delete an asset or have to question what state a definition was in when a particular analysis was executed, would be an improvement over our current situation, where we can find ourselves unsure of the state of things when the analysis was run. I understand that introducing soft deletes and versioning is a MUCH BIGGER discussion and would involve a lot more work than resolving this issue; still, I thought it was worth pointing out my opinion here.
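As a rough illustration only, a soft-delete scheme at the schema level might look like the sketch below; the column names and SQL are assumptions, not a proposal for the actual WebAPI schema:

```r
# Sketch: instead of DELETE, mark the asset as archived and keep a version number,
# so prior states remain available. Run via DatabaseConnector for illustration only.
library(DatabaseConnector)

conn <- connect(createConnectionDetails(dbms = "postgresql",
                                        server = "localhost/ohdsi",
                                        user = "ohdsi",
                                        password = "secret"))

# Assumed columns; real versioning would need more design (history table, audit info, etc.).
executeSql(conn, "
  ALTER TABLE ohdsi.cohort_definition ADD COLUMN archived INT DEFAULT 0;
  ALTER TABLE ohdsi.cohort_definition ADD COLUMN version INT DEFAULT 1;
")

# A 'delete' becomes a soft delete; the row (and its history) is retained.
executeSql(conn, "UPDATE ohdsi.cohort_definition SET archived = 1 WHERE id = 123;")
disconnect(conn)
```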

t-abdul-basser commented 8 years ago

@fdefalco I can of course see the utility of adopting this approach, but I agree that before adopting such a policy we would need to discuss it, as the design and implementation issues are non-trivial. Perhaps in a future Arch WG meeting?

anthonysena commented 7 years ago

I've committed some new code to address this issue. Essentially, this makes Atlas & WebAPI more fault tolerant in the event that a dependent cohort or concept set reference is removed from the system. Now, instead of the interface simply hanging, the user will be able to load the estimation specification properly, and the interface will provide a warning that the cohort/concept set is no longer available.