Re-use generated cohort across analyses

pavgra commented 5 years ago

As we implemented self-sufficient analyses which can be executed on their own w/o non-transparent pre-requirements (e.g. a need to generate cohorts before executing IR), we can focus on performance and introduce repo of shared analyses results which could be utilized as cache and let us running e.g. cohort generation only when there are no results for a given cohort design (hash) yet.

chrisknoll commented 5 years ago

I'm glad we are coming around to addressing a performance deficit that was introduced with these changes (eg: cohorts are always re-generated when executing IR, a behavior that did not exist prior).

There is one important detail that hasn't been addressed since we introduced generations based on 'design snapshots': cleanup. part of this work should implement proper cleanup procedures so that valuable resources can be recovered after a design is no longer needed/referenced/wanted. When proposing your enhancements, be sure to address resource cleanup in your proposal.

pavgra commented 5 years ago

Requirements:

Results for the re-usage must be stored in a database in relational, normalized format (e.g. no JSON / JSONB or other NoSQL / JCR data structures) since cohort records are used by different types of databases and environments (R / Java / SQL)
The mechanism must ensure consistency at a database level due to usage by different environments. I.e. R script should not be able to remove half of the cohort records while Atlas would still think that results are cached and could be re-used
The mechanism should be generalized enough to support extension for different types of results, i.e. not only cohort records
The mechanism should support caching by design to allow the following workflows:
- There are two different cohort definitions (two different records with different IDs) in the system which define the same design. If the first definition has already been generated and its results were persisted, there is no need to run generation from scratch for the second definition
- Some analysis utilizes an older version of cohort, cohort which has been generated but later was changed - analysis should be able to use cache
The mechanism should provide the ability to clean-up outdated results:
- Manually erase results that have been cached before a certain date
- Manually erase results that have been cached for an entity which was updated more than N times since the caching occurred
- Scheduled erasure of cached results
The caching should be unified for different entities: e.g. now we have isolated generation in IR, Pathways, CC, and non-isolated generation in Cohorts

anthonysena commented 5 years ago

The mechanism must ensure consistency at a database level

I'd also add here that this mechanism should globally prevent a cohort generation when it is being used during an analysis (i.e. in IR, Pathways, CC). Presumably a user should not alter the contents of the target table that holds the subject_ids otherwise it could corrupt the analysis. I think that is what is implied here but would like it explicitly stated.

The mechanism should be generalized enough to support extension for different types of results, i.e. not only cohort records

I agree with this in principle provided we focus our efforts here on cohort re-use. If we identify ways to generalize the approach for other areas, I'd suggest we think about those for the 3.0 release.

pavgra commented 5 years ago

I'd also add here that this mechanism should globally prevent a cohort generation when it is being used during an analysis (i.e. in IR, Pathways, CC).

Yep, that's the core idea of the issue

Presumably a user should not alter the contents of the target table that holds the subject_ids otherwise it could corrupt the analysis.

I was thinking of this one and would suggest another approach: detect if the records were modified and invalidate the cache/lock record in case of such event. Otherwise, you'd need to limit user's ability to work with the table, which I believe is not the right way - data is the primary thing, the cache is secondary.

I agree with this in principle provided we focus our efforts here on cohort re-use.

Right, the idea is to define a general interface but work only a one, cohort related, implementation.

anthonysena commented 5 years ago

detect if the records were modified and invalidate the cache/lock record in case of such event.

I'd have to understand how this detection would happen. The cohort table is managed by WebAPI - if the records are modified, it would be controlled through WebAPI. No outside processes should modify that table. It feels like we need a semaphore - we're locking the records in the cohort table (for a given cohort ID) while another process depends on them. Is this what you are proposing with the cache/lock record?

pavgra commented 5 years ago

No outside processes should modify that table.

This is about item 2 from my list. You cannot guarantee that this "should" is respected. Therefore, all restrictions should happen at the DB level.

I'd have to understand how this detection would happen

I thought of simple DB trigger

gowthamrao commented 5 years ago

This was probably discussed, and we should mark it as a known limitation. If incident rate, CC, Estimation or prediction are dependent on the cohortId of the local instance of Atlas -- then those when ported to another instance of Atlas will not run - because cohortID's may not be the same. We just need to make sure the user is aware of it.

Another issue is the cohort name's. It seems to me the JSON object of IR, CC, Est, prediction has the cohort name's buried in it. I would rather prefer doing a WebApi call for the cohort name - everytime the IR, CC, est, pred loads in Atlas. Several times, I have changed the name of the cohort, and it does not propagate into the IR specification.

pavgra commented 5 years ago

@gowthamrao , please let's keep the issue dedicated to the caching question.

If incident rate, CC, Estimation or prediction are dependent on the cohortId of the local instance of Atlas -- then those when ported to another instance of Atlas will not run

This is not how it works now. They do not depend on the local IDs anymore.

Several times, I have changed the name of the cohort, and it does not propagate into the IR specification.

Just tried, it propagates.

pavgra commented 5 years ago

Proposed design

Data layer diagram

Class diagram

Sequence diagram

Cache lifecycle

anthonysena commented 5 years ago

From review with @chrisknoll and @pavgra

resultsIdentifier requires more description in the design and how it relates to cache-able entities.
Need to use design hash instead of the results identifier in the CacheGenerationService
Need to relate cache items to a sources
Discussed data integrity via a trigger on the results schema. This seems infeasible given the number of DBs supported so we would like to have further investigation/discussion around some type of CHECKSUM approach that would approximate the expected data integrity of the underlying results.

chrisknoll commented 5 years ago

For 4 (checksum), this sort of mechanism is used to verify that the first 14 digits of a number are entered correctly by adding a trailing 'checksum' digit: https://www.gs1.org/services/how-calculate-check-digit-manually

We could potentially implement something cheap an fast to digest the results records into some checksum value. I wonder if @schuemie or @msuchard have any statistical technique for generating checksums from arbitrary column data structures.

pavgra commented 5 years ago

@chrisknoll , @anthonysena , seems like I've been too skeptical regarding the hash calculation inside a database.

I've tested hash calculation for a cohort with 24k people in it:

SELECT SUM(('x' || hash)::bit(32)::bigint)
FROM (
  select md5(
    coalesce(md5(CAST(subject_id AS VARCHAR)), ' ') ||
    coalesce(md5(CAST(EXTRACT(EPOCH FROM cohort_start_date) AS VARCHAR)), ' ') ||
    coalesce(md5(CAST(EXTRACT(EPOCH FROM cohort_end_date) AS VARCHAR)), ' ')
  ) as hash
  from results.cohort
  where cohort_definition_id = 3
) list

The query executed in 142 ms using my local database (running on core i7 / 32gb of ram) which seems more than satisfactory as for me.

Hash functions are well supported across OHDSI supported DBMSes:

Oracle: ORA_HASH
PostgreSQL: md5
RedShift: md5
Impala: fnv_hash
Netezza: hash
BigQuery: md5
PDW, MS SQL: HASHBYTES
MySQL: md5
SQLite: SHA1 should be supported with the extension, otherwise - user-defined function

chrisknoll commented 5 years ago

Great. I don't think we need to worry about DB translation (in SqlRender) for this function, i think we can have dialect-specific implementations in WebAPI to calculate the hash of a group of rows/columns. This caching this is more of a WebAPI function, not an OHDSI toolstack thing, but I understand if you don't' agree with that...just suggesting a simple implementation.

Other thing I noticed about those functions is that it's a hashes a single column...which we could make it hash a group of columns together to get a MD5, but then how do we get the hash of the set of rows that belong to one specific result in the table? Is there a way to aggregate a bunch of MD5s for each row into 1 final MD5 value that we store in the the cache entry?

pavgra commented 5 years ago

Here are the diagrams updated based on the discussion points:

Data layer

resultsIdentifier is a discriminator value used to retrieve results related to the entity of interest, e.g.:
- SELECT * FROM @result_schema.cohort WHERE cohort_definition_id = @resultsIdentifier
- SELECT * FROM @result_schema.cc_results WHERE cc_generation_id = @resultsIdentifier

Class diagram

Sequence diagram

Cache lifecycle (1)

pavgra commented 5 years ago

@chrisknoll ,

Is there a way to aggregate a bunch of MD5s for each row into 1 final MD5 value that we store in the the cache entry?

This is exactly what I've shown in the given example. First, it calculates a hash for each row and then it aggregates those hashes by SUMming them into a single number (see SUM(('x' || hash)::bit(32)::bigint))

chrisknoll commented 5 years ago

Did not realize you could just sum hahes like that. That won't result in some sort of numeric overflow? what do you get when you hash a cohort results of 1,000,000 records where each hash of each row is added together?

pavgra commented 5 years ago

Yep, I've forgotten to note the potential overflow issue and didn't want to overcomplicate that example by additional steps.

The overflow problem could be avoided by applying a heuristic method which would allow lowering the end sum. E.g. do not transform the hashes into numbers directly but include an intermediate step which splits the hash into several parts:

...
SELECT sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
...

Sample:

989680 in HEX = Decimal 10,000,000 (10mil), but ->
HEX 989 (first three symbols of the HEX in step 1) = 2441 in Decimal + HEX 680 (the second three symbols) = 1664 in Decimal
2441 + 1664 = 4105

And I believe this is not the only one potential approach.

anthonysena commented 5 years ago

@pavgra per our discussion with @chrisknoll on the Atlas WG call, the design here looks good to proceed with the implementation. As discussed, one item to be aware of during the implementation relates to deleting cohorts from the system.

In WebAPI there is a Tasklet which performs cleanup of cohorts from the results schema for each data source where it was generated. This tasklet will require an addition to also clean up the GenerationCache entries as well.

If you run into any further questions during the implementation, please post them to this issue and we can set up time to discuss as required. Thanks!

OHDSI / WebAPI