Atlas 2.13.0 returns wrong values for RC/DRC and PC/DPC

TomWhite-MedStar commented 12 months ago

Expected behavior

When you use Search for a concept and select a data source, Atlas will show the correct counts for RC/DRC and PC/DPC.

Actual behavior

When we first select a new datasource (let's call it source A) after doing a Search (or anywhere that RC and PC are shown - including in global concept sets or those within cohort definitions), the correct query will be run (confirmed by reviewing developer tools for API calls and monitoring queries issued to database), but once the cursor stops spinning, Atlas will show wrong values (sometimes 0, sometimes wrong values) for RC and PC for those concepts. If you then switch to another data source (let's call it source B) and then switch back to source A, the query will not be re-run (confirmed by developer tools and database logs), but Atlas will then show the correct RC and PC counts for those concepts. Those correct values are stored in the results.achilles_result_concept_count table - but Atlas displays the wrong values until one switches away from and then back to the desired data source.

This has been an issue since 2.13.0 was released.

Thus, it appears that there is an internal caching issue. The right WebAPI command is called to trigger the query to get the RC and PC values, but those final values are not then available when the DC/PC results are displayed until the user switches away from the data source and then returns to it.

Steps to reproduce behavior

We have confirmed this repeatedly on our 2.13.0 instance, which happens to be connected to Databricks. Since we can confirm that the appropriate queries against Databricks are run (by reviewing its query logs) and that the right row counts are returned, I doubt this a Databricks-specific issue. Atlas also worked fine for Databricks prior to version 2.13.0.

Update on 1/1/2024

With the additional information, we think the steps to reproduce are as follows:

Clear cdm_cache (truncate) so that there are no cached record counts in the cache table.
Perform a vocab search (or open a concept set that will fetch included concepts) which will trigger a row count lookup on server
Observe that the record count REST endpoint is invoked, and this should lead to a record fetch over to the CDM source's acchilles_results_concept_count.
Observe that the RC/DRC columns should appear as 0 (provided this bug is happening as we expect). We should check the REST payload to see what the values were in the response. If they are 0, then the UI is presenting the info that it was given. If it is not zero, then the UI is presenting zeros when the payload has actual values for some reason. (See below: Tom reports that the counts did come back as 0 on the first request)
Switch the record count datasource to another source.
Then switch the record count datasource back to the origional data source. We should see the proper RC/DRC values in the table. Check the REST payload to see if it returned the correct values.

Additional note: @TomWhite-MedStar did provide the info about the 'first cache' request:

When I review the Response record for those queries, it matches what I see on the screen -- e.g. when I first call that newer data source (https://dev-ohdsi-atlas.medstar.net/WebAPI/cdmresults/PHI7Y4M20230612/conceptRecordCount), it lists the correct concept IDs in the request payload, but the response JSON shows only 0 values.

So, the part of the code we should focus on is where the requested concept Ds are fetched from the CDM, and after the counts should be inserted into cdm_cache, it should return those non-zero values back to the service request. But, according to what @TomWhite-MedStar reports above, it does not...the REST request returns 0s.

@konstjar : this is just FYI.

chrisknoll commented 12 months ago

I'll have to dig into this, the record count caching has been a continuous problem since 2.10.

I'm fairly certain that the only table that matters is the cdm_cache table. We fetch the results from the results.achilles_result_concept_count table to cache into the WebAPI database, but I'm fairly certain that we always read from the cache.

Can you confirm a few things:

1) What are the values in your webapi.cdm_cache table? Do they match the values from the results.achilles_result_concept_count? 2) Is the underlying CDM data changing in some way between refreshes? I'm trying to understand how you would get the wrong values when you first load the page from source A, but switching to B and then back to A returns the correct ones.

TomWhite-MedStar commented 12 months ago

@konstjar , you have looked into this for us. Can you answer Chris's question or provide more insight about what you have tested and your theories of what is happening?

I believe one of the tests we did was to clear all of the caching tables and restart WebAPI, without success.

chrisknoll commented 12 months ago

I believe one of the tests we did was to clear all of the caching tables and restart WebAPI, without success.

That might explain some of it: if you clear the cache tables, restart webapi, the warmup process will attempt to refresh the cache. However, if you execute a search during that process, the cache will be loaded for just the requested concepts. I'm not sure how that would introduce a failure, but there are a few 'moving parts' in that case that might be more complicated to trace.

chrisknoll commented 11 months ago

Doing a little research on my side to see what happens in our own environment.

One thing that's happening is that when I do a concept search and get a table of concepts, the UI is showing the selected datasource as the first in our list of sources alphabetically, but the actual record counts are comming from the 'default' vocabulary:

This shows that Actelion is the source of RCs (but these RCs are from our JnJ Network):

Switching to another source:

Note we get timeout here because this source doesn't even have the achilles record count table so it results in a timeout error vs. zeros.

Switching back to the original one (Actelion):

When the switch was made, the request was issued, but Actelion is also a source where we don't have record counts, so we get same result (timeout in the RC column).

Finally, switching back to our default vocabulary where we know we have RCs:

These are the 'correct' record counts from our default datasource.

So, is the problem a possible UI sync issue where it is not defaulting to the 'default vocabulary' by the default in the source dropdown, but those RCs are comming, and then as you switch the source selection, it is properly loading the RCs but just not from the source you think it is (ie: your default vocabulary is not the vocabulary that is selected in the source)?

TomWhite-MedStar commented 11 months ago

@chrisknoll , we're seeing something similar. Regardless of which data source we specify should be used for vocabulary and record counts (in configuration), Atlas shows one of the first sources we registered, but appears to query off of a different source.

When I use Developer tools, I can monitor the API calls like below. For reference, our default database for vocabularies and record counts is called PHI7Y2M20230424.

If I search for "sarcoma", I get:

However, the View record count is for an ancient one:

When I switch to the default data source: ... none of the values change - so that initial query seem to query the default, but showed the wrong source in "View record count for:"

Trying something new (searching for lipidosis), all the initial values for RC/PC are 0, and it shows that same wrong initial data source. When I switch the display to the correct data source, I see reasonable values. When I then switch to a newer data source, I see all zeros again. However, if I switch back to the default data source, and then back to the newer source, now the #s are non-zero and make sense.

When I review the Response record for those queries, it matches what I see on the screen -- e.g. when I first call that newer data source (https://dev-ohdsi-atlas.medstar.net/WebAPI/cdmresults/PHI7Y4M20230612/conceptRecordCount), it lists the correct concept IDs in the request payload, but the response JSON shows only 0 values. However, when I review the next time that exact same API is called, I see non-zero values in the JSON response.

So, I think there may be two problems that are interacting here.

chrisknoll commented 11 months ago

That is realllllly wierd that the exact came endpoint results in different concept counts (0s in one call, non-zero in another). I need to think about how that might happen.

The issue where the dropdown is showing the wrong value, I think that's just a matter that the 'selected RC datasoruce' is not defaulting to the default vocabulary. That is something that should be known at application startup, and then assigned to an 'application variable' and that dropdown should bind to it and let you change the RC configuration. I'll try to track own the original dev who worked on that behavior and see if it can be fixed.

But the main focus of this issue should be the different results from the same endpoint call. That is a WebAPI concern for sure....the other I would consider an Atlas bug so an issue can be opened up over there to address the UI problem.

TomWhite-MedStar commented 5 months ago

@chrisknoll - @konstjar just upgraded our instance to Atlas 2.14.1 and WebAPI 2.14. Unfortunately, this behavior is still present.

To summarize, when we first query data source A (e.g. to see RC/DRC for a concept set), we see all values listed as 0. If we either refresh the page, or switch to data source B and then back to data source A, then we see the correct RC/DRC values.

When we monitor the query logs in Databricks, we see a database query when we first query data source A (even though only values of 0 appear on the browser). When we switch to a different data source and back again, there is no repeat query to data source A, but now the correct RC/DRC values do appear. Note that we don't need to wait for the query against data source B to complete - we can switch immediately back to data source A and then the correct values appear.

The only other nuance might be that we are required to use Microsoft Edge. However, it uses a version of Chrome that is > v.63, so should be supported.

chrisknoll commented 5 months ago

Thanks for the update, and I'm grateful for the additional information. So, if you're seeing the initial query to source A, then the code is fetching all the concepts it doesn't have in cache, and after fetching them it should save to the WebAPI db, and then return the counts from the cache table (which is in WebAPI). I understand you're saying that initially returns all 0's. When you switch to the different datasource and back to A, it should now have found all requested concepts in the cache, which means there's nothing to fetch from Source A so it seems like that part is working and you are getting counts from the webapi cache. What's confusing to me is that it should always be returning coutns based on the WebAPI table, it's just that the first time it inserts into the table. But in both cases (needed to fetch from source vs. no fetch required just return the counts) it reads the counts from the webapi cache.

So the core of the issue is the first cache request: when it identifies there are no concepts in the webapi cache, and it needs to fetch the concepts from the source, why is it returning zero for everything that first time, but then subsequently it DOES find the records in the cache and return those? At least we have steps to analyse and determine if we can reproduce.

chrisknoll commented 5 months ago

I've added updated steps to reproduce into the issue report, hopefully we can have this reproduced.

TomWhite-MedStar commented 5 months ago

@chrisknoll - Attached is a video showing how to replicate the behavior I am seeing.

https://github.com/OHDSI/WebAPI/assets/75700951/c19adfc4-2e50-486b-a87b-0f32ae491f70

TomWhite-MedStar commented 5 months ago

@chrisknoll and @konstjar , another option to consider is to add a configuration option to not use the cache at all for these searches. Since the data are pre-computed in the Achilles tables, it may be faster to simply call the database each time.

That would be a workaround rather than a bugfix, but may help some users (including us).

anthonysena commented 1 week ago

Linking to #2274 so we can have some options for controlling the cache to see if this helps.

OHDSI / WebAPI