OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data.
http://atlas-demo.ohdsi.org/
Apache License 2.0

Significant speed improvements in Datasources component of Atlas #399

Closed PRijnbeek closed 6 years ago

PRijnbeek commented 7 years ago

Team,

Today Lars van Halvorsen, Michel van Speybroeck, and I had a small hackathon on the Datasources component of Atlas.

The goal was to understand why the current version of Atlas, which pulls directly from the Achilles results tables, is so slow compared to the JSON approach of AchillesWeb. As you can experience yourself on the Atlas version currently on ohdsi.org, loading the reports is too slow to be of practical value. For example, building the treemap of the measurements can take up to 2 minutes, as we verified on servers with different specs. The drilldown reports are also too slow to be of real use.

A solution could be to store the JSON in the database instead of creating it on the fly, but we thought it would be better to first look into the code and the queries. We therefore ran EXPLAIN ANALYZE on the queries and visualized the EXPLAIN reports of the query components when running the SQL directly on the server.
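
For illustration, this kind of profiling on PostgreSQL looks roughly as follows; the schema, table, and analysis id are illustrative assumptions, not the exact Achilles DDL:

    -- Profile one report query directly on the server (PostgreSQL).
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT analysis_id, stratum_1, count_value
    FROM results.achilles_results
    WHERE analysis_id = 1800;  -- e.g. a measurement report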

We found the following solutions to the problem, which took the Datasources component in Atlas from 2 minutes to effectively instantaneous (<0.5 s). The drilldowns are now very fast as well. What we did to achieve this is the following:

  1. Adding indexes of course helps, but this is not done by the WebApi when creating the tables or after populating them. Adding indexes on the strata, alone or in combination with min_levels_of_separation, improves the speed, but not enough to make the tool responsive (a sketch of all three changes follows this list).

  2. For the treemaps, joins are done to build the hierarchy up to three levels deep. This is by far the most costly operation (80-90% of the query time), and we think it should be cached by creating a lookup table containing all the levels in the CDM schema (or the results schema). That way the join is done only once and its result can be re-used. We also believe this join is often needed to traverse deeper into the vocabulary in all kinds of other queries (we are not sure why only 3 levels are used, by the way?). We tried this, and the speed-up is very significant.

  3. With 1 and 2 we get very close, but one simple change made all the difference. The SQL does a lot of CASTs to VARCHAR to make the joins. This turns out to be a costly operation; big speed improvements can be obtained by casting the other side to INT instead and joining on integers.
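
A rough sketch of the three changes, assuming PostgreSQL and the standard OMOP vocabulary tables; the schema and table names and analysis id are illustrative, not our exact code:

    -- 1. Index the strata that the report queries filter and join on.
    CREATE INDEX idx_achilles_results_strata
      ON results.achilles_results (analysis_id, stratum_1);

    -- 2. Materialize the three-level hierarchy once instead of
    --    re-joining the vocabulary for every treemap request. A concept
    --    can have several direct parents, so this can yield more than
    --    one row per concept, which is acceptable for a treemap.
    CREATE TABLE results.concept_hierarchy AS
    SELECT c.concept_id,
           c.concept_name,
           p1.concept_name AS level1_name,
           p2.concept_name AS level2_name,
           p3.concept_name AS level3_name
    FROM cdm.concept c
    LEFT JOIN cdm.concept_ancestor ca1
           ON ca1.descendant_concept_id = c.concept_id
          AND ca1.min_levels_of_separation = 1
    LEFT JOIN cdm.concept p1 ON p1.concept_id = ca1.ancestor_concept_id
    LEFT JOIN cdm.concept_ancestor ca2
           ON ca2.descendant_concept_id = p1.concept_id
          AND ca2.min_levels_of_separation = 1
    LEFT JOIN cdm.concept p2 ON p2.concept_id = ca2.ancestor_concept_id
    LEFT JOIN cdm.concept_ancestor ca3
           ON ca3.descendant_concept_id = p2.concept_id
          AND ca3.min_levels_of_separation = 1
    LEFT JOIN cdm.concept p3 ON p3.concept_id = ca3.ancestor_concept_id;

    -- 3. Join on integers: cast the VARCHAR stratum to INT once,
    --    rather than casting concept ids to VARCHAR on every row.
    SELECT ch.level1_name, ch.concept_name, ar.count_value
    FROM results.achilles_results ar
    JOIN results.concept_hierarchy ch
      ON ch.concept_id = CAST(ar.stratum_1 AS INT)
    WHERE ar.analysis_id = 1800;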

We are building these hierarchy lookup tables and making the other changes for the remaining domains as well in our forked version of Atlas, to make it fast enough for demo purposes and daily use. We ran our tests on PostgreSQL servers with different specs, but we believe this will also work on the other DBMSs.

We still need to decide where to create these lookup tables (in the CDM schema or the results schema) and in which step: in Achilles R (which would also make the export to JSON a lot faster), in the WebAPI call when creating the Achilles tables, or in Atlas itself?

Any thoughts on this, or suggestions to make it useful for the whole community? Happy to have a call on this if that helps the team.

Thanks.

Peter, Michel, Lars

pbr6cornell commented 7 years ago

That's great stuff, thanks guys!

I'm totally on board with #1 and #2; they make sense. For treemaps, at a minimum, the script should create the temp table and then join to it, rather than having that nasty query embedded as a subquery. There's probably a more elegant solution than that.

Re: #3, I know this is an environment-specific problem. I originally built it with CAST to INT and it worked great in two environments (I think SQL Server and Postgres), but it fails on at least two others (I forget the specifics, but pretty sure APS and Redshift). The reason is that the column contains a mix of varchar and int values, and it comes down to where the optimizer does the casting. If it casts after you select the subset of rows you want, it works well, but if it casts before subsetting the rows, that produces a logical error. It could be handled, though, by doing the steps successively rather than in individual queries, roughly as sketched below.
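
A minimal sketch of that successive-steps approach, with illustrative names (re-using the concept_hierarchy table from the proposal above):

    -- stratum_1 is VARCHAR and holds concept ids for some analyses but
    -- other text for others, so a naive CAST can fail if the engine
    -- applies it before the analysis_id filter.
    -- Step 1: subset the rows first, without casting anything.
    CREATE TEMP TABLE measurement_strata AS
    SELECT stratum_1, count_value
    FROM results.achilles_results
    WHERE analysis_id = 1800;

    -- Step 2: cast in a separate statement, once the subset is fixed,
    -- so the cast can never touch a non-numeric row.
    SELECT ch.concept_name, ms.count_value
    FROM measurement_strata ms
    JOIN results.concept_hierarchy ch
      ON ch.concept_id = CAST(ms.stratum_1 AS INT);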

In any case, making the datasources more performant is definitely something we should push for. I agree the timing can be tedious (particularly when you've got stand-alone ACHILLES open on another tab doing the same thing much faster).


t-abdul-basser commented 7 years ago

@PRijnbeek Thanks for this analysis!

Some thoughts

fdefalco commented 7 years ago

Two thoughts

PRijnbeek commented 7 years ago

@pbr6cornell Yes, the CAST can indeed be a platform-specific thing, but the speed improvements are really very big, so we will think about this a bit more.

@t-abdul-basser

Yes indeed, I saw a branch on caching, but with caching you still have to wait for all treemaps and drilldowns at least once, which I think is not preferable either. You can lose users very quickly.

It may indeed be platform dependent, although even the internal JnJ instance on the parallel data warehouse is not optimal in speed. I have been running on a very fast new server (768 GB memory, 50+ cores), and after tuning Postgres it is still slow (although faster than the version on OHDSI.org). I think it is important that the one on OHDSI.org is fast, to convince new potential users.

Yes, indexing is a simple first step that can be added at the end of Achilles R.

The lookup table is the result of the inner 3x join; I was referring to storing that result somewhere. We could create it during Achilles R, or even already when instantiating the tables in WebAPI, since it is not (and should not be) dependent on the data. There is no need to restrict it to the concept ids that occur in the dataset, as is currently done; I think the table can be the same for all users, since it only depends on the vocabulary version.

@fdefalco: Yes, I agree, so we should think about a procedure to test on different platforms, because the general user will definitely not have the infrastructure that JnJ has.

Yes, MongoDB is possible but maybe not the fastest solution; a blob may be an option but could also be platform dependent. Having another solution for storing this could be interesting too. At Erasmus we are experimenting with Solr to store the vocabulary, and especially our unstructured text data, to test the NLP pipelines that have been developed. We indexed all the textual information and the first results are very promising: it is very fast, and we are looking at bag-of-words and other approaches we could use in our PLP pipeline for prediction.

fdefalco commented 7 years ago

Good to hear. I have a branch of ATLAS / WebAPI using Lucene (the technology behind Solr) that I'm using to replace the default vocabulary search features, and I have seen similar performance improvements.

t-abdul-basser commented 7 years ago

@PRijnbeek

Yes, a single hit for each report would be required, although this could be run as a set of batch jobs on startup.

A qualitative description of DS report performance on the JnJ PDW test system is helpful. It might be useful if we posted metrics for our various test environments, to facilitate comparison and agreement on the response time ranges that constitute acceptability. For now my rough idea is 'as fast as or a bit slower than the Heracles visualizations', but I concede that this needs to be quantified :)

Ah! In that case I would expect 1) that the lookup tables would be placed in the results schema, since they support results-related functionality (visualization of data source characterization/summarization stats), and 2) that they would be created and populated after results generation in the Achilles R modules.

@fdefalco @PRijnbeek Perhaps a two-phase effort: 1) introduction of the specific indexes and lookup tables (in Achilles R) as a first step, and 2) a Solr/Lucene-based solution as the next (though potentially in-parallel) phase?

gklebanov commented 7 years ago

It is great to hear you have improved this performance so much! I am actually aligned with Frank (@fdefalco) on a few of his comments:

I think a call to brainstorm these would be great.

PRijnbeek commented 7 years ago

@gklebanov

I agree that better mechanisms to host the vocab (and definitely the unstructured textual information that will be added soon) need to be explored.

SOLR or Lucene is something to explore. SOLR has some nice NLP features out of the box, scales very nicely, and has an R package for communication (I will be trying this out in the next weeks).

However, for now we really need a solution, because it is simply not possible to demo the data sources tools or get people to use them when they are as slow as the current version (also true on the OHDSI website). We need to demo it soon to some important future users, so for now we will fix it as proposed, using Achilles R to populate a lookup table in the results schema.

gklebanov commented 7 years ago

@PRijnbeek - absolutely - 100% agree. Get it working and then optimize; best approach.

t-abdul-basser commented 7 years ago

@PRijnbeek Agreed. That makes sense. I would also propose that we ask @fdefalco to add this to the Architecture WG discussion agenda in the near future, perhaps the meeting after next.

cgreich commented 7 years ago

@PRijnbeek, @gklebanov:

Are you guys ganging up or something? :) Greg has been singing this song into my ear for the longest time.

chrisknoll commented 7 years ago

One thing that we weren't able to do with the raw JSON approach of AchillesWeb was to apply a runtime filter on Achilles results, limiting the results to the concepts in a particular concept set expression (for example). I think there are other cases where dynamic queries on the results data would be useful. When we moved to real-time queries of Achilles results in Atlas, that was one of the doors that opened.

If the results are migrated from database tables into document storage, is there an equivalent mechanism that lets someone query for results based on additional query parameters, and what is the effort to enable that use case if it is possible?
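
For reference, in the relational setup such a dynamic filter is a simple join. This is an illustrative sketch in which the resolved concept set is modeled as a plain table of concept ids; the names are assumptions, not the actual WebAPI schema:

    -- Restrict Achilles results to the concepts of a resolved concept
    -- set at query time; static JSON files cannot support this.
    SELECT ar.stratum_1, ar.count_value
    FROM results.achilles_results ar
    JOIN my_resolved_concept_set cs
      ON cs.concept_id = CAST(ar.stratum_1 AS INT)
    WHERE ar.analysis_id = 1800;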

alondhe commented 7 years ago

Hi all -- I've been working on a version of the Achilles R package that separates each of the analyses into separate SQL CTAS scripts and uses the cluster package to parallelize their execution before creating one unioned table. Is the intention here to optimize the dissemination of the Achilles results into Atlas, or the generation of the Achilles results? I want to make sure I'm not duplicating work or developing a solution that may not be optimal for the future state being discussed.

anthonysena commented 7 years ago

Adding myself here as an interested party that would like to participate in discussions on this topic. @PRijnbeek thanks for digging into this and for these proposals. As a thought, it might be useful to set up a call among people on this thread to discuss the solutions that Peter and team have proposed and use the architecture calls to think through longer-term objectives such as SOLR, MongoDB, etc.

leeevans commented 7 years ago

I'd also be interested to participate in discussions on this topic.

Some aspects of the discussion in the issue below may be relevant input when considering an alternative OHDSI results storage approach (such as a document store):

https://github.com/OHDSI/WebAPI/issues/140

PRijnbeek commented 7 years ago

@chrisknoll Yes, I think the move to queries was also needed for the security layer, right? I think the approach is fine as long as it is fast enough. Subsetting on a concept set is a nice thing indeed.

@alondhe No, there is no duplication with your work. I think parallelizing Achilles can be a good thing, because it is quite slow. It may actually become faster if the hierarchy table is created first and then re-used in Achilles R as well (we have not looked into the effect of that yet, but we will).

By the way, we now have a version of Atlas on Amazon with this fix applied, and working with the datasources part is now very fast. Some final tweaks are being made on the Achilles R side, which now produces the hierarchy lookup tables; after that we will open a pull request against develop to get your detailed feedback.

We also added a new graph in a couple of domains that shows, e.g., the number of patients with x or more occurrences of a concept, as a scaled histogram. This is very handy for questions like "How many people in the database have at least X BMI measurements?"
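
Roughly, the query behind such a histogram looks like this on PostgreSQL; the measurement table is the standard OMOP CDM one, and the concept id is only an example, not our exact implementation:

    -- For each threshold 1..10, count the persons having at least that
    -- many records of the concept.
    WITH per_person AS (
      SELECT person_id, COUNT(*) AS n_records
      FROM cdm.measurement
      WHERE measurement_concept_id = 3038553  -- example: a BMI concept
      GROUP BY person_id
    )
    SELECT t.threshold,
           COUNT(*) AS persons_with_at_least_n
    FROM per_person p
    JOIN generate_series(1, 10) AS t(threshold)
      ON p.n_records >= t.threshold
    GROUP BY t.threshold
    ORDER BY t.threshold;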

chrisknoll commented 7 years ago

@PRijnbeek, the security layer is handled through WebAPI URI request filtering, so we can secure anything that is accessed via a URL. But any dynamic filtering... I'm not sure how that would be implemented, but someone might have the answer to that.

anthonysena commented 6 years ago

Closing since this was addressed as part of OHDSI/WebAPI#227. Part of this conversation also relates to OHDSI/WebAPI#580 and OHDSI/WebAPI#140, which are still open, so I'm linking this issue to those conversations.