darwin-eu / CodelistGenerator

Identifying relevant concepts from the OMOP CDM vocabularies
https://darwin-eu.github.io/CodelistGenerator/

Running into limitations with very large concept sets #209

Open ablack3 opened 2 months ago

ablack3 commented 2 months ago

I'm working on a study where I have several concept sets that need to be included in characterization.

These concept set expressions include some concepts in the regimen domain, which is causing a problem: the CohortCharacteristics / patient profiles functions I need are failing since, as far as I can tell, they don't support regimen concepts.

Also, some data partners have the regimen table and others do not, since the study needs to run on different OMOP CDM versions. The PI did try to exclude the regimen concepts from the concept sets, but when I ran the study I found that many regimen concepts were still being included, causing the code to break.

As a fix I tried excluding the regimen concepts in my study code. To do this I'm using CodelistGenerator.

concepts <- CodelistGenerator::codesFromConceptSet(cdm = cdm, 
                                                   path = here::here("concept_sets"),
                                                   type = "codelist_with_details")

This function fails (on redshift) inside CodelistGenerator:::addDetails.

The reason is that the concept set table passed into insertTable has about 300,000 rows. So basically we are starting with small JSON files (concept set expressions), and CodelistGenerator::codesFromConceptSet is downloading 300,000 concepts and then uploading them to the database, which fails on Redshift.

Perhaps I could loop over these files one by one or something, but I was wondering if we could improve how this function works so that it avoids uploading data to the database. https://github.com/darwin-eu/CodelistGenerator/blob/9038973c0a3d57a8a786194ae8d59800b4daeb02/R/codesFromConceptSet.R#L372

It seems like we could upload only the expression, realize the full concept list with descendants in the database, join to the concept table to get details, and then download the result, without the need to upload the full set of concepts.
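The server-side approach described above could be sketched roughly as follows. This is a hypothetical illustration, not the package's actual code: it assumes a `cdm` object whose tables are dplyr table references (as with CDMConnector), a small `expr_tbl` data frame with `concept_id` and `include_descendants` columns parsed from the JSON, and that `omopgenerics::insertTable()` is available to write the small expression table:

```r
library(dplyr)

# 1. Upload only the (small) concept set expression, not the realized concepts.
cdm <- omopgenerics::insertTable(cdm, name = "cs_expression", table = expr_tbl)

# 2. Realize descendants inside the database via concept_ancestor.
realised <- cdm$cs_expression |>
  filter(.data$include_descendants) |>
  inner_join(cdm$concept_ancestor,
             by = c("concept_id" = "ancestor_concept_id")) |>
  select(concept_id = "descendant_concept_id") |>
  union_all(cdm$cs_expression |> select("concept_id")) |>
  distinct()

# 3. Join to the concept table for details, then download only the result.
details <- realised |>
  inner_join(cdm$concept, by = "concept_id") |>
  collect()
```

The key point is that the only upload is the expression itself; the 300,000-row expansion happens server-side and is transferred once, downward.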

We could also change insertTable to handle large tables better, for example by uploading in batches.
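For the batching idea, a wrapper along these lines might work. It is a sketch only: the function name is made up, and it assumes insertTable (or the underlying DBI append) can add rows to an existing table, which would need checking:

```r
# Hypothetical batched upload: split a large data frame into chunks of
# `batchSize` rows and insert them one chunk at a time.
insertTableBatched <- function(cdm, name, table, batchSize = 50000) {
  n <- nrow(table)
  starts <- seq(1L, n, by = batchSize)
  for (s in starts) {
    chunk <- table[s:min(s + batchSize - 1L, n), , drop = FALSE]
    # Assumption: the first call creates the table and subsequent calls
    # append to it; the real insertTable API may need an append flag.
    cdm <- omopgenerics::insertTable(cdm, name = name, table = chunk)
  }
  cdm
}
```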

screenshot from debugging:

edward-burn commented 2 months ago

hmmm yes, that is an extremely large set of codes..... Happy to work on optimising this for larger inputs, but that will take some time. Given your use case, adding a function like subsetToDomain might also make sense (to go with the other subset functions)?
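A minimal sketch of what such a subsetToDomain could look like, assuming (as the package's other subset functions suggest, but not confirmed here) that a codelist_with_details object behaves like a named list of data frames carrying a domain_id column:

```r
# Hypothetical subsetToDomain(): keep only codes in the given domain(s).
subsetToDomain <- function(codelist, domain) {
  lapply(codelist, function(tbl) {
    tbl[tolower(tbl$domain_id) %in% tolower(domain), , drop = FALSE]
  })
}

# For the use case above, the inverse (dropping regimen codes) would be:
# lapply(concepts, \(tbl) tbl[tolower(tbl$domain_id) != "regimen", ])
```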

In the meantime I would suggest looping through files/codelists and passing them one by one to the function, or filtering your codelist to only codes in use based on Achilles counts: https://darwin-eu.github.io/CodelistGenerator/reference/restrictToCodesInUse.html
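The one-by-one workaround could look something like the following, mirroring the earlier call. It assumes path accepts a single JSON file; if it only accepts a directory, each file could instead be copied into its own temporary directory first:

```r
# Process each concept set expression file individually to keep the
# per-call upload small.
files <- list.files(here::here("concept_sets"),
                    pattern = "\\.json$", full.names = TRUE)

concepts <- list()
for (f in files) {
  concepts[[basename(f)]] <- CodelistGenerator::codesFromConceptSet(
    cdm = cdm,
    path = f,
    type = "codelist_with_details"
  )
}
```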