ellessenne / comorbidity

An R package for computing comorbidity scores.
https://ellessenne.github.io/comorbidity/
GNU General Public License v3.0
80 stars 21 forks source link

Sparse table support - new feature #71

Closed SoccerGeekPhD closed 4 months ago

SoccerGeekPhD commented 8 months ago

The wide data frame returned by comorbidity may consume a lot of memory and does not match the input format.

Would changing the output to a sparse format of two columns {ID, Name} just like the input to the function be helpful for memory and performance?

For example, my data science team in a large insurance company takes a claim with up to 9 ICD10 codes on a claim with a specific service date. This claim is pivoted to {patientID, ICD10, date, ...} then filtered to unique {patientID, ICD10} over a time frame to create the input to comorbidity(). So can you keep this narrow format for output?

ellessenne commented 7 months ago

Hi, thanks for reporting this. I wouldn't change the default behavior, but I am keen on adding an option to return a long format dataset instead. Let me have a look into it!

ellessenne commented 4 months ago

Hi again @SoccerGeekPhD, I experimented a little bit on this and can share the following.

  1. First, for background, all internal calculations are already done in long format, which is then "reshaped wide" for the output;

  2. I tried omitting the final step of the algorithm (the "reshape wide" part), and that reduces computation time by <10%. We are talking about a reduction of <1 second on a test dataset with 10,000,000 codes from 100,000 subjects.

  3. Moreover, the resulting dataset in long format uses about the same amount of memory compared to the wide dataset - actually a little more, 7.2 MB vs 7.38 MB for the wide and long output formats, respectively. The main difference is that the long datasets needs to store information regarding which comorbidity condition was detected for every input subject, which takes more memory than storing ones and zeros in the wide output format.

  4. Updating the internal algorithm to use sparse matrices (e.g., relying on Matrix::sparseMatrix()) would be a huge undertaking as it requires rebuilding and refactoring the entire package from the ground up, which I definitely do not have the time to do at the moment.

I was wondering, do you have any specific benchmark on the performance of the package that you can share here? I will close this issue for now, but feel free to reopen it if you have more details to share.

Alternatively, I would recommend checking out the {icd} package (currently only available on GitHub), which uses sparse matrix multiplication and compiled code under the hood, see e.g. this draft paper. Note that the {comorbidity} package is much faster now, so the benchmarks in the article are not accurate anymore.

I hope this helps,

Alessandro