AI-SDC / ACRO

Tools for the Automatic Checking of Research Outputs. These are the tools for researchers to use as drop-in replacements for commands that produce outputs in Stata, Python, and R.
MIT License

add survival analysis table function #145

Closed mahaalbashir closed 1 year ago

mahaalbashir commented 1 year ago

This is not the full implementation of the Kaplan-Meier estimator. In this version:

  1. The researcher calls "surv_func" and chooses the output type: a table or a plot. The other parameters of this function are the event times and the status of the event (death or censoring).
  2. The function calculates the survival table and checks whether survivor(t-1) - survivor(t) < threshold.
  3. If the number of survivors at a given time violates the threshold rule, the mask value for every cell in that row is set to true, so that the suppression function suppresses the whole row. Any column can be calculated from the others, so suppressing only the "number at risk" column would not make the table safe.
  4. If the researcher chooses a table as the output:
    1. The survival table, or the suppressed survival table, is returned to the researcher and saved to the acro object.
  5. If the researcher chooses a plot as the output:
    1. If suppress is false, the plot function is used for the plot. The plot is saved under a user-specified name in a folder called acro_artifacts, and is added to the acro object with the name of the saved file as its output.
    2. If suppress is true, a new calculation is needed to obtain the rounded number of survivors, the rounded number of deaths, and the survival function. The survival function can then be plotted against time (this is yet to be developed).
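
Steps 2 and 3 can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in this pull request: it assumes that survivor(t) means the number at risk at time t, that the first row (which has no predecessor) is never flagged, and that time acts as the row index and is therefore left unmasked. The helper names `survival_table`, `row_mask` and `suppress`, and the threshold value, are hypothetical.

```python
from collections import Counter


def survival_table(times, status):
    """Build a Kaplan-Meier table with one row per time at which an event occurs.

    status is 1 for an event (death) and 0 for censoring.
    """
    exits = Counter(times)  # everyone leaving the risk set at each time
    events = Counter(t for t, s in zip(times, status) if s == 1)
    n_at_risk = len(times)
    surv_prob = 1.0
    rows = []
    for t in sorted(exits):
        n_events = events.get(t, 0)
        if n_events:
            surv_prob *= 1 - n_events / n_at_risk
            rows.append(
                {"time": t, "surv_prob": surv_prob,
                 "num_at_risk": n_at_risk, "num_events": n_events}
            )
        n_at_risk -= exits[t]
    return rows


def row_mask(rows, threshold):
    """Flag rows where survivor(t-1) - survivor(t) < threshold.

    Assumption: survivor(t) is the number at risk at time t, and the first
    row is never flagged because it has no predecessor.
    """
    mask = [False] * len(rows)
    for i in range(1, len(rows)):
        drop = rows[i - 1]["num_at_risk"] - rows[i]["num_at_risk"]
        mask[i] = drop < threshold
    return mask


def suppress(rows, mask):
    """Blank every value cell of a flagged row; time is kept as the row index."""
    return [
        {key: (value if key == "time" or not flagged else None)
         for key, value in row.items()}
        for row, flagged in zip(rows, mask)
    ]


times = [1, 1, 2, 3, 3, 3]
status = [1, 1, 0, 1, 1, 1]
table = survival_table(times, status)
mask = row_mask(table, threshold=5)  # hypothetical threshold
safe_table = suppress(table, mask)
```

Masking the whole row rather than a single column reflects the point in step 3: leaving any one of the columns visible would let a reader reconstruct the suppressed count.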

What should be added:

  1. The tests in test_initial.py

Questions:

  1. What should the output table look like? Should it have the four columns of the survival table (Surv prob | Surv prob SE | num at risk | num events), with the threshold written across the whole row when it violates the rule, or something else?
  2. What should the suppressed table look like? Should the cells that violate the threshold be suppressed, or should it be the recalculated table with the rounded number of survivors?
  3. It seems that the prettify_table_string function doesn't handle column names with spaces.
  4. The Word document provided by WP1 mentions that "having no periods with the number of deaths below the threshold may be overkill for most practical graphs, so an alternative could be that the time unit is relative to the start time and starting times varied amongst data subjects". Does this mean we don't need to check the number of deaths when we are dealing with plots?
  5. When suppressing the plot, the rounded survival function is calculated and plotted against time using plot from matplotlib, whereas the actual plot uses the plot function from statsmodels.duration.survfunc.SurvfuncRight. As a result, the suppressed plot currently doesn't show the censored data. Do we want to add this to the plot somehow?
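
One possible reading of the rounded survival function mentioned in question 5 is sketched below. The thread does not say how the rounding is done, so rounding to the nearest multiple of a base (here 5) is an assumption, as are the names `round_to_base` and `rounded_survival` and the example counts.

```python
def round_to_base(n, base):
    """Round a non-negative count to the nearest multiple of base (halves round up)."""
    return base * ((n + base // 2) // base)


def rounded_survival(rows, base):
    """Recompute the survival function from rounded counts.

    rows holds (time, num_at_risk, num_events) tuples in time order.
    Returns (time, rounded_at_risk, rounded_events, surv_prob) tuples.
    """
    surv_prob = 1.0
    result = []
    for time, n_at_risk, n_events in rows:
        at_risk_r = round_to_base(n_at_risk, base)
        # Rounded events can never exceed the rounded number at risk.
        events_r = min(round_to_base(n_events, base), at_risk_r)
        if at_risk_r > 0:
            surv_prob *= 1 - events_r / at_risk_r
        result.append((time, at_risk_r, events_r, surv_prob))
    return result


# Made-up counts for illustration only.
rows = [(1, 20, 3), (2, 16, 7), (3, 8, 1)]
curve = rounded_survival(rows, base=5)
```

The (time, surv_prob) pairs from `curve` could then be drawn as a step function, for example with `matplotlib.pyplot.step(..., where="post")`. Censoring marks would have to be added separately, which is the gap the question raises.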

codecov[bot] commented 1 year ago

Codecov Report

Merging #145 (82c6cdc) into main (40169b5) will decrease coverage by 0.47%. The diff coverage is 95.29%.

:exclamation: Current head 82c6cdc differs from pull request most recent head 3bab95a. Consider uploading reports for the commit 3bab95a to get more accurate results

@@             Coverage Diff             @@
##              main     #145      +/-   ##
===========================================
- Coverage   100.00%   99.53%   -0.47%     
===========================================
  Files            7        7              
  Lines          783      866      +83     
===========================================
+ Hits           783      862      +79     
- Misses           0        4       +4     
| Files Changed | Coverage Δ |
|---|---|
| acro/record.py | 98.58% <78.57%> (-1.42%) :arrow_down: |
| acro/acro.py | 99.64% <98.59%> (-0.36%) :arrow_down: |

mahaalbashir commented 1 year ago

@rpreen do you think I should move some of the functions to for example utils.py to solve the (too-many-lines) error?

rpreen commented 1 year ago

> @rpreen do you think I should move some of the functions to for example utils.py to solve the (too-many-lines) error?

The acro.py module needs to be broken up, but I think it's more than just moving things to utils.py. In fact, some of the functions in utils.py are really specific to the pandas crosstab and pivot_table functions and aren't really utilities at all. Really, we want one module with the pandas functions in, another for the statsmodels functions, and another for the survival analysis stuff. The thing is that we probably still want to maintain all the functions within one Acro class, so I'm not entirely sure the best way to do it; maybe using multiple inheritance or some other way... Maybe there is another architecture that still has a nice usability for the researcher; it needs some thought.

rpreen commented 1 year ago

I think multiple inheritance is probably the way to go. Are you comfortable trying to break it up that way?
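
A rough sketch of what that split could look like, assuming one class per group of functions and a single researcher-facing class that inherits from all of them. The class names `Tables`, `Regression` and `Survival`, the `_record` helper, and the `ols` method are hypothetical placeholders; the method bodies only log the call so the example stays self-contained.

```python
class Tables:
    """Hypothetical home for the pandas wrappers (would live in its own module)."""

    def crosstab(self, *args, **kwargs):
        return self._record("crosstab")

    def pivot_table(self, *args, **kwargs):
        return self._record("pivot_table")


class Regression:
    """Hypothetical home for the statsmodels wrappers."""

    def ols(self, *args, **kwargs):
        return self._record("ols")


class Survival:
    """Hypothetical home for the survival analysis functions."""

    def surv_func(self, *args, **kwargs):
        return self._record("surv_func")


class ACRO(Tables, Regression, Survival):
    """Single class the researcher uses; shared state lives here."""

    def __init__(self):
        self.outputs = []

    def _record(self, command):
        # Placeholder for the real disclosure checks and output recording.
        self.outputs.append(command)
        return command


acro = ACRO()
acro.crosstab()
acro.surv_func()
```

With this layout the researcher still calls everything on one object, while each group of functions can sit in its own module, which is what would address the too-many-lines warning.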

mahaalbashir commented 1 year ago

That is a good idea. I can try doing that. Is it better to check this pull request for the survival analysis first and then do the reformatting separately?