btheodorou99 / HALO_Inpatient


How to determine the bucket ranges? #7

Closed BreezeHavana closed 11 months ago

BreezeHavana commented 11 months ago

Hi, I noticed the paper mentions that the bucket ranges were determined by seeking advice from clinicians, among other things. I drew histograms, and the data in a bucket looks normally distributed. I tried Scott's rule and the Freedman-Diaconis rule, but neither seems to work well. Could you please share any code used to determine the ranges? Thanks!

btheodorou99 commented 11 months ago

Our experimental ranges were provided by a clinician, but we have also written some code for standard bucket designs that are evenly spaced in terms of either values or percentiles. Both are below (they can be made more efficient by saving some variables, but we present them in one line each for simplicity):

    import numpy as np

    NUM_BUCKETS = 10
    EPSILON = 1e-10

    # Buckets of equal width between the minimum and maximum value
    lab_buckets = [(min(lab_values) + i * (max(lab_values) - min(lab_values)) / NUM_BUCKETS, min(lab_values) + (i + 1) * (max(lab_values) - min(lab_values)) / NUM_BUCKETS + (0 if i < NUM_BUCKETS - 1 else EPSILON)) for i in range(NUM_BUCKETS)]

    # Buckets bounded by evenly spaced percentiles
    lab_buckets = [(np.percentile(lab_values, 100 * i / NUM_BUCKETS), np.percentile(lab_values, 100 * (i + 1) / NUM_BUCKETS) + (0 if i < NUM_BUCKETS - 1 else EPSILON)) for i in range(NUM_BUCKETS)]

Also, note that the epsilon is there because the buckets are [inclusive, exclusive), so we add a tiny amount to the upper bound of the last bucket to include the max value.
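For readers who want to check the [inclusive, exclusive) behaviour, here is a minimal sketch of both designs together with a bucket lookup. It is not the repository's code: the function names `equal_width_buckets`, `percentile_buckets`, and `bucket_index`, as well as the sample `lab_values`, are made up for illustration, and the edges are built with `np.linspace` rather than the one-line comprehensions.

```python
import numpy as np

EPSILON = 1e-10  # widens the last bucket so the maximum value falls inside it


def equal_width_buckets(values, num_buckets=10):
    """Split [min, max] into num_buckets intervals of equal width."""
    lo, hi = float(np.min(values)), float(np.max(values))
    edges = np.linspace(lo, hi, num_buckets + 1)
    edges[-1] += EPSILON
    return list(zip(edges[:-1], edges[1:]))


def percentile_buckets(values, num_buckets=10):
    """Use evenly spaced percentiles as bucket edges."""
    edges = np.percentile(values, np.linspace(0, 100, num_buckets + 1))
    edges[-1] += EPSILON
    return list(zip(edges[:-1], edges[1:]))


def bucket_index(value, buckets):
    """Return the index of the [lower, upper) bucket holding value, or None."""
    for idx, (lower, upper) in enumerate(buckets):
        if lower <= value < upper:
            return idx
    return None


# Hypothetical sample data, not taken from the dataset
lab_values = [1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 8.0, 9.5, 10.0, 11.0]
width_buckets = equal_width_buckets(lab_values, num_buckets=5)
pct_buckets = percentile_buckets(lab_values, num_buckets=5)
```

Without the epsilon, `bucket_index(max(lab_values), ...)` would return `None`, because the maximum equals the exclusive upper bound of the last bucket. One caveat with the percentile design: if many values repeat, adjacent percentiles can coincide and produce a zero-width bucket that never matches anything.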

BreezeHavana commented 11 months ago

thanks