InstituteforDiseaseModeling / malaria-model_validation


speed up plotting script for site navrongo_2000 #52

Closed YeChen-IDM closed 1 year ago

YeChen-IDM commented 1 year ago

At some point we had a sim_duration_survey_sampling.csv file of about 4 MB, and the performance in Python was acceptable. Currently this site generates only a patient_reports.csv file of 1.2 GB, and dataframe operations on data of this size are too slow.

We should look for a way to improve the performance of plotting the results for this site.

YeChen-IDM commented 1 year ago

Some ideas:

  1. Can we reduce the number of seeds from 20 to a lower number?
  2. Can we speed up the pandas operations (DataFrame.apply() vs. itertuples())?
  3. Can we change the two for loops?

YeChen-IDM commented 1 year ago

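    import datetime
    import time

    # `sim` (the simulation DataFrame) and `first_ref_date` come from the plotting script.

    # Option 1: row-wise DataFrame.apply
    start_t = time.time()
    sim['date'] = sim.apply(lambda row: first_ref_date + datetime.timedelta(days=int(row.simday)), axis=1)
    end_t = time.time()
    print(end_t - start_t)

    # Option 2: list comprehension over the simday column
    start_t = time.time()
    sim['date_2'] = [first_ref_date + datetime.timedelta(days=int(simday)) for simday in sim['simday']]
    end_t = time.time()
    print(end_t - start_t)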

Result (elapsed seconds; DataFrame.apply first, then the list comprehension):

26.090492486953735
0.9626128673553467
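
The list comprehension still builds each date in a Python-level loop. A fully vectorized conversion with `pd.to_timedelta` is another option. It was not benchmarked in this thread, so any speedup on the 1.2 GB file is an assumption. A minimal sketch, where the `first_ref_date` value and the small `sim` frame are made-up stand-ins for the real data:

```python
import datetime

import pandas as pd

# Hypothetical stand-ins for the real reference date and simulation output.
first_ref_date = datetime.date(2000, 1, 1)
sim = pd.DataFrame({"simday": [0, 1, 31, 366]})

# List-comprehension version from the timing test, kept for comparison.
sim["date_2"] = [
    first_ref_date + datetime.timedelta(days=int(simday)) for simday in sim["simday"]
]

# Vectorized version: one Timestamp plus a whole column of day offsets.
sim["date_3"] = pd.Timestamp(first_ref_date) + pd.to_timedelta(sim["simday"], unit="D")

# Check that both columns describe the same calendar days.
same_days = bool((sim["date_3"].dt.date == sim["date_2"]).all())
print(same_days)
```

Note that `date_3` is a `datetime64` column rather than a column of `datetime.date` objects, so downstream code that compares against plain dates may need a small adjustment.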

YeChen-IDM commented 1 year ago

Another change I made is to filter the main data frame by seed first and then perform the other operations on the filtered subset, which is faster than performing them on the main data frame.
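
One way to express the filter-by-seed step in pandas is a single `groupby`, sketched below. The column names (`seed`, `simday`, `positive`) and the helper `summarize_one_seed` are hypothetical, and whether PR #53 uses `groupby` or explicit boolean filtering is not stated in this thread.

```python
import pandas as pd

# Hypothetical miniature of the report; column names and values are made up.
sim = pd.DataFrame({
    "seed": [0, 0, 1, 1, 1],
    "simday": [0, 1, 0, 1, 2],
    "positive": [1, 0, 1, 1, 0],
})


def summarize_one_seed(seed_df: pd.DataFrame) -> float:
    """Stand-in for the per-seed work; here, the fraction of positive rows."""
    return float(seed_df["positive"].mean())


# groupby splits the frame once, so each seed's rows are selected a single time
# and every later operation only touches that smaller subset.
prevalence_by_seed = {
    int(seed): summarize_one_seed(seed_df) for seed, seed_df in sim.groupby("seed")
}
print(prevalence_by_seed)
```

The point of the design is that any expensive row-level work runs on one seed's rows at a time instead of repeatedly scanning the full frame.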

YeChen-IDM commented 1 year ago

This is done in PR #53.