dbekaert / RAiDER

Raytracing Atmospheric Delay Estimation for RADAR
Apache License 2.0

STATS: Re-running program efficiently #94

Open sssangha opened 4 years ago

sssangha commented 4 years ago

This ticket outlines the motivation and proposed strategy for adding the ability to re-run the code efficiently without restarting from scratch every time. E.g., if a user specifies all of the same input arguments the second time, and the only difference is that the input CSV has been expanded to include 2 more months of data, just append to the existing output CSV (outlined in more detail below) without re-running the entire analysis. In the end, we should only regenerate outputs from scratch when the user specifies a new spatiotemporal argument (i.e., a different bounding box, grid spacing, time-span, etc.).

For each new statsclass call, I propose writing the following files to an output subdirectory: (1) the modified pandas dataframe, i.e. the primary output CSV (which is currently passed around in memory anyway), (2) the dictionary of user input arguments (omitting the number of threads and verbose mode, which aren't relevant), (3) a duplicate of the input CSV, and (4) a CSV record of variogram fits for each given timeslice and gridcell.
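
As a rough illustration, here is a minimal sketch of what writing these four run artifacts could look like (the file names and the stats_df/variogram_df variables are hypothetical placeholders, not existing RAiDER code):

import json
import os
import shutil

def write_run_record(outdir, stats_df, input_csv, user_args, variogram_df):
    os.makedirs(outdir, exist_ok=True)
    # (1) the primary output CSV (the modified pandas dataframe)
    stats_df.to_csv(os.path.join(outdir, 'stats_output.csv'), index=False)
    # (2) the user input arguments, minus the irrelevant keys
    args = {k: v for k, v in user_args.items() if k not in ('num_threads', 'verbose')}
    with open(os.path.join(outdir, 'run_args.json'), 'w') as f:
        json.dump(args, f, sort_keys=True)
    # (3) a duplicate of the input CSV for later comparison
    shutil.copy(input_csv, os.path.join(outdir, 'input_copy.csv'))
    # (4) the variogram fits per timeslice and gridcell
    variogram_df.to_csv(os.path.join(outdir, 'variogram_fits.csv'), index=False)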

Once a user relaunches the statsclass, the logic proceeds as follows (a sketch of this decision logic follows the list):

  1. Compare the new user inputs with the record of user input arguments from the previous run. If they are the same, continue; if they differ, generate all outputs from scratch.
  2. Then examine the difference between the new input CSV and the duplicate input CSV from the previous run. If they are the same, continue. If the spatiotemporal sampling differs as described above, then re-grid, regenerate the primary output CSV and the duplicate input CSV, and restart the variogram analysis. But if the record has just been expanded in time, then append to the primary output CSV and pick up the variogram analysis from where it left off.
  3. If no differences are flagged, just print a message stating that the input CSV/arguments haven't changed, and thus the outputs have not been modified.
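
A minimal sketch of that decision logic, assuming the run artifacts proposed above (the helper name and returned labels are hypothetical; also, this version treats any change other than a pure row-append as a full re-run, which is stricter than the spatiotemporal check described above):

import json
import os
import pandas as pd

def decide_rerun(outdir, new_args, new_csv):
    # step 1: compare the new arguments against the stored record
    with open(os.path.join(outdir, 'run_args.json')) as f:
        old_args = json.load(f)
    if new_args != old_args:
        return 'regenerate_all'
    # steps 2-3: compare the new input CSV against the stored duplicate
    old_df = pd.read_csv(os.path.join(outdir, 'input_copy.csv'))
    new_df = pd.read_csv(new_csv)
    if new_df.equals(old_df):
        return 'no_change'
    # a pure row-append is treated as an expansion in time; anything
    # else (e.g. different sampling) triggers a full re-run
    if len(new_df) > len(old_df) and new_df.head(len(old_df)).equals(old_df):
        return 'append_in_time'
    return 'regenerate_all'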

This is the cleanest logic I could formulate at the moment, and it may make more sense once I have a working version implementing these proposed changes. But if any of you have specific comments or requested features/adjustments relevant to these efforts, please discuss them here.

Furthermore, this conceptualization should no longer necessitate separating the stats class from the plotting: if this is executed properly, the logic of the code should handle such checks and plotting seamlessly under one primary code.

sssangha commented 4 years ago

Storing a physical dictionary of input arguments would not require much overhead. However, a clean yet less transparent alternative would be to generate a unique hash and store it to a hash file that can be compared against successive user inputs.

Just as a note, from what I've seen it is recommended that such a dictionary of input arguments be serialized as JSON before hashing. E.g.:

import hashlib
import json

# serialize the arguments deterministically, then hash the result
a = {'inputfile': '/path/to/file', 'bbox': '-9 9 10 12'}
print(hashlib.sha1(json.dumps(a, sort_keys=True).encode()).hexdigest())

Source: https://stackoverflow.com/questions/16092594/how-to-create-a-unique-key-for-a-dictionary-in-python
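
For completeness, the surrounding hash-file workflow might look like the following sketch (the helper and the args_hash.txt file name are hypothetical):

import hashlib
import json
import os

def args_changed(outdir, args):
    # hash the serialized arguments and compare against the stored record
    new_hash = hashlib.sha1(json.dumps(args, sort_keys=True).encode()).hexdigest()
    hash_file = os.path.join(outdir, 'args_hash.txt')
    old_hash = open(hash_file).read().strip() if os.path.exists(hash_file) else None
    with open(hash_file, 'w') as f:
        f.write(new_hash)
    return new_hash != old_hash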

Any strong feelings one way or the other? I.e., for storing hash files as opposed to a full dictionary of input arguments (which should be very small anyway).

jlmaurer commented 4 years ago

I think an HDF5 file with a specified set of attributes would work well. This is similar to what MintPy does, where it queries several HDF5 files at each step to decide what to run. It could store the 'intermediate' data such as grid bins, analysis attributes, etc., and then the plotting routine could simply query the file.
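
A minimal sketch of that pattern with h5py (the file, attribute, and dataset names are hypothetical placeholders):

import h5py
import numpy as np

# store the run attributes and intermediate grid data side by side
with h5py.File('stats_record.h5', 'a') as f:
    f.attrs['bbox'] = '-9 9 10 12'
    f.attrs['grid_spacing'] = 0.1
    if 'grid_bins' not in f:
        f.create_dataset('grid_bins', data=np.zeros((10, 10)))

# on a later run (or in the plotting routine), query before recomputing
with h5py.File('stats_record.h5', 'r') as f:
    if f.attrs.get('bbox') == '-9 9 10 12':
        grid_bins = f['grid_bins'][:]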