Closed laestrada closed 1 year ago
In this branch, I updated the clustering algorithm to address issues with non-contiguous state vector elements.
New layered-kmeans algorithm: This new algorithm keeps the spirit of Hannah's original version but lends itself better to automation and minimizes the issues with contiguity. In essence, it successively layers k-means-generated state vector labels, starting at high resolution and progressively reducing the resolution of each subsequent layer, and assigns the highest-sensitivity labels from each layer to the final state vector.
In more detail, the basic steps of the algorithm are:
1. Initialize a state vector with all labels set to 0 in the region of interest. A 0 marks a pixel that has not yet been assigned.
2. Generate a list of cluster pairs based on the pixel sensitivities and the average DOFS per element. This gives a reasonable first guess at the optimal cluster pairings, but it does not take spatial proximity into account; the subsequent steps are relied on to ensure contiguity.
3. Use the highest-resolution aggregation level in the list of cluster pairs to determine the number of clusters to use for k-means.
4. Use k-means to cluster all 0-labeled pixels on 3 features (lat, lon, sensitivity) with the number of clusters from step 3. This produces a single layer of approximately evenly sized clusters.
5. For each state vector element from step 4, calculate the average sensitivity per grid cell.
6. Assign labels to the actual state vector. Only the x highest-sensitivity clusters are assigned, where x comes from the cluster pair selected in step 3.
7. Repeat from step 3 using the next-highest-resolution aggregation level and only the unassigned grid cells (0 values).
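The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the `(cells_per_cluster, n_keep)` reading of a cluster pair, the tiny pure-Python k-means stand-in, and the choice to keep every cluster in the final layer are all assumptions made for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means. Returns one cluster index (0..k-1) per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k input points as initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def layered_kmeans(cells, cluster_pairs):
    """cells: list of (lat, lon, sensitivity) tuples.
    cluster_pairs: (cells_per_cluster, n_keep) tuples, highest resolution first
    (hypothetical format, assumed for this sketch).
    Returns one integer label per cell; 0 would mean 'unassigned'."""
    labels = [0] * len(cells)  # step 1: everything starts unassigned
    next_label = 1
    for layer, (cells_per_cluster, n_keep) in enumerate(cluster_pairs):
        unassigned = [i for i, lab in enumerate(labels) if lab == 0]
        if not unassigned:
            break
        # Step 3: number of clusters implied by this aggregation level.
        k = min(len(unassigned), max(1, round(len(unassigned) / cells_per_cluster)))
        # Step 4: cluster only the unassigned cells on (lat, lon, sensitivity).
        assign = kmeans([cells[i] for i in unassigned], k)
        groups = {}
        for i, c in zip(unassigned, assign):
            groups.setdefault(c, []).append(i)
        # Step 5: rank clusters by average sensitivity per grid cell.
        ranked = sorted(
            groups.values(),
            key=lambda g: sum(cells[i][2] for i in g) / len(g),
            reverse=True,
        )
        # Assumption: the final layer keeps every cluster so no cell stays at 0.
        if layer == len(cluster_pairs) - 1:
            n_keep = len(ranked)
        # Step 6: only the n_keep highest-sensitivity clusters get final labels.
        for g in ranked[:n_keep]:
            for i in g:
                labels[i] = next_label
            next_label += 1
    return labels

# Toy 6x6 grid whose sensitivity increases toward the last cell.
cells = [(float(i), float(j), (6 * i + j) / 35) for i in range(6) for j in range(6)]
labels = layered_kmeans(cells, [(1, 3), (4, 2), (16, 1)])
```

In this toy run the three most sensitive cells each become their own element in the first layer, and coarser layers cover the rest. The features are left unscaled here for brevity; a real implementation would probably want to normalize lat, lon, and sensitivity before clustering.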
Other notes on this new algorithm: By default, the lowest-resolution aggregation level is a 4x5 degree element, and the algorithm "saves" the number of elements needed to apply this level in the final layer.
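As a rough illustration of what "saving" elements for the final layer could mean, the sketch below counts how many 4x5 degree elements are needed to cover a given number of native pixels. The helper name and the 0.25 x 0.3125 degree native resolution are assumptions for the example and are not taken from this PR.

```python
import math

def reserved_elements(n_native_pixels, native_res=(0.25, 0.3125), coarse_res=(4.0, 5.0)):
    """Number of coarse elements needed to cover n_native_pixels (hypothetical helper)."""
    # Native pixels that fit inside one coarse element (lat count * lon count).
    pixels_per_coarse = round(coarse_res[0] / native_res[0]) * round(coarse_res[1] / native_res[1])
    return math.ceil(n_native_pixels / pixels_per_coarse)
```

Under these assumptions one 4x5 element holds 16 x 16 = 256 native pixels, so 1000 leftover pixels would need 4 reserved elements.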
Example generated SV using CONUS ROI:
Because we set the HEMCO output to only be END for the preview, will there always be one file here with one time step that corresponds to the average emissions over the entire inversion period?
There is a warning for divide by zero here when m = 0, but I'm not sure how we would handle that. It's not causing any issues right now anyway.
force_native_res_pixels throws an error if the lat,lon pair is outside the state vector (see here).
Some quick notes:
aggregation.py runs out of memory and time (and doesn’t print an error to imi_output.log). You might be addressing this in the feature/memory-settings branch right now, but just something to keep in mind. It also took ~1 hour for the US domain using 8 cores on huce_cascade.
find_cluster_pairs is pretty cool and is not code structure I'm used to.
Looks really good!
Thanks for the comments @nicholasbalasus -- I believe I have addressed everything except the out of memory error which will be addressed by feature/memory-settings
Besides some minor quibbles with the code comments/documentation, this looks good to me!
All comments have been addressed -- merging!
This PR automatically generates the clustering pairs using the following algorithm:
Still need to update the documentation. New clustering section for config.yml is: