Feature/updated-clustering approach

laestrada commented 1 year ago

This PR automatically generates the clustering pairs using the following algorithm:

find the average sensitivity/ statevector element using the total estimated dofs/ desired number of elements. this is the threshold used for each element.
sort the estimated sensitivities in descending order
group the largest sensitivities until adding the next element would exceed the threshold. This is a cluster pair.
remove the selected elements from the sensitivity list
iterate through 3 and 4 until no elements are left This is the basic idea but it also covers some basic edge cases like:
- If the sensitivity of a single element is greater than the threshold, then the element is added at native resolution.
- If the algorithm produces too few elements it simply adds additional native resolution elements.
- If the algorithm is about to exceed the number of elements, then all remaining elements are aggregated into a single element.

Still need to update the documentation. New clustering section for config.yml is:

# Clustering options
ReducedDimensionStateVector: true
NumberOfElements: 45
ForcedNativeResolutionElements: 
  - [31.5, -104]

laestrada commented 1 year ago

Under the context of this branch, updated the clustering algorithm to address issues with non-contiguous state vector elements.

New layered-kmeans algorithm: This new algorithm maintains the same spirit of Hannah's original version, but lends itself better to automation and minimizes the issues with contiguity. In essence, it successively layers different k-means generated state vector labels starting from high resolution and progressively reducing the resolution of the next layer, assigning the highest sensitivity labels from each layer to the final state vector.

In more detail, the algorithm basics:

Initialize a state vector with all 0 labels in the region of interest. 0 represents a pixel that is yet to be assigned.
Generate list of cluster pairs based on sensitivities of pixels and avg dofs per element. This attempts to get a reasonable guess at the optimal cluster pairings, but it does not take into account spatial proximity, relying on the subsequent steps to ensure contiguity between clusters.
Use the highest resolution aggregation level in the list of cluster pairs to determine the number of clusters to use for kmeans
Use k-means to cluster all 0 labeled pixels using 3 features for labeling designation (lat, lon, sensitivity) and the number of clusters specified in step 3. This results in a single layer with approximately evenly clustered elements.
Using the state vector elements from step 4, calculate the average sensitivity per grid cell for each element.
Assign labels to our actual statevector. We only assign the x highest sensitivity clusters based on the cluster pair selected in step 3.
Repeat from step 3 using the next highest resolution aggregation level and only using unassigned grid cells (0 values)

Other Notes on this new algorithm: By default the lowest aggregation level is a 4x5 degree element and the algorithm "saves" the requisite number of elements needed to accomplish this for the final layer to be applied

It is still possible to get some non-contiguous elements, but only for the lower-resolution elements. In practice, I have not seen this happen in my testing.
It is more computationally expensive than the original simplified version because of the (potentially many) iterations of k-means. It takes ~15 minutes on my local computer to cluster CONUS. To address this I have also included an option to use mini-batch-kmeans, which is very similar to kmeans, but much faster for large datasets with a price of slightly less accuracy. This reduces the time to ~3 minutes for generating the CONUS state vector.
Estimated sensitivities are generated in the same manner as is used in the IMI preview, which takes into account both prior emissions and observation density.

Example generated SV using CONUS ROI:

nicholasbalasus commented 1 year ago

Because we set the HEMCO output to only be END for the preview, will there always be one file here with one time step that corresponds to the average emissions over the entire inversion period?

nicholasbalasus commented 1 year ago

There is a warning for divide by zero here when m = 0, but I'm not sure how we would handle that. It's not causing any issues right now anyway.

nicholasbalasus commented 1 year ago

force_native_res_pixels throws an error if the lat,lon pair is outside the state vector (see here).

nicholasbalasus commented 1 year ago

Some quick notes:

On Cannon, aggregation.py runs out of memory and time (and doesn’t print an error to imi_output.log). You might be addressing this in the feature/memory-settings branch right now, but just something to keep in mind. It also took ~1 hour for the US domain using 8 cores on huce_cascade.
Good catch on forcing it to run the preview if it hasn’t been run yet! Also the way in which you coded find_cluster_pairs is pretty cool and is not code structure I'm used to.
I didn't test it outside the US domain but don't see anything that will go wrong except for if there is a force native resolution outside the user's domain. I'll be trying this for Africa at some point.

Looks really good!

laestrada commented 1 year ago

Thanks for the comments @nicholasbalasus -- I believe I have addressed everything except the out of memory error which will be addressed by feature/memory-settings

djvaron commented 1 year ago

Besides some minor quibbles with the code comments/documentation, this looks good to me!

laestrada commented 1 year ago

All comments have been addressed -- merging!

geoschem / integrated_methane_inversion

Feature/updated-clustering approach #102