geoschem / integrated_methane_inversion

Integrated Methane Inversion workflow repository.
https://imi.readthedocs.org
MIT License
26 stars 23 forks source link

Feature/updated-clustering approach #102

Closed laestrada closed 1 year ago

laestrada commented 1 year ago

This PR automatically generates the clustering pairs using the following algorithm:

  1. find the average sensitivity/ statevector element using the total estimated dofs/ desired number of elements. this is the threshold used for each element.
  2. sort the estimated sensitivities in descending order
  3. group the largest sensitivities until adding the next element would exceed the threshold. This is a cluster pair.
  4. remove the selected elements from the sensitivity list
  5. iterate through 3 and 4 until no elements are left This is the basic idea but it also covers some basic edge cases like:
    • If the sensitivity of a single element is greater than the threshold, then the element is added at native resolution.
    • If the algorithm produces too few elements it simply adds additional native resolution elements.
    • If the algorithm is about to exceed the number of elements, then all remaining elements are aggregated into a single element.

Still need to update the documentation. New clustering section for config.yml is:

# Clustering options
ReducedDimensionStateVector: true
NumberOfElements: 45
ForcedNativeResolutionElements: 
  - [31.5, -104]
laestrada commented 1 year ago

Under the context of this branch, updated the clustering algorithm to address issues with non-contiguous state vector elements.

New layered-kmeans algorithm: This new algorithm maintains the same spirit of Hannah's original version, but lends itself better to automation and minimizes the issues with contiguity. In essence, it successively layers different k-means generated state vector labels starting from high resolution and progressively reducing the resolution of the next layer, assigning the highest sensitivity labels from each layer to the final state vector.

In more detail, the algorithm basics:

  1. Initialize a state vector with all 0 labels in the region of interest. 0 represents a pixel that is yet to be assigned.

  2. Generate list of cluster pairs based on sensitivities of pixels and avg dofs per element. This attempts to get a reasonable guess at the optimal cluster pairings, but it does not take into account spatial proximity, relying on the subsequent steps to ensure contiguity between clusters.

  3. Use the highest resolution aggregation level in the list of cluster pairs to determine the number of clusters to use for kmeans

  4. Use k-means to cluster all 0 labeled pixels using 3 features for labeling designation (lat, lon, sensitivity) and the number of clusters specified in step 3. This results in a single layer with approximately evenly clustered elements.

  5. Using the state vector elements from step 4, calculate the average sensitivity per grid cell for each element.

  6. Assign labels to our actual statevector. We only assign the x highest sensitivity clusters based on the cluster pair selected in step 3.

  7. Repeat from step 3 using the next highest resolution aggregation level and only using unassigned grid cells (0 values)

Other Notes on this new algorithm: By default the lowest aggregation level is a 4x5 degree element and the algorithm "saves" the requisite number of elements needed to accomplish this for the final layer to be applied

Example generated SV using CONUS ROI: image

nicholasbalasus commented 1 year ago

Because we set the HEMCO output to only be END for the preview, will there always be one file here with one time step that corresponds to the average emissions over the entire inversion period?

nicholasbalasus commented 1 year ago

There is a warning for divide by zero here when m = 0, but I'm not sure how we would handle that. It's not causing any issues right now anyway.

nicholasbalasus commented 1 year ago

force_native_res_pixels throws an error if the lat,lon pair is outside the state vector (see here).

nicholasbalasus commented 1 year ago

Some quick notes:

Looks really good!

laestrada commented 1 year ago

Thanks for the comments @nicholasbalasus -- I believe I have addressed everything except the out of memory error which will be addressed by feature/memory-settings

djvaron commented 1 year ago

Besides some minor quibbles with the code comments/documentation, this looks good to me!

laestrada commented 1 year ago

All comments have been addressed -- merging!