find and remove proxy duplicates

CommonClimate commented 2 years ago

The GraphEM bug with the singular matrix illustrated the perils of having duplicates in the proxy matrix. I originally thought this would only be an issue for GraphEM, and should therefore be dealt with within prep_graphem(), but I now believe it needs to be done earlier in the workflow.

Here is my proposal:

Create a new method in the ProxyDatabase class called find_duplicates(), governed by a parameter called r_thresh (default = 0.9). Within that function, compute R = np.triu(np.corrcoef(proxy.T), k=1) (where proxy is the proxy matrix, one record per column) and find the indices/labels of the records for which R > r_thresh.
Offer the option to visualize those potential duplicates by repurposing ProxyRecord.plot() and plotting the two close series in the same Axes object (different colors and/or line styles, whatever works best to tell them apart).
Ask the user which ones they want to remove, and add those indices/labels to a list.
At the end, bundle those records into a "db_to_remove" instance of ProxyDatabase, so the user can subtract those proxies from the original database using the "-" syntax. (Don't do it for them, though; this must be an explicit part of the workflow so they can remember that they did it.)
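The correlation step in the first item can be sketched as follows. This is only an illustration of the idea, not the cfr implementation: the function name find_duplicate_pairs, the synthetic data, and the labels "A", "B", "C" are all made up for the example, and it assumes the proxy matrix has no missing values (np.corrcoef returns NaN for columns that contain them).

```python
import numpy as np

def find_duplicate_pairs(proxy, labels, r_thresh=0.9):
    """Return (label_i, label_j, r) for record pairs with correlation above r_thresh.

    proxy  : 2-D array of shape (n_time, n_records), one record per column
    labels : sequence of record labels, one per column
    """
    # np.corrcoef treats rows as variables, hence the transpose.
    # np.triu(..., k=1) zeroes the diagonal and lower triangle,
    # so each pair is reported once and no record matches itself.
    R = np.triu(np.corrcoef(proxy.T), k=1)
    rows, cols = np.where(R > r_thresh)
    return [(labels[i], labels[j], float(R[i, j])) for i, j in zip(rows, cols)]

# Synthetic example: record "C" is a slightly perturbed copy of record "A"
rng = np.random.default_rng(0)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
c = a + 0.01 * rng.standard_normal(200)
proxy = np.column_stack([a, b, c])

pairs = find_duplicate_pairs(proxy, ["A", "B", "C"])
print(pairs)  # one pair is flagged: ("A", "C", r close to 1)
```

Note that R > r_thresh only catches strongly positive correlations; a sign-flipped copy of a record would need a test on the absolute value instead.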

I believe this will be the cleanest and most transparent approach, as users will have to make careful, explicit decisions. Now that I think of it, we had to do this a lot as part of PAGES 2k circa 2015-2016, because several groups had included the same proxy series, or several slightly different versions of the same proxy. I bet this will be helpful for CoralHydro2k as well. And it will come in handy when merging two databases that have potential duplicates. So overall, a very useful feature that will serve both pseudoproxy and real-proxy reconstructions.

fzhu2e commented 2 years ago

The proposal is implemented in 9d57034.

This notebook shows the new methods: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/proxy-dups.ipynb

The ProxyDatabase.find_duplicates() is implemented exactly as proposed. It will list the groups of duplicates.
The visualization method ProxyRecord.plot_dups() is added to plot duplicates.
The message printed by pdb_dups = ProxyDatabase.find_duplicates() reminds users to call pdb_to_keep = pdb_dups.squeeze_dups(pids_to_keep=pid_list) to keep only one record from each group of duplicates.
Then with job.proxydb = job.proxydb - pdb_dups + pdb_to_keep, we get a cleaned database without any duplicates.
With the cleaned database, the GraphEM test passes.
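To make the semantics of job.proxydb - pdb_dups + pdb_to_keep concrete, here is a toy sketch of how "-" and "+" could behave on a database keyed by record ID. ToyDatabase and its keep() method are hypothetical stand-ins written for this illustration, not cfr classes, and the labels Ocn_094 and Arc_001 are invented.

```python
class ToyDatabase:
    """Hypothetical stand-in for a proxy database: a dict of pid -> record."""

    def __init__(self, records=None):
        self.records = dict(records or {})

    def __sub__(self, other):
        # Drop every record whose pid also appears in `other`
        return ToyDatabase(
            {pid: rec for pid, rec in self.records.items() if pid not in other.records}
        )

    def __add__(self, other):
        # Merge the two databases; on a pid collision the record from `other` wins
        return ToyDatabase({**self.records, **other.records})

    def keep(self, pids_to_keep):
        # Hypothetical analogue of keeping one record per group of duplicates
        return ToyDatabase({pid: self.records[pid] for pid in pids_to_keep})


pdb = ToyDatabase({"Ocn_065": 1, "Ocn_093": 2, "Ocn_094": 3, "Arc_001": 4})
pdb_dups = ToyDatabase({"Ocn_093": 2, "Ocn_094": 3})  # one group of duplicates
pdb_to_keep = pdb_dups.keep(["Ocn_093"])              # the user's explicit choice

# Evaluated left to right: remove all duplicates, then add back the chosen ones
cleaned = pdb - pdb_dups + pdb_to_keep
print(sorted(cleaned.records))  # ['Arc_001', 'Ocn_065', 'Ocn_093']
```

Because the subtraction and addition are written out by the user, the cleaning step stays visible in the workflow, which is the point of the original proposal.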

I also renamed the original notebook that contains the debug cells to graphem-ppe-pages2k-debug.ipynb: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/graphem-ppe-pages2k-debug.ipynb

A notebook with the original name is updated with the correct pseudoproxy dataset: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/graphem-ppe-pages2k.ipynb

CommonClimate commented 2 years ago

Excellent work! I made a few edits to proxy-dups.ipynb in dd355d3824c597ca934fc093f88058b904f12264, and they led to one small suggestion. I like that the default behavior of pdb_dups.plot() is to plot the map and the list of duplicates, but it would be good to let users loop through the duplicate cases using a simple index instead of typing in the record label. For instance, proxy-dups.ipynb identifies 20 cases of duplicates, and it would be great to be able to loop through the indices, so that pdb_dups[3].plot_dups() produces the same result as pdb_dups.records['Ocn_093'].plot_dups(). Does that make sense? Iterating over dictionary keys, or labels, is not everyone's favorite method, and I feel it would also be valuable to let people iterate over indices. In this case, I can see there are 20 cases to deal with, so I can simply iterate over those 20 indices and resolve them as I go. Note that for real-world cases, the proxy plots might need an option to incorporate more metadata than just the label (e.g. publication), but we can deal with that later.

fzhu2e commented 2 years ago

Thanks for the insightful suggestion!

I made the updates, and the graphem_redesign branch is now merged into main. See the section "Slice a ProxyDatabase by index" in this notebook: https://github.com/fzhu2e/cfr/blob/main/docsrc/notebooks/proxy-ops.ipynb

A ProxyDatabase now supports three ways of subscripting:

by an int, e.g., pdb[1], returns a ProxyRecord
by a str, e.g., pdb['Ocn_065'], returns a ProxyRecord
by a slice, e.g., pdb[2:5], returns a ProxyDatabase
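For readers curious how one object can accept all three kinds of subscript, the following is a minimal sketch of a __getitem__ that dispatches on the key type. IndexableDatabase and ToyRecord are hypothetical classes written for this example and do not reproduce the cfr source; the labels other than Ocn_065 and Ocn_093 are invented.

```python
class ToyRecord:
    """Hypothetical stand-in for a proxy record; it only carries a label."""

    def __init__(self, pid):
        self.pid = pid


class IndexableDatabase:
    """Hypothetical database that can be subscripted by int, str, or slice."""

    def __init__(self, records):
        # dicts preserve insertion order, so positional indexing is well defined
        self.records = {rec.pid: rec for rec in records}

    def __len__(self):
        return len(self.records)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.records[key]             # by label -> a single record
        ordered = list(self.records.values())
        if isinstance(key, slice):
            return IndexableDatabase(ordered[key])  # by slice -> a smaller database
        if isinstance(key, int):
            return ordered[key]                  # by position -> a single record
        raise TypeError(f"unsupported key type: {type(key).__name__}")


pdb = IndexableDatabase(
    [ToyRecord(pid) for pid in ["Ocn_061", "Ocn_065", "Ocn_093", "Ocn_101"]]
)

print(pdb[1].pid)                  # Ocn_065
print(pdb["Ocn_065"] is pdb[1])    # True
print(len(pdb[1:3]))               # 2

# Looping over positions, as suggested earlier in the thread
for i in range(len(pdb)):
    print(i, pdb[i].pid)
```

With positional access in place, working through a fixed number of duplicate cases becomes a plain loop over range(len(pdb)), with no need to know the labels in advance.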

Will need to refresh the notebooks for the paper later. Closing this issue now. Please reopen if needed.

fzhu2e / cfr

find and remove proxy duplicates #3