fzhu2e / cfr

A Python package for Climate Field Reconstruction
https://fzhu2e.github.io/cfr
BSD 3-Clause "New" or "Revised" License
13 stars 6 forks source link

find and remove proxy duplicates #3

Closed CommonClimate closed 2 years ago

CommonClimate commented 2 years ago

the graphem bug with singular matrix illustrated the perils of having duplicates in the proxy matrix. I was originally thinking that it would only be an issue for graphem, and therefore should be dealt with within prep_graphem(), but I now believe it needs to be done earlier in the workflow.

Here is my proposal:

I believe this will be cleanest and most transparent, as the users will have to make careful, explicit decisions. Now that I think of it, we had to do this a lot as part of PAGES 2k ca 2015-2016, because several groups had included the same proxy series, or several slightly different versions of the same proxy. I bet this will be helpful for CoralHydro2k as well. And it will come in handy when merging two databases that have potential duplicates. So overall a very useful feature that will serve for both pseudo- and real proxy recons.

fzhu2e commented 2 years ago

The proposal is implemented in 9d57034.

This notebook shows the new methods: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/proxy-dups.ipynb

I also renamed to original notebook that contains the debug cells to graphem-ppe-pages2k-debug.ipynb: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/graphem-ppe-pages2k-debug.ipynb

A notebook with the original name is updated with the correct pseudoproxy dataset: https://github.com/fzhu2e/cfr/blob/graphem_redesign/docsrc/notebooks/graphem-ppe-pages2k.ipynb

CommonClimate commented 2 years ago

Excellent work! I made a few edits to proxy-dups.ipynb in dd355d3824c597ca934fc093f88058b904f12264, and it led to this one small suggestion: I like for the default behavior of pdb_dups.plot() to plot the map and the list of duplicates, but it would be good to allow users to loop through the duplicate cases using a simple index instead of putting in the record label. For instance, proxy-dups.ipynb identifies 20 cases of duplicates, and it would be great to be able to loop through the indices, so pdb_dups[3].plot_dups() produced the same result as pdb_dups.records['Ocn_093'].plot_dups() . Does that make sense? iterating over dictionary keys, or labels, is not everyone’s favorite method, and i feel like it would also be valuable to let people iterate over indices. In this case, I can see there are 20 cases to deal with, so I can simply iterate over those 20 indices and resolve as I go. Note that for real-world cases, the proxy plots might need an option to incorporate more metadata than just the label (e.g. publication), but we can deal with that later.

fzhu2e commented 2 years ago

Thanks for the insightful suggestion!

I made the updates and the graphem_redesign brach is now merged into main. See the section Slice a ProxyDatabase by index in this notebook: https://github.com/fzhu2e/cfr/blob/main/docsrc/notebooks/proxy-ops.ipynb Now a ProxyDatabase supports three ways of subscripting:

  1. by a int, e.g., pdb[1], returns a ProxyRecord
  2. by a str, e.g., pdb['Ocn_065'], returns a ProxyRecord
  3. by a slice, e.g., pdb[2:5], returns a ProxyDatabase

Will need to refresh the notebooks for the paper later. Closing this issue now. Please reopen if needed.