Closed JochenSeidel closed 3 months ago
I suggest something like this
def get_duplicate_coordinates(ds, keep='first'):
    df = ds.to_dataframe()  # maybe needs some additional args, I do not know...
    # maybe check if we use a CML (line) or PWS dataset, to be discussed
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df.duplicated(subset=['lon', 'lat'], keep=keep)
    # maybe we could transform the output into a xr.DataArray with the correct dimension
    # to make it 100% clear what is what
This is just written off the top of my head, without testing any of it, but I hope it gets the idea across.
@JochenSeidel Would that fit?
Yes, thanks, this sounds good! Two aspects for further discussion:
1) Should we transform/backtransform from xarray to pandas (which is tempting but might also bear some risks...) or try to figure out something that works in xarray directly?
2) I think it's safe to set `keep=False` as the default for our purposes, because then all duplicate coordinates are returned (otherwise the first or last duplicate is kept).
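To illustrate the difference between the `keep` options, here is a small example (station IDs and coordinate values are invented):

```python
import pandas as pd

# invented table with two stations at identical coordinates
df = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'lon': [8.0, 8.0, 9.5],
    'lat': [50.0, 50.0, 51.2],
})

# keep='first' marks only the later occurrences as duplicates
print(df.duplicated(subset=['lon', 'lat'], keep='first').tolist())  # → [False, True, False]
# keep=False marks every row that shares coordinates with another row
print(df.duplicated(subset=['lon', 'lat'], keep=False).tolist())    # → [True, True, False]
```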
Here's something that seems to work. I added an option to explicitly select the coordinates and the id dimension by name if a dataset does not follow our OS convention. In the end this is a 2-liner, does this justify a function?
def get_duplicate_coordinates(ds, id='id', keep=False, lat='lat', lon='lon'):
    # lat and lon can be set if a dataset does not follow our OS convention
    df = ds.id.to_dataframe()  # assumes that there is an 'id' dimension
    # maybe check if we use a CML (line) or PWS dataset, to be discussed
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df.duplicated(subset=[lon, lat], keep=keep)
    # maybe we could transform the output into a xr.DataArray with the correct dimension
    # to make it 100% clear what is what
In the end this is a 2-liner, does this justify a function?
Yes, that is a good question. It could even fit on one line... Hence, I am not sure how to proceed. Maybe adding this to notebook with "xarray and pandas recipes" is the best option. Maybe just discuss that during the next meetings while I am away.
I added an option to explicitly select the coordinates and id dimension
Please note that you are not using the variable `id` from your argument inside the function. You would have to do something like
def foo(ds, id_var_name='id'):
    df = ds[id_var_name].to_dataframe()
    # blabla...
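Putting the naming fix together, a corrected sketch of the full function could look like the one below. This is untested speculation on my part: it builds the DataFrame directly from the coordinate values instead of via `to_dataframe()`, and wraps the result in an `xr.DataArray` along the id dimension so it is 100% clear what is what. All parameter names are assumptions, not settled API.

```python
import pandas as pd
import xarray as xr

def get_duplicate_coordinates(ds, id_dim='id', lon='lon', lat='lat', keep=False):
    # build a plain DataFrame from the coordinate values; lon/lat names
    # can be overridden for datasets that do not follow the convention
    df = pd.DataFrame({lon: ds[lon].values, lat: ds[lat].values})
    dup = df.duplicated(keep=keep)
    # wrap the boolean result in a DataArray along the id dimension so it
    # can be used directly for indexing the original dataset
    return xr.DataArray(dup.values, dims=id_dim, coords={id_dim: ds[id_dim].values})
```

With `keep=False` this flags every station that shares coordinates with another, so something like `ds.isel(id=~dup.values)` should drop all of them.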
Thanks for pointing this out. I'm also away next week, let's see what happens in the next meeting...
@JochenSeidel I think we can close this because you show the code to do it in #55. If so, please close.
I'd like to follow up on this discussion from pypwsqc as this is also relevant for CMLs.
We should implement a function to identify duplicated coordinates. This is an issue with Netatmo PWS when users do not set the correct location of their PWS in the web interface. This happens quite frequently in PWS data, and all PWS with duplicate coordinates have to be assumed to be placed incorrectly; in this case these PWS need to be removed. For CMLs, it might be of interest to find, for example, multiple CMLs on one tower; in this case the IDs of these CMLs should be returned. So far, the easiest option to identify duplicated coordinates is
pandas.DataFrame.duplicated
but apparently there's nothing similar available in xarray
... Therefore I suggest writing a function that identifies duplicated coordinates and either returns indices or discards these entries from a dataset.
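To make the intended behaviour concrete, here is a sketch with made-up PWS metadata that returns the IDs of all stations involved in a coordinate clash (the station names and coordinates are invented):

```python
import pandas as pd

# invented PWS metadata: pws1 and pws2 share the same (wrong) location
meta = pd.DataFrame({
    'id': ['pws1', 'pws2', 'pws3', 'pws4'],
    'lon': [8.68, 8.68, 13.40, 11.57],
    'lat': [50.11, 50.11, 52.52, 48.14],
})

# keep=False flags every station involved in a coordinate clash,
# so their IDs can be reported or used to drop them from the dataset
dup_ids = meta.loc[meta.duplicated(subset=['lon', 'lat'], keep=False), 'id']
print(dup_ids.tolist())  # → ['pws1', 'pws2']
```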