Closed JochenSeidel closed 3 months ago
I suggest something like this
def get_duplicate_coordinates(ds, keep='first'):
    df = ds.to_dataframe()  # maybe needs some additional args, I do not know...
    # maybe check if we use a CML (line) or PWS dataset, to be discussed
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df.duplicated(subset=['lon', 'lat'], keep=keep)
    # maybe we could transform the output into a xr.DataArray with the correct dimension
    # to make it 100% clear what is what
This is just written off the top of my head, without testing any of it, but I hope it gets the idea across.
@JochenSeidel Would that fit?
Yes, thanks, this sounds good! Two aspects for further discussion:
1) Should we transform/backtransform from xarray to pandas (which is tempting but might also bear some risks...) or try to figure out something that works in xarray directly?
2) I think it's safe to set `keep=False` as the default for our purposes, because then all duplicate coordinates are returned (otherwise the first or last duplicate is kept).
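To illustrate the difference between the `keep` options, here is a small example (station IDs and coordinate values are invented):

```python
import pandas as pd

# invented table with two stations at identical coordinates
df = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'lon': [8.0, 8.0, 9.5],
    'lat': [50.0, 50.0, 51.2],
})

# keep='first' marks only the later occurrences as duplicates
print(df.duplicated(subset=['lon', 'lat'], keep='first').tolist())  # → [False, True, False]
# keep=False marks every row that shares coordinates with another row
print(df.duplicated(subset=['lon', 'lat'], keep=False).tolist())    # → [True, True, False]
```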
Here's something that seems to work. I added an option to explicitly select the coordinates and the id dimension by name if a dataset does not follow our OS convention. In the end this is a 2-liner, does this justify a function?
def get_duplicate_coordinates(ds, id='id', keep=False, lat='lat', lon='lon'):
    # lat and lon can be set if a dataset does not follow our OS convention
    df = ds.id.to_dataframe()  # assumes that there is an 'id' dimension
    # maybe check if we use a CML (line) or PWS dataset, to be discussed
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df.duplicated(subset=[lon, lat], keep=keep)
    # maybe we could transform the output into a xr.DataArray with the correct dimension
    # to make it 100% clear what is what
In the end this is a 2-liner, does this justify a function?
Yes, that is a good question. It could even fit on one line... Hence, I am not sure how to proceed. Maybe adding this to notebook with "xarray and pandas recipes" is the best option. Maybe just discuss that during the next meetings while I am away.
I added an option to explicitly select the coordinates and id dimension
Please note that you are not using the variable `id` from your argument inside the function. You would have to do something like
def foo(ds, id_var_name='id'):
    df = ds[id_var_name].to_dataframe()
    # blabla...
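Putting the naming fix together, a corrected sketch of the full function could look like the one below. This is untested speculation on my part: it builds the DataFrame directly from the coordinate values instead of via `to_dataframe()`, and wraps the result in an `xr.DataArray` along the id dimension so it is 100% clear what is what. All parameter names are assumptions, not settled API.

```python
import pandas as pd
import xarray as xr

def get_duplicate_coordinates(ds, id_dim='id', lon='lon', lat='lat', keep=False):
    # build a plain DataFrame from the coordinate values; lon/lat names
    # can be overridden for datasets that do not follow the convention
    df = pd.DataFrame({lon: ds[lon].values, lat: ds[lat].values})
    dup = df.duplicated(keep=keep)
    # wrap the boolean result in a DataArray along the id dimension so it
    # can be used directly for indexing the original dataset
    return xr.DataArray(dup.values, dims=id_dim, coords={id_dim: ds[id_dim].values})
```

With `keep=False` this flags every station that shares coordinates with another, so something like `ds.isel(id=~dup.values)` should drop all of them.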
Thanks for pointing this out. I'm also away next week, let's see what happens in the next meeting...
@JochenSeidel I think we can close this because you show the code to do it in #55. If so, please close.
I'd like to follow up on this discussion from pypwsqc as this is also relevant for CMLs.
We should implement a function to identify duplicated coordinates. This is an issue with Netatmo PWS when users do not set the correct location of their PWS in the web interface. This happens quite frequently in PWS data, and all PWS with duplicate coordinates have to be assumed to be placed incorrectly; in this case these PWS need to be removed. For CMLs, it might be of interest to find, for example, multiple CMLs on one tower; in this case the IDs of these CMLs should be returned. So far, the easiest option to identify duplicated coordinates is
pandas.DataFrame.duplicated
but apparently there's nothing similar available in xarray
... Therefore I suggest writing a function that identifies duplicated coordinates and either returns indices or discards these entries from a dataset.
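To make the intended behaviour concrete, here is a sketch with made-up PWS metadata that returns the IDs of all stations involved in a coordinate clash (the station names and coordinates are invented):

```python
import pandas as pd

# invented PWS metadata: pws1 and pws2 share the same (wrong) location
meta = pd.DataFrame({
    'id': ['pws1', 'pws2', 'pws3', 'pws4'],
    'lon': [8.68, 8.68, 13.40, 11.57],
    'lat': [50.11, 50.11, 52.52, 48.14],
})

# keep=False flags every station involved in a coordinate clash,
# so their IDs can be reported or used to drop them from the dataset
dup_ids = meta.loc[meta.duplicated(subset=['lon', 'lat'], keep=False), 'id']
print(dup_ids.tolist())  # → ['pws1', 'pws2']
```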