GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License
747 stars 216 forks source link

pygmt.grdtrack: Bad behaviour when dataframe index isn't default #2763

Open fedesuad opened 11 months ago

fedesuad commented 11 months ago

Description of the problem

When I try to use pygmt.grdtrack on a dataframe that doesn't have default indexing (0 through len(df)), like a slice from an original dataframe, it doesn't work properly. For what I can gather, it works up until the index that coincides with the length of the dataframe, so if the index goes higher than that it starts giving out NaN values.

Minimal Complete Verifiable Example

import pandas as pd
import numpy as np
import pygmt

df = pd.DataFrame(columns=['lon','lat','height'])
df['lon']=np.random.default_rng().uniform(low=-74,high=-71,size=[5000])
df['lat']=np.random.default_rng().uniform(low=-51,high=-46,size=[5000])
df['height']=np.random.default_rng().uniform(low=0,high=3000,size=[5000])

spacing=0.1 #degrees
region=[-74.0, -71.0, -51.0, -46.0]

grd=pygmt.blockmean(data=df, region=region, spacing=spacing)
grd=pygmt.xyz2grd(data=grd, region=region, spacing=spacing)

over_1500=df[df['height']>1500].copy()

over_1500[['lon','lat','meanheight']]=pygmt.grdtrack(
    grid=grd,points=over_1500[['lon','lat']],
    newcolname="meanheight")

print(over_1500)

over_1500=df[df['height']>1500].copy()
over_1500.reset_index(drop=True,inplace=True)
over_1500[['lon','lat','meanheight']]=pygmt.grdtrack(
    grid=grd,points=over_1500[['lon','lat']],
    newcolname="meanheight")
print(over_1500)

Full error message

No error message but a DataFrame with NaN values

            lon        lat       height   meanheight
0    -71.931949 -47.668331  2589.084799  1298.987358
3    -73.828530 -49.563440  1939.247423  1407.897766
4    -72.569494 -49.096500  1676.786269  1716.785175
5    -72.487901 -50.849346  2638.614151  1758.230110
6    -72.927679 -48.920109  1920.867327  1772.126105
        ...        ...          ...          ...
4988        NaN        NaN  2096.621535          NaN
4991        NaN        NaN  1779.764729          NaN
4992        NaN        NaN  2378.422848          NaN
4994        NaN        NaN  1908.190985          NaN
4995        NaN        NaN  2744.655240          NaN

System information

PyGMT information:
  version: v0.10.0
System information:
  python: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)  [GCC 11.3.0]
  executable: /home/fedesuad/mambaforge/envs/cin2020/bin/python
  machine: Linux-5.4.0-73-generic-x86_64-with-glibc2.31
Dependency information:
  numpy: 1.23.5
  pandas: 1.3.4
  xarray: 0.19.0
  netCDF4: 1.6.4
  packaging: 23.2
  contextily: None
  geopandas: None
  IPython: 8.14.0
  rioxarray: None
  ghostscript: 9.54.0
GMT library information:
  binary version: 6.4.0
  cores: 8
  grid layout: rows
  image layout: 
  library path: /home/fedesuad/mambaforge/envs/cin2020/lib/libgmt.so
  padding: 2
  plugin dir: /home/fedesuad/mambaforge/envs/cin2020/lib/gmt/plugins
  share dir: /home/fedesuad/mambaforge/envs/cin2020/share/gmt
  version: 6.4.0
welcome[bot] commented 11 months ago

👋 Thanks for opening your first issue here! Please make sure you filled out the template with as much detail as possible. You might also want to take a look at our contributing guidelines and code of conduct.

seisman commented 11 months ago

@fedesuad Thanks for your detailed report. I can reproduce the issue with your example script.

The issue is because, currently PyGMT writes the grdtrack output into a CSV file, and then calls pd.read_csv to read the output into a pd.DataFrame object.

https://github.com/GenericMappingTools/pygmt/blob/f828bc530a49c363ecc088101791b61583115e1e/pygmt/src/grdtrack.py#L314

For the returned pd.DataFrame object, the index defaults to 0-len(df), but you're expecting it to have the same index as the input pd.DataFame object (the points parameter).

We have many options for the index of the returned pd.DataFrame object

  1. Always use 0-based index
  2. Use the index of the input points parameter if points is pd.DataFrame, but use 0-based index for other input types (e.g., a file or a 2D array)
  3. Add a new option to allow users pass the index (similar to the newcolname parameter)

@fedesuad @GenericMappingTools/pygmt-maintainers What do you think?

fedesuad commented 11 months ago

Thanks for the response @seisman ! Option 2 seems the most intuitive to me, at least for my applications.

yvonnefroehlich commented 10 months ago

I am sorry for being late here!

For the points parameter, I feel it's fair to request that it is possible to account for the index of the single records (rows) in case of a pd.DataFrame: For a pd.DataFrame, the index is not only for counting, but it represents a record uniquely and remains unchanged when building a subset. This is different from an array, e.g., row 4 will become row 3 when excluding or removing row 3. Thus, options 2 and 3 appear good to me.

Hm. Currently, I have some preference for option 3. The new parameter would probably be optional, or? So, by default, we could keep the zero-based indexing. This would not affect existing code. And maybe it would be easier for users with less experience with pandas and pd.DataFrames? If users want to use a specific index, they can pass it to the new parameter. If I understood it correctly, for option 3, the code would change to something like this (I just used new_index as name for the new parameter):

over_1500=df[df['height']>1500].copy()

over_1500[['lon', 'lat', 'meanheight']] = pygmt.grdtrack(
    grid=grd,
    points=over_1500[['lon', 'lat']],
    newcolname="meanheight",
    # New parameter to pass index which should be used for the output dataframe,
    # instead of the default zero-based indexing
    new_index=over_1500.index,
)
seisman commented 10 months ago

Hm. Currently, I have some preference for option 3. The new parameter would probably be optional, or? So, by default, we could keep the zero-based indexing. This would not affect existing code. And maybe it would be easier for users with less experience with pandas and pd.DataFrames? If users want to use a specific index, they can pass it to the new parameter.

I also prefer to option 3. Option 2 is more intuitive for grdtrack but may make no sense for other modules. I think we should keep the behavior consistent in all functions that output a table.

Better to address this feature request and other grdtrack issues after PR https://github.com/GenericMappingTools/pygmt/pull/2729.