GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License
747 stars 216 forks source link

pygmt.grdtrack: Wrong column names of output pandas DataFrame #2703

Open yvonnefroehlich opened 11 months ago

yvonnefroehlich commented 11 months ago

Description of the problem

This is related to the GMT forum post at https://forum.generic-mapping-tools.org/t/type-error-with-pygmt-grdtrack-possible-bug-or-intended/4286.

If the input pandas DataFrame contains an additional column with datetime objects, pygmt.grdtrack mixes up the column names of the output pandas Dataframe.

Maybe this is partly related to the issue #2463?

Minimal Complete Verifiable Example

import pygmt
import pandas as pd
import numpy as np
import datetime

np.random.seed(100)

# Make random test data
df = pd.DataFrame(columns=["lon", "lat", "trash", "trashdate"])
df["lon"] = 50 + 2 * np.random.randn(100)
df["lat"] = 55 + 4 * np.random.randn(100)
df["trash"] = 53 + 5 * np.random.randn(100)
# Correct column names after applying grdtrack
# df["trashdate"] = 51 + 7 * np.random.randn(100)
# Mixed up column names after applying grdtrack
df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)

# Make mesh
spacing = 0.5
region = [min(df.lon), max(df.lon), min(df.lat), max(df.lat)]
n_lon = round((region[1] - region[0]) / spacing) + 1
n_lat = round((region[3] - region[2]) / spacing) + 1
lat, lon = np.mgrid[region[2]:region[3]:n_lat*1j, region[0]:region[1]:n_lon*1j]

# Make GMT ready grid with random z values
pygmt.xyz2grd(
    x=lon.flatten(),
    y=lat.flatten(),
    z=50 + 1 * np.random.randn(len(lon.flatten())),
    region=region,
    spacing=spacing,
    outgrid="test.grd",
)

# Extrakt the z values of the grid at the points stored in the pandas.DataFrame
# and write them to the new column "z_grid"
df_new = pygmt.grdtrack(
    grid="test.grd",
    points=df,
    newcolname="z_grid",
)

Compare the pandas DataFrames:

df
          lon        lat      trash   trashdate
0   46.500469  48.181395  56.022118  1994-11-01
1   50.685361  50.454956  48.464848  1998-02-18
2   52.306072  43.106738  55.960116  2019-08-08
3   49.495128  55.133269  50.814678  2020-09-02
4   51.962642  54.004445  53.508879  2009-02-25
..        ...        ...        ...         ...
95  50.006035  57.596692  48.621346  1999-09-30
96  49.847953  54.300874  53.588042  1984-11-06
97  50.007915  59.069057  48.209158  2003-09-13
98  49.629972  52.600068  43.594427  1996-12-07
99  45.025697  61.304667  53.911714  1990-08-08

[100 rows x 4 columns]

and

df_new
          lon        lat      trash  trashdate      z_grid
0   46.500469  48.181395  56.022118  50.241717  1994-11-01
1   50.685361  50.454956  48.464848  50.100797  1998-02-18
2   52.306072  43.106738  55.960116  50.701331  2019-08-08
3   49.495128  55.133269  50.814678  48.872544  2020-09-02
4   51.962642  54.004445  53.508879  50.619094  2009-02-25
..        ...        ...        ...        ...         ...
95  50.006035  57.596692  48.621346  50.866452  1999-09-30
96  49.847953  54.300874  53.588042  50.468180  1984-11-06
97  50.007915  59.069057  48.209158  49.848821  2003-09-13
98  49.629972  52.600068  43.594427  50.984231  1996-12-07
99  45.025697  61.304667  53.911714  50.887035  1990-08-08

[100 rows x 5 columns]

Full error message

No error message occurs, but the column names after applying grdtrack are mixed up in case the pandas DataFrame contains a column with datetime objects.

System information

PyGMT information:
  version: v0.10.1.dev25
System information:
  python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
  executable: C:\ProgramData\Anaconda3\envs\pygmt_env_dev\python.exe
  machine: Windows-10-10.0.19045-SP0
Dependency information:
  numpy: 1.24.3
  pandas: 2.0.2
  xarray: 2023.1.1.dev17
  netCDF4: 1.6.2
  packaging: 23.1
  contextily: 1.3.0
  geopandas: 0.13.2
  IPython: 8.14.0
  rioxarray: 0.14.1
  ghostscript: 9.54.0
GMT library information:
  binary version: 6.4.0
  cores: 4
  grid layout: rows
  image layout: 
  library path: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt.dll
  padding: 2
  plugin dir: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt_plugins
  share dir: C:/Program Files (x86)/gmt6/share
  version: 6.4.0
seisman commented 11 months ago

The following two GMT commands explain the behavior of the GMT's grdtrack module:

$ echo 46  48  56  1994-11-01 | gmt grdtrack -G@earth_relief_01d_g    
46  48  56  1994-11-01T00:00:00 -2

$ echo 46  48  56  TEXT | gmt grdtrack -G@earth_relief_01d_g
46  48  56  -2  TEXT

In the first case, the z value (-2) is appended as the last column, while in the 2nd case, the z value is inserted at the 3rd case (column number starting from 0) and the text column is appended as the last column. The philosophy of the GMT I/O mechanism is always putting text at the end of a row.

In your example, the df["trashdate"] has an object dtype, which is recognized as a string column, rather then a datetime column.

df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)

To fix the issue, you need to explicitly convert the column to a datetime column instead. For example:

df["trashdate"] = (51 + 7 * np.random.randn(100)) \
    * datetime.timedelta(seconds=3e7) + datetime.date(1950, 1, 1)
df["trashdate"] = df["trashdate"].astype("datetime64[ms]")