GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License

Figure.plot: Crash for pd.DataFrame input containing floats when using "data" and "incols" #2637

Open yvonnefroehlich opened 1 year ago

yvonnefroehlich commented 1 year ago

Description of the problem

Under specific circumstances, Figure.plot crashes if a pandas.DataFrame is passed to the data parameter and columns are selected via incols. The issue does not occur if the pd.DataFrame contains only integers. If the desired columns are passed directly to the x and y parameters, the code works. For me, this issue occurs under Windows but not under Linux.
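As a possible workaround, the column selection can be done in pandas instead of via incols, so that the values can be handed over through x and y. This is only a minimal sketch: the helper name xy_from_columns is hypothetical, and the positions 2 and 3 mirror the incols=[2, 3] used in the example further down.

```python
import pandas as pd


def xy_from_columns(df, incols):
    """Return the two columns at positions `incols` as float Series (x, y)."""
    # Positional selection, mirroring what incols=[2, 3] asks GMT to do.
    selected = df.iloc[:, list(incols)].astype(float)
    return selected.iloc[:, 0], selected.iloc[:, 1]


test_df = pd.DataFrame(
    {
        "a": [2, 2, 2, 2],
        "z": [8, 6, 7, 3],
        "x": [-3, -1, 1, 3],
        "y": [2, 2, 2, 2],
    }
)
x, y = xy_from_columns(test_df, incols=[2, 3])

# With PyGMT installed, the Series can then be passed directly:
#     fig.plot(x=x, y=y)
```

The selection is positional on purpose, so it keeps working if the column names change but the column order does not.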

For context, see this comment on PR #2515: https://github.com/GenericMappingTools/pygmt/pull/2515#issuecomment-1685121627

Maybe related to the issues in

Minimal Complete Verifiable Example

import pandas as pd
import pygmt 

size = 5

# Set up random test data
test_dict_int = {
    'a': [ 2,  2, 2, 2],
    'z': [ 8,  6, 7, 3],
    'x': [-3, -1, 1, 3],
    'y': [ 2,  2, 2, 2],
}
test_df_int = pd.DataFrame(data=test_dict_int)

fig = pygmt.Figure()

fig.basemap(
    region=[-size, size, -size, size],
    projection="X" + str(size*2),
    frame=True,
)

fig.plot(
    # data=test_df_int,  # integers -> WORKs
    data=test_df_int.astype(float),  # floats -> FAILs
    incols=[2, 3],
    # verbose="d",
)

# fig.show()
# fig.savefig(fname="bug_MWE.png")

Output of verbose="d"

plot [DEBUG]: Look for file -5/5/-5/5 in C:/Users/Admin/.gmt
plot [DEBUG]: Look for file -5/5/-5/5 in C:/Users/Admin/.gmt/cache
plot [DEBUG]: Look for file -5/5/-5/5 in C:/Users/Admin/.gmt/server
plot [DEBUG]: Got regular w/e/s/n for region (-5/5/-5/5)
plot [INFORMATION]: Processing input table data
plot [DEBUG]: Operation will require 2 input columns [n_cols_start = 2]
plot [DEBUG]: Reset MAP_ANNOT_OBLIQUE to anywhere
plot [DEBUG]: Projected values in meters: -5 5 -5 5
plot [DEBUG]: Computed automatic parameters using dimension scaling: 0.9
plot [INFORMATION]: Map scale is 0.001 km per cm or 1:100.
plot [DEBUG]: Running in PS mode modern
plot [DEBUG]: Use PS filename C:/Users/Admin/.gmt/sessions/gmt_session.20196/gmt_1.ps-
plot [DEBUG]: Append to hidden PS file C:/Users/Admin/.gmt/sessions/gmt_session.20196/gmt_1.ps-
plot [DEBUG]: Got session name as pygmt-session and default graphics formats as pdf
plot [DEBUG]: Basemap order: Frame = above  Grid = below  Tick/Annot = below
plot [DEBUG]: gmtapi_init_import: Passed family = Data Table and geometry = Line
plot [DEBUG]: gmtapi_init_import: Added 1 new sources
plot [DEBUG]: GMT_Init_IO: Returned first Input object ID = 0
plot [DEBUG]: gmtapi_begin_io: Input resource access is now enabled [container]
plot [DEBUG]: gmtapi_import_dataset: Passed ID = -1 and mode = 0
plot [INFORMATION]: Referencing data table from user 4 column arrays of length 4
plot [DEBUG]: Object ID 1 : Registered Data Table Memory Reference 1e0f157bfc0 as an Input resource with geometry Point [n_objects = 2]
plot [DEBUG]: gmtapi_import_dataset processed 1 resources
plot [DEBUG]: GMT_End_IO: Input resource access is now disabled
plot [INFORMATION]: Plotting segment 0
plot [DEBUG]: GMT memory: Initialize 2 temporary column double arrays, each of length : 0

Full error message

Windows fatal exception: code 0xc0000374

Main thread:
Current thread 0x00003654 (most recent call first):
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\pygmt\clib\session.py", line 624 in call_module
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\pygmt\src\plot.py", line 267 in plot
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\pygmt\helpers\decorators.py", line 738 in new_module
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\pygmt\helpers\decorators.py", line 598 in new_module
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\pygmt\helpers\decorators.py", line 818 in new_module
  File "c:\users\admin\c2\eigenedokumente\studium\promotion\e_gmt\00_testing\001_gmt_pygmt\pr_tracksampling\bug_mwe_red.py", line 35 in <module>
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\spyder_kernels\py3compat.py", line 356 in compat_exec
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 473 in exec_code
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 615 in _exec_file
  File "C:\ProgramData\Anaconda3\envs\pygmt_env_dev\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 528 in runfile
  File "C:\Users\Admin\AppData\Local\Temp\ipykernel_1076\1879108342.py", line 1 in <module>

Restarting kernel...
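The traceback above has the shape of the output of Python's faulthandler module. When running outside Spyder, a similar dump can probably be obtained by enabling faulthandler before the crashing call. A minimal sketch, assuming a log file in the temporary directory is acceptable (the gmt_crash_ prefix is an arbitrary choice):

```python
import faulthandler
import tempfile

# Write fatal-error tracebacks to a file so they survive a kernel restart.
# The file object must stay open for as long as faulthandler is enabled.
crash_log = tempfile.NamedTemporaryFile(
    mode="w", prefix="gmt_crash_", suffix=".log", delete=False
)
faulthandler.enable(file=crash_log, all_threads=True)

# ... run the crashing fig.plot(...) call after this point ...
print("fatal-error traceback will be written to", crash_log.name)
```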

System information

PyGMT information:
  version: v0.9.1.dev125
System information:
  python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
  executable: C:\ProgramData\Anaconda3\envs\pygmt_env_dev\python.exe
  machine: Windows-10-10.0.19045-SP0
Dependency information:
  numpy: 1.24.3
  pandas: 2.0.2
  xarray: 2023.1.1.dev17
  netCDF4: 1.6.2
  packaging: 23.1
  contextily: 1.3.0
  geopandas: 0.13.2
  IPython: 8.14.0
  rioxarray: 0.14.1
  ghostscript: 9.54.0
GMT library information:
  binary version: 6.4.0
  cores: 4
  grid layout: rows
  image layout: 
  library path: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt.dll
  padding: 2
  plugin dir: C:/ProgramData/Anaconda3/envs/pygmt_env_dev/Library/bin/gmt_plugins
  share dir: C:/Program Files (x86)/gmt6/share
  version: 6.4.0

seisman commented 10 months ago

I can reproduce the bug on Linux.

If the pd.DataFrame contains only integer-type columns, the debug messages are:

plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt
plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt/cache
plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt/server
plot [DEBUG]: Got regular w/e/s/n for region (-5/5/-5/5)
plot [INFORMATION]: Processing input table data
plot [DEBUG]: Operation will require 2 input columns [n_cols_start = 2]
plot [DEBUG]: Reset MAP_ANNOT_OBLIQUE to anywhere
plot [DEBUG]: Projected values in meters: -5 5 -5 5
plot [DEBUG]: Computed automatic parameters using dimension scaling: 0.9
plot [INFORMATION]: Map scale is 0.001 km per cm or 1:100.
plot [DEBUG]: Running in PS mode modern
plot [DEBUG]: Use PS filename /home/seisman/.gmt/sessions/gmt_session.1478556/gmt_1.ps-
plot [DEBUG]: Append to hidden PS file /home/seisman/.gmt/sessions/gmt_session.1478556/gmt_1.ps-
plot [DEBUG]: Got session name as pygmt-session and default graphics formats as pdf
plot [DEBUG]: Basemap order: Frame = above  Grid = below  Tick/Annot = below
plot [DEBUG]: gmtapi_init_import: Passed family = Data Table and geometry = Line
plot [DEBUG]: gmtapi_init_import: Added 1 new sources
plot [DEBUG]: GMT_Init_IO: Returned first Input object ID = 0
plot [DEBUG]: gmtapi_begin_io: Input resource access is now enabled [container]
plot [DEBUG]: gmtapi_import_dataset: Passed ID = -1 and mode = 0
plot [INFORMATION]: Duplicating data table from user 4 column arrays of length 4
plot [DEBUG]: Object ID 1 : Registered Data Table Memory Copy 560d93e51980 as an Input resource with geometry Point [n_objects = 2]
plot [DEBUG]: gmtapi_import_dataset processed 1 resources
plot [DEBUG]: GMT_End_IO: Input resource access is now disabled
plot [INFORMATION]: Plotting segment 0
plot [DEBUG]: GMT_Destroy_Data: freed memory for a Data Table for object 1
plot [DEBUG]: gmtlib_unregister_io: Unregistering object no 1 [n_objects = 1]
plot [DEBUG]: gmtlib_unregister_io: Object no 1 has non-NULL resource pointer
plot [DEBUG]: Current size of half-baked PS file /home/seisman/.gmt/sessions/gmt_session.1478556/gmt_1.ps- = 23633.

If the pd.DataFrame contains float-type columns, the debug messages are:

plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt
plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt/cache
plot [DEBUG]: Look for file -5/5/-5/5 in /home/seisman/.gmt/server
plot [DEBUG]: Got regular w/e/s/n for region (-5/5/-5/5)
plot [INFORMATION]: Processing input table data
plot [DEBUG]: Operation will require 2 input columns [n_cols_start = 2]
plot [DEBUG]: Reset MAP_ANNOT_OBLIQUE to anywhere
plot [DEBUG]: Projected values in meters: -5 5 -5 5
plot [DEBUG]: Computed automatic parameters using dimension scaling: 0.9
plot [INFORMATION]: Map scale is 0.001 km per cm or 1:100.
plot [DEBUG]: Running in PS mode modern
plot [DEBUG]: Use PS filename /home/seisman/.gmt/sessions/gmt_session.1479589/gmt_1.ps-
plot [DEBUG]: Append to hidden PS file /home/seisman/.gmt/sessions/gmt_session.1479589/gmt_1.ps-
plot [DEBUG]: Got session name as pygmt-session and default graphics formats as pdf
plot [DEBUG]: Basemap order: Frame = above  Grid = below  Tick/Annot = below
plot [DEBUG]: gmtapi_init_import: Passed family = Data Table and geometry = Line
plot [DEBUG]: gmtapi_init_import: Added 1 new sources
plot [DEBUG]: GMT_Init_IO: Returned first Input object ID = 0
plot [DEBUG]: gmtapi_begin_io: Input resource access is now enabled [container]
plot [DEBUG]: gmtapi_import_dataset: Passed ID = -1 and mode = 0
plot [INFORMATION]: Referencing data table from user 4 column arrays of length 4
plot [DEBUG]: Object ID 1 : Registered Data Table Memory Reference 55c909f971a0 as an Input resource with geometry Point [n_objects = 2]
plot [DEBUG]: gmtapi_import_dataset processed 1 resources
plot [DEBUG]: GMT_End_IO: Input resource access is now disabled
plot [INFORMATION]: Plotting segment 0
free(): invalid next size (fast)

Here is the diff:

< plot [INFORMATION]: Duplicating data table from user 4 column arrays of length 4
< plot [DEBUG]: Object ID 1 : Registered Data Table Memory Copy 560d93e51980 as an Input resource with geometry Point [n_objects = 2]
---
> plot [INFORMATION]: Referencing data table from user 4 column arrays of length 4
> plot [DEBUG]: Object ID 1 : Registered Data Table Memory Reference 55c909f971a0 as an Input resource with geometry Point [n_objects = 2]
26,29c26
< plot [DEBUG]: GMT_Destroy_Data: freed memory for a Data Table for object 1
< plot [DEBUG]: gmtlib_unregister_io: Unregistering object no 1 [n_objects = 1]
< plot [DEBUG]: gmtlib_unregister_io: Object no 1 has non-NULL resource pointer
< plot [DEBUG]: Current size of half-baked PS file /home/seisman/.gmt/sessions/gmt_session.1478556/gmt_1.ps- = 23633.
---
> free(): invalid next size (fast)

So, for the integer-type case, data is duplicated, but for the float-type case, data is used by reference.
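A comparison like the one above can be reproduced from two saved verbose logs with Python's difflib. This is a minimal sketch: the normalize and diff_logs helpers are hypothetical names, and the masking rules are assumptions based only on the session numbers and hex addresses visible in the logs above.

```python
import difflib
import re


def normalize(line):
    """Mask values that change between runs but carry no meaning."""
    # Session directories end in a run-specific number.
    line = re.sub(r"gmt_session\.\d+", "gmt_session.<N>", line)
    # Memory addresses are printed as bare hex, e.g. 560d93e51980.
    return re.sub(r"\b[0-9a-f]{11,16}\b", "<ADDR>", line)


def diff_logs(log_a, log_b):
    """Return a unified diff (no context lines) of two verbose logs."""
    a = [normalize(line) for line in log_a.splitlines()]
    b = [normalize(line) for line in log_b.splitlines()]
    return list(difflib.unified_diff(a, b, "int", "float", n=0, lineterm=""))


int_log = (
    "plot [DEBUG]: Use PS filename "
    "/home/seisman/.gmt/sessions/gmt_session.1478556/gmt_1.ps-\n"
    "plot [INFORMATION]: Duplicating data table from user 4 column arrays of length 4"
)
float_log = (
    "plot [DEBUG]: Use PS filename "
    "/home/seisman/.gmt/sessions/gmt_session.1479589/gmt_1.ps-\n"
    "plot [INFORMATION]: Referencing data table from user 4 column arrays of length 4"
)

for line in diff_logs(int_log, float_log):
    print(line)
```

Masking the session number and the addresses keeps the diff focused on the lines that differ in substance, here Duplicating versus Referencing.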

@PaulWessel Need your help.

PaulWessel commented 10 months ago

How is the DataFrame passed to GMT? Via matrix? Also, this looks like a bad sign: "gmtapi_import_dataset: Passed ID = -1 and mode = 0". ID = -1 means "not set", so that can't be good.

seisman commented 10 months ago

It's passed via GMT_Put_Vectors.

seisman commented 10 months ago

Here are the values used in GMT_Open_Virtualfile

family="GMT_IS_DATASET|GMT_VIA_VECTOR"
geometry="GMT_IS_POINT"
direction="GMT_IN|GMT_IS_REFERENCE"

PaulWessel commented 10 months ago

Not clear. Might you share a minimal example that (1) loads the data frame, (2) passes it to some simple module like gmtconvert (assuming that also crashes)? Think I need to debug.

seisman commented 10 months ago

Might you share a minimal example that (1) loads the data frame, (2) passes it to some simple module like gmtconvert (assuming that also crashes)?

Tried to pass the same dataset to gmtconvert, but it doesn't crash.

import pandas as pd
from pygmt.clib import Session

test_dict_int = {
    'a': [ 2,  2, 2, 2],
    'z': [ 8,  6, 7, 3],
    'x': [-3, -1, 1, 3],
    'y': [ 2,  2, 2, 2],
}
data = pd.DataFrame(data=test_dict_int)

with Session() as lib:
    with lib.virtualfile_from_data(data=data) as vintbl:
        lib.call_module("convert", f"{vintbl} -Vd")

The verbose messages are:

gmtconvert [INFORMATION]: Processing input table data
gmtconvert [DEBUG]: gmtapi_init_import: Passed family = Data Table and geometry = Point
gmtconvert [DEBUG]: gmtapi_init_import: Added 1 new sources
gmtconvert [DEBUG]: GMT_Init_IO: Returned first Input object ID = 0
gmtconvert [DEBUG]: gmtapi_begin_io: Input resource access is now enabled [container]
gmtconvert [DEBUG]: gmtapi_import_dataset: Passed ID = -1 and mode = 0
gmtconvert [INFORMATION]: Referencing data table from user 4 column arrays of length 4
gmtconvert [DEBUG]: Object ID 1 : Registered Data Table Memory Reference 555a4fa95820 as an Input resource with geometry Point [n_objects = 2]
gmtconvert [DEBUG]: gmtapi_import_dataset processed 1 resources
gmtconvert [DEBUG]: GMT_End_IO: Input resource access is now disabled
gmtconvert [DEBUG]: Object ID 2 : Registered Data Table Memory Reference 555a4fad6fe0 as an Input resource with geometry Point [n_objects = 3]
gmtconvert [DEBUG]: Successfully duplicated a Data Table
gmtconvert [DEBUG]: Object ID 3 : Registered Data Table Stream 7f3cdcaae780 as an Output resource with geometry Point [n_objects = 4]
gmtconvert [DEBUG]: gmtapi_begin_io: Output resource access is now enabled [container]
gmtconvert [DEBUG]: gmtapi_export_dataset: Passed ID = 3 and mode = 0
gmtconvert [INFORMATION]: Write Data Table to <stdout>
2   8   -3  2
2   6   -1  2
2   7   1   2
2   3   3   2
gmtconvert [DEBUG]: GMT_End_IO: Output resource access is now disabled
gmtconvert [INFORMATION]: 1 tables concatenated, 4 records passed (input cols = 4; output cols = 4)
gmtconvert [DEBUG]: gmtlib_garbage_collection: Destroying object: C=0 A=0 ID=1 W=Input F=Data Table M=Memory Reference S=Used P=555a4fa95820 N=(null)
gmtconvert [DEBUG]: gmtlib_garbage_collection: Destroying object: C=0 A=0 ID=2 W=Input F=Data Table M=Memory Reference S=Unused P=555a4fad6fe0 N=(null)
gmtconvert [DEBUG]: GMTAPI_Garbage_Collection freed 2 memory objects
gmtconvert [DEBUG]: gmtlib_unregister_io: Unregistering object no 1 [n_objects = 3]
gmtconvert [DEBUG]: gmtlib_unregister_io: Unregistering object no 2 [n_objects = 2]
gmtconvert [DEBUG]: gmtlib_unregister_io: Unregistering object no 3 [n_objects = 1]

ID = -1 shows up here as well, and gmtconvert doesn't crash, so it's not the real problem.

seisman commented 10 months ago

For the example in https://github.com/GenericMappingTools/pygmt/issues/2637#issue-1858869645, if I add style="c0.2c" (i.e., -Sc0.2c), the script works. So, it's likely it only crashes when plotting lines.

PaulWessel commented 10 months ago

If I try this:

cat <<- EOF > bug.py
# Set up random test data
import pandas as pd
import pygmt 

size = 5

test_dict_int = {
    'a': [ 2,  2, 2, 2],
    'z': [ 8,  6, 7, 3],
    'x': [-3, -1, 1, 3],
    'y': [ 2,  2, 2, 2],
}
test_df_int = pd.DataFrame(data=test_dict_int)

fig = pygmt.Figure()

fig.basemap(
    region=[-size, size, -size, size],
    projection="X" + str(size*2),
    frame=True,
)

fig.plot(
    # data=test_df_int,  # integers -> WORKs
    data=test_df_int.astype(float),  # floats -> FAILs
    incols=[2, 3],
    # verbose="d",
)

fig.show()
fig.savefig(fname="bug_MWE.png")
EOF

and run

python bug.py

I get no errors and this plot

[Figure: bug_MWE]

What am I missing?

seisman commented 10 months ago

@yvonnefroehlich said "For me, this issue occurs under Windows but not under Linux.". Now I can reproduce the issue under Linux, but you can't reproduce it under macOS.

Need to find out why the behavior is system-dependent.

PaulWessel commented 10 months ago

Need a Linux or Win (@joa-quim ) person to run in debug and determine WTF is going on. I cannot.

joa-quim commented 10 months ago

I could try to start my python learning through a debug session but for that I would need that PyGMT was able to find my gmt.dll (which, ofc, has to be a debug build) as I don't want to mess with Conda and environments stuff.

seisman commented 10 months ago

I would need that PyGMT was able to find my gmt.dll (which, ofc, has to be a debug build)

Just set the GMT_LIBRARY_PATH environment variable to the path to the gmt.dll (something like C:\Users\USERNAME\Mambaforge\envs\pygmt\Library\bin\).

joa-quim commented 10 months ago

OK, but when I said gmt.dll I was being generic. The true name is gmt_w64.dll and if only the path is set via GMT_LIBRARY_PATH then the right dll won't be found.

seisman commented 10 months ago

OK, but when I said gmt.dll I was being generic. The true name is gmt_w64.dll and if only the path is set via GMT_LIBRARY_PATH then the right dll won't be found.

PyGMT will try to find gmt.dll, gmt_w32.dll and gmt_w64.dll
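Before starting a debug session, it may help to check which of these names actually exists in the directory that GMT_LIBRARY_PATH points to. The following is only a pre-flight sketch, not PyGMT's own loading code: the helper names find_gmt_dll and report_library_path are hypothetical, and the candidate names are simply the three listed above.

```python
import os
from pathlib import Path

# Candidate names taken from the comment above; the search itself is a sketch.
CANDIDATE_NAMES = ("gmt.dll", "gmt_w32.dll", "gmt_w64.dll")


def find_gmt_dll(libdir):
    """Return the first candidate DLL found in `libdir`, or None."""
    for name in CANDIDATE_NAMES:
        candidate = Path(libdir) / name
        if candidate.is_file():
            return candidate
    return None


def report_library_path(env=os.environ):
    """Check the directory named by GMT_LIBRARY_PATH, if it is set."""
    libdir = env.get("GMT_LIBRARY_PATH")
    if libdir is None:
        return None
    return find_gmt_dll(libdir)


print("GMT library found:", report_library_path())
```

If this prints None even though the variable is set, the directory most likely does not contain a DLL under any of the three names.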

joa-quim commented 10 months ago

Good, thanks.

seisman commented 9 months ago

Tried to debug this issue. It seems plot crashes when trying to free the GMT_DATASET object https://github.com/GenericMappingTools/gmt/blob/7825ff4632c85ef6569acf19192068b977127e07/src/psxy.c#L3008

        if (GMT_Destroy_Data (API, &D) != GMT_NOERROR) {
            Return (API->error);
        }

Actually it crashes in the gmt_free_segment function (https://github.com/GenericMappingTools/gmt/blob/7825ff4632c85ef6569acf19192068b977127e07/src/gmt_io.c#L8875):

    SH = gmt_get_DS_hidden (segment);
    for (col = 0; col < segment->n_columns; col++) {
        if (SH->alloc_mode[col] == GMT_ALLOC_INTERNALLY)    /* Free data GMT allocated */
            gmt_M_free (GMT, segment->data[col]);
    }
    gmt_M_free (GMT, segment->data);  /* CRASHES HERE! */

SH->alloc_mode[col] is GMT_ALLOC_EXTERNALLY for all four columns, so the segment->data[col] arrays are not freed, but the crash happens when freeing segment->data itself.
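The ownership bookkeeping described here can be modelled in a few lines of Python. This is a toy illustration only, not GMT code: the Segment class and the two string constants are hypothetical stand-ins for the C structures, and "freeing" is just recorded in a list.

```python
from dataclasses import dataclass, field

ALLOC_INTERNALLY = "internal"  # stand-in for GMT_ALLOC_INTERNALLY
ALLOC_EXTERNALLY = "external"  # stand-in for GMT_ALLOC_EXTERNALLY


@dataclass
class Segment:
    """Toy model of a segment: a pointer array plus per-column ownership."""

    alloc_mode: list
    freed: list = field(default_factory=list)


def free_segment(segment):
    """Mirror the loop above: free owned columns, then the pointer array."""
    for col, mode in enumerate(segment.alloc_mode):
        if mode == ALLOC_INTERNALLY:
            segment.freed.append(f"data[{col}]")
    # The pointer array itself is released regardless of column ownership.
    segment.freed.append("data")
    return segment.freed


# Four externally owned columns, as in the failing case.
print(free_segment(Segment([ALLOC_EXTERNALLY] * 4)))
```

In this model only the pointer array is released when all columns are external, which matches the observation that the crash comes from the last free call and not from the per-column loop.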

seisman commented 9 months ago

Ping @PaulWessel Does the above debugging help?

PaulWessel commented 9 months ago

Yes, I got the same. Will debug again to check that the data is indeed internally allocated. Not sure why that would crash, but it does.

PaulWessel commented 9 months ago

Since I cannot reproduce it (works for macOS) and I cannot see why this would depend on the OS, I cannot really help. Someone would need to debug in Linux, but I am not sure what to look for. segment->data is allocated in GMT, so it is fair game to free as long as we don't free the read-only vectors.