GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License

Running info on pandas.DataFrame with time column doesn't work #597

Closed weiji14 closed 3 years ago

weiji14 commented 4 years ago

Description of the problem

Just noticed that passing datetime columns into pygmt.info doesn't work. This follows on from the pandas.DataFrame input support for pygmt.info added in #574; see also #464 and #562, where the datetime machinery should be more or less implemented.

Full code that generated the error

import pygmt
import pandas as pd

table = pd.DataFrame(data=[1,3,2,5,4], columns=["z"])
table["time"] = pd.date_range(start="2020-01-01", periods=5)

pygmt.info(table=table)

Note that the equivalent gmt command does work on datetime inputs (here with the same table saved as temp.txt).

!gmt info temp.txt
temp.txt: N = 5 <1/5>   <2020-01-01T00:00:00/2020-01-05T00:00:00>

Full error message

---------------------------------------------------------------------------
GMTCLibError                              Traceback (most recent call last)
<ipython-input-6-a9e68c3dc07e> in <module>
----> 1 pygmt.info(table=table)

~/pygmt/pygmt/helpers/decorators.py in new_module(*args, **kwargs)
    235                 if alias in kwargs:
    236                     kwargs[arg] = kwargs.pop(alias)
--> 237             return module_func(*args, **kwargs)
    238 
    239         new_module.aliases = aliases

~/pygmt/pygmt/modules.py in info(table, **kwargs)
    116 
    117         with GMTTempFile() as tmpfile:
--> 118             with file_context as fname:
    119                 arg_str = " ".join(
    120                     [fname, build_arg_string(kwargs), "->" + tmpfile.name]

~/miniconda3/envs/pygmt/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/pygmt/pygmt/clib/session.py in virtualfile_from_matrix(self, matrix)
   1268         )
   1269 
-> 1270         self.put_matrix(dataset, matrix)
   1271 
   1272         with self.open_virtual_file(

~/pygmt/pygmt/clib/session.py in put_matrix(self, dataset, matrix, pad)
    906         )
    907         if status != 0:
--> 908             raise GMTCLibError("Failed to put matrix of type {}.".format(matrix.dtype))
    909 
    910     def write_data(self, family, geometry, mode, wesn, output, data):

GMTCLibError: Failed to put matrix of type object.

System information

Please paste the output of python -c "import pygmt; pygmt.show_versions()":

PyGMT information:
  version: v0.1.2+55.g6deb388
System information:
  python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)  [GCC 7.5.0]
  executable: ~/miniconda3/envs/pygmt/bin/python
  machine: Linux-4.19.0-8-amd64-x86_64-with-debian-10.5
Dependency information:
  numpy: 1.19.1
  pandas: 1.1.1
  xarray: 0.16.0
  netCDF4: 1.5.3
  packaging: 20.4
  ghostscript: 9.27
  gmt: None
GMT library information:
  binary dir: ~/miniconda3/envs/pygmt/bin
  cores: 2
  grid layout: rows
  library path: ~/miniconda3/envs/pygmt/lib/libgmt.so
  padding: 2
  plugin dir: ~/miniconda3/envs/pygmt/lib/gmt/plugins
  share dir: ~/miniconda3/envs/pygmt/share/gmt
  version: 6.1.1

seisman commented 4 years ago

Here is the definition of the GMT_Put_Matrix function:

int GMT_Put_Matrix (void *API, struct GMT_MATRIX *M, unsigned int type, int pad, void *matrix) 

The third parameter, type, is the data type of the matrix, e.g., GMT_DOUBLE or GMT_FLOAT. This means that all elements of the matrix must have exactly the same data type. Thus, in PyGMT, we can't pass 2D numpy arrays with mixed data types to the put_matrix function.

The fix seems easy. We may have to pass 2D arrays as a series of vectors, via virtualfile_from_vectors.
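To illustrate why the matrix path fails, here is a minimal sketch using only pandas and numpy (no PyGMT calls). It rebuilds the table from the report and compares the dtype of the whole-frame 2D array with the dtypes of the individual columns. The variable names `matrix` and `vectors` are just for illustration.

```python
import numpy as np
import pandas as pd

# Same table as in the original report: an integer column and a datetime column
table = pd.DataFrame(data=[1, 3, 2, 5, 4], columns=["z"])
table["time"] = pd.date_range(start="2020-01-01", periods=5)

# Converting the whole frame to one 2D array forces a single common dtype,
# which for int + datetime columns is the generic "object" dtype
matrix = table.to_numpy()
print(matrix.dtype)  # object

# Converting column by column keeps each column's own dtype
vectors = [table[name].to_numpy() for name in table.columns]
print([vector.dtype for vector in vectors])
```

The object dtype of `matrix` is what shows up in the "Failed to put matrix of type object" message, whereas each entry of `vectors` keeps a concrete numeric or datetime dtype.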

seisman commented 4 years ago

Ping @weiji14.

weiji14 commented 4 years ago

> The third parameter type is the data type of the matrix, e.g., GMT_DOUBLE, GMT_FLOAT. It also means that all elements of the matrix must have the exact same data type. Thus, in PyGMT, we can't pass 2D numpy arrays with mixed data types to put_matrix function.
>
> The fix seems easy. We may have to pass 2D arrays as a series of vectors, via virtualfile_from_vectors.

Right, so we'll need to have something like an if-then or try-except to handle mixed dtypes. A couple of other details to consider:

  1. Do we switch to using put_vectors for info all the time (which will involve a for-loop), or do we check whether the dtypes are mixed and use put_vectors only then, falling back to put_matrix as usual otherwise?

     Note that a numpy.array always has a single dtype; it will just be np.object if the element types are mixed. pandas.DataFrames are the ones that can explicitly have different dtypes in different columns.

  2. Should we generalize info to properly handle other mixed dtype combinations (e.g. int32/float32/etc.)? Thinking about #547 here.

I've got a unit test for this written up already and will submit a PR soon, just need to work out these implementation details :smile:.
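The "check if dtypes are mixed" option could look roughly like the sketch below. `choose_put_strategy` is a hypothetical helper written for this illustration, not a PyGMT function, and it only returns a label rather than calling put_vectors or put_matrix.

```python
import numpy as np
import pandas as pd


def choose_put_strategy(table):
    """Hypothetical helper: return "vectors" or "matrix" for a given table."""
    if isinstance(table, pd.DataFrame):
        # A DataFrame can carry a different dtype in every column
        dtypes = set(table.dtypes)
        mixed = len(dtypes) > 1 or any(dt == object for dt in dtypes)
    else:
        # A numpy array has one dtype; mixed contents show up as object
        mixed = np.asarray(table).dtype == object
    return "vectors" if mixed else "matrix"


frame = pd.DataFrame(data=[1, 3, 2, 5, 4], columns=["z"])
frame["time"] = pd.date_range(start="2020-01-01", periods=5)

print(choose_put_strategy(frame))  # vectors
print(choose_put_strategy(np.arange(6.0).reshape(3, 2)))  # matrix
```

Always using vectors would remove this branch at the cost of a loop over columns. The check keeps the existing matrix path for homogeneous input.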

weiji14 commented 4 years ago

Just following up on this, we've merged in #619 so if you install PyGMT from the master branch, passing in datetime inputs won't result in "GMTCLibError: Failed to put matrix of type object." anymore. However, the datetime column's ranges will be reported in UNIX timestamps instead of ISO datetimes.

A workaround for this as mentioned at https://github.com/GenericMappingTools/gmt/issues/4241#issuecomment-695958278 is to use something like pygmt.info(table=df, f="1T"), which would explicitly tell GMT that the second column is a datetime type, and should be handled that way.

We will close this issue once this upstream GMT issue at https://github.com/GenericMappingTools/gmt/issues/4241 is resolved, and perhaps when PyGMT bumps the minimum required version to GMT 6.2.0 and/or when conda GMT 6.2.0.dev builds are available with https://github.com/conda-forge/gmt-feedstock/pull/100.
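For results that still come back as UNIX timestamps, a post hoc conversion with pandas is one option. The sketch below is independent of PyGMT. The two values in `unix_range` are illustrative only and were computed by hand for 2020-01-01 and 2020-01-05 (the range of the example table), not copied from actual pygmt.info output.

```python
import numpy as np
import pandas as pd

# Illustrative min/max in seconds since 1970-01-01 (assumed values, see above)
unix_range = np.array([1577836800.0, 1578182400.0])

# Interpret the numbers as seconds and format them as ISO 8601 strings
iso_range = pd.to_datetime(unix_range, unit="s").strftime("%Y-%m-%dT%H:%M:%S")
print(list(iso_range))
# ['2020-01-01T00:00:00', '2020-01-05T00:00:00']
```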

weiji14 commented 3 years ago

> A workaround for this as mentioned at GenericMappingTools/gmt#4241 (comment) is to use something like pygmt.info(table=df, f="1T"), which would explicitly tell GMT that the second column is a datetime type, and should be handled that way.

So the workaround doesn't quite work because of the way we've implemented things in #619 using np.loadtxt:

import pandas as pd
import pygmt

table = pd.date_range(start="2010-01-01", end="2020-01-01")
pygmt.info(table=table, spacing="1Y", f="0T")

errors with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-88-cac984c8d7d8> in <module>
----> 1 pygmt.info(table=df[[time_var, elev_var]], spacing=f"1W/{spacing}", f="0T")

~/miniconda3/envs/pygmt/src/pygmt/pygmt/helpers/decorators.py in new_module(*args, **kwargs)
    268                 if alias in kwargs:
    269                     kwargs[arg] = kwargs.pop(alias)
--> 270             return module_func(*args, **kwargs)
    271 
    272         new_module.aliases = aliases

~/miniconda3/envs/pygmt/src/pygmt/pygmt/modules.py in info(table, **kwargs)
    137             if result.startswith(("-R", "-T")):  # e.g. -R0/1/2/3 or -T0/9/1
    138                 result = result[2:].replace("/", " ")
--> 139             result = np.loadtxt(result.splitlines())
    140 
    141         return result

~/miniconda3/envs/pygmt/lib/python3.8/site-packages/numpy/lib/npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows)
   1137         # converting the data
   1138         X = None
-> 1139         for x in read_data(_loadtxt_chunksize):
   1140             if X is None:
   1141                 X = np.array(x, dtype)

~/miniconda3/envs/pygmt/lib/python3.8/site-packages/numpy/lib/npyio.py in read_data(chunk_size)
   1065 
   1066             # Convert each value according to its column and store
-> 1067             items = [conv(val) for (conv, val) in zip(converters, vals)]
   1068 
   1069             # Then pack it according to the dtype's nesting

~/miniconda3/envs/pygmt/lib/python3.8/site-packages/numpy/lib/npyio.py in <listcomp>(.0)
   1065 
   1066             # Convert each value according to its column and store
-> 1067             items = [conv(val) for (conv, val) in zip(converters, vals)]
   1068 
   1069             # Then pack it according to the dtype's nesting

~/miniconda3/envs/pygmt/lib/python3.8/site-packages/numpy/lib/npyio.py in floatconv(x)
    761         if '0x' in x:
    762             return float.fromhex(x)
--> 763         return float(x)
    764 
    765     typ = dtype.type

ValueError: could not convert string to float: '2019-05-19T20:53:51'

np.loadtxt assumes that the text is to be read as floating point numbers, but datetimes like "2019-05-19T20:53:51" are not floats. We'll need to set the dtype using np.loadtxt(..., dtype=???), where ??? is "str,float" or something similar (ref https://stackoverflow.com/a/31554777/6611055).

https://github.com/GenericMappingTools/pygmt/blob/c7c5eaecf8fedbca584743f6adca4a378851ba9a/pygmt/modules.py#L139
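One possible shape for such a fix, sketched with numpy only: try the default float parsing first and fall back to a string dtype when that fails. This is an assumption about how the parsing could be made tolerant, not necessarily what ended up in PyGMT. The `result` string is a made-up stand-in for the text that info writes to its temporary file.

```python
import numpy as np

# Made-up stand-in for GMT's text output after stripping "-R" and the slashes
result = "2010-01-01T00:00:00 2020-01-01T00:00:00 0 0"

try:
    # Default behaviour: every field must parse as a float
    region = np.loadtxt(result.splitlines())
except ValueError:
    # Datetime strings are not floats, so keep all fields as strings instead
    region = np.loadtxt(result.splitlines(), dtype="str")

print(region)
```

With this fallback, purely numeric output still comes back as a float array, while output containing datetimes comes back as an array of strings that the caller can convert as needed.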

weiji14 commented 3 years ago

Alright, with #960 merged, anyone installing PyGMT from the master branch (see https://www.pygmt.org/v0.3.0/install.html#using-pip) should be able to use the coltypes="0T" workaround for GMT 6.1.1 (where 0T means the first column contains time), i.e.:

import pandas as pd
import pygmt

table = pd.date_range(start="2010-01-01", end="2020-01-01")
region = pygmt.info(table=table, spacing="1Y", coltypes="0T")
print(region)
# ['2010-01-01T00:00:00' '2020-01-01T00:00:00' '0' '0']

Assuming that https://github.com/GenericMappingTools/gmt/issues/4241 is resolved in GMT 6.2.0, GMT 6.2.0 users won't need the coltypes parameter in the future (which saves people from needing to know the index of the time column).

weiji14 commented 3 years ago

FYI, https://github.com/GenericMappingTools/gmt/issues/4241 has been magically resolved, so this issue can be resolved when PyGMT bumps the minimum version to GMT 6.2.0!

maxrjones commented 3 years ago

Fixed by https://github.com/GenericMappingTools/gmt/pull/4849

weiji14 commented 3 years ago

Phew, thanks team, glad to close down another >6 month old issue!