GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License
758 stars 220 forks source link

Full support of numpy dtypes #2848

Open seisman opened 11 months ago

seisman commented 11 months ago

Before starting to support PyArrow arrays (#2800), I think we should first ensure that PyGMT has complete support for NumPy dtypes. Currently, the following NumPy dtypes are supported:

https://github.com/GenericMappingTools/pygmt/blob/e5ecee98e316da0ec9176286f5d1bfb5623bcbc7/pygmt/clib/session.py#L85-L102

Here is a simple way to list all the available NumPy dtypes (xref: https://numpy.org/doc/stable/reference/arrays.scalars.html):

>>> import numpy as np
>>> set(np.sctypeDict.values())

As you can see, most NumPy types are already supported in PyGMT. It's clear that GMT and PyGMT can't support float128 and any complex floating-point types, but at least it's possible to support np.float16 (np.half), np.bool_ and np.timedelta.

weiji14 commented 11 months ago
  • For np.float16: maybe we have to cast the array into np.float32?
  • For np.bool_: maybe convert to 0 and 1 (e.g., array.astype("int8")

Personally use float16 a lot, but if we're just going to cast the array to float32, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8. Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

For np.timedelta: Not sure about this yet.

This is an interesting one. I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:

rmse_huv500

Note though, the 1e14 scaling at the bottom right. I couldn't quite find a way to remove that...

I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert np.timedelta into something GMT understands.

seisman commented 11 months ago

Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

The current issue is, the error message is helpless and very confusing, so users don't know that np.float16 is not supported:

>>> import pygmt
>>> import numpy as np
>>> x = np.array([1, 2, 3, 4], dtype=np.float16)
>>>> pygmt.info(x)
File ~/OSS/gmt/pygmt/pygmt/clib/session.py:1275, in Session.virtualfile_from_vectors(self, *vectors)
   1273 # Use put_vector for columns with numerical type data
   1274 for col, array in enumerate(arrays[:columns]):
-> 1275     self.put_vector(dataset, column=col, vector=array)
   1277 # Use put_strings for last column(s) with string type data
   1278 # Have to use modifier "GMT_IS_DUPLICATE" to duplicate the strings
   1279 string_arrays = arrays[columns:]

File ~/OSS/gmt/pygmt/pygmt/clib/session.py:892, in Session.put_vector(self, dataset, column, vector)
    889 vector_pointer = (ctp.c_char_p * len(vector))()
    890 if gmt_type == self["GMT_DATETIME"]:
    891     vector_pointer[:] = np.char.encode(
--> 892         np.datetime_as_string(array_to_datetime(vector))
    893     )
    894 else:
    895     vector_pointer[:] = np.char.encode(vector)

ValueError: Cannot convert a NumPy datetime value other than NaT with generic units

Personally use float16 a lot, but if we're just going to cast the array to float32, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8.

I don't think users should be warned for a memory usage increase. GMT does a lot of memory duplication internally and no one cares about it. Instead, we should warn users if we make any changes (e.g., casting types) to the input data.

seisman commented 11 months ago

Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

After improving the error messages in PR #2856, I'm convinced that it's better to NOT support dtypes like np.float16, np.bytes_ and np.bool_.

weiji14 commented 11 months ago

Ok, shall we rescope this issue to just focus on supporting np.timedelta then?

seisman commented 11 months ago

Here is a GMT CLI script showing how GMT deal with relative times:

gmt begin times png
gmt set TIME_UNIT d TIME_EPOCH 2023-01-01T00:00:00
gmt basemap -R0t/10t/0/1 -JX10c/5c -Bxa2D -Byaf
gmt plot -Sc0.2c -f0t,1y << EOF
1 0.2
2 0.5
3 0.8
EOF
gmt end show

It produces: times

-f0t tells GMT that the first column contains relative times. Here, 1 means 1 day (TIME_UNIT) since 2023-01-01T00:00:00 (TIME_EPOCH). Then, GMT converts the relative times to absolute times before plotting.

To support np.timedelta64 dtype in PyGMT, we need to:

  1. Get the current value of TIME_EPOCH.
  2. Add the TIME_EPOCH to the np.deltatime64 object.

However, in the Python world, converting np.timedelta64 object to np.datetime64 object is easy and straightforward. Thus, I feel it makes little sense to support np.timedelta64 in PyGMT. Thus, I feel it makes no sense to do the conversion internally in PyGMT.

seisman commented 11 months ago

I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:

rmse_huv500

Note though, the 1e14 scaling at the bottom right. I couldn't quite find a way to remove that...

I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert np.timedelta into something GMT understands.

For you case, I think you're expecting a relative-time axis.

>>> import pandas as pd
>>> import numpy as np
>>> data = pd.timedelta_range(start="1 day", periods=10)
>>> data
TimedeltaIndex([ '1 days',  '2 days',  '3 days',  '4 days',  '5 days',
                 '6 days',  '7 days',  '8 days',  '9 days', '10 days'],
               dtype='timedelta64[ns]', freq='D')
>>> data.to_numpy()
array([ 86400000000000, 172800000000000, 259200000000000, 345600000000000,
       432000000000000, 518400000000000, 604800000000000, 691200000000000,
       777600000000000, 864000000000000], dtype='timedelta64[ns]')
>>> data.to_numpy(dtype="timedelta64[D]")
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype='timedelta64[D]')

I feel we should convert np.timedelta64 dtype to the most suitable int type.

seisman commented 11 months ago

It's straightforward to support np.timedelta64 type. We just need to add np.timedelta64 : "GMT_LONG" to https://github.com/GenericMappingTools/pygmt/blob/78dfcf13718f56a755cc16b03ac29029f26235ec/pygmt/clib/session.py#L75-L88

Then the following codes work:

>>> import pygmt
>>> import numpy as np
>>> data = np.arange(np.timedelta64(1, "D"), np.timedelta64(10, "D"))
>>> data
array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='timedelta64[D]')
>>> pygmt.info(data)
'<vector memory>: N = 9 <1/9>\n'

>>> data = data.astype("timedelta64[s]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <86400/777600>\n'

>>> data = data.astype("timedelta64[ns]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <8.64e+13/7.776e+14>\n'
weiji14 commented 11 months ago

Ok, started a PR for this at #2884, and managed to recreate my example plot above. Example code:

import numpy as np
import pygmt

fig = pygmt.Figure()
fig.basemap(
    projection="X8c/5c",
    region=[0, 8, 0, 10],
    frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
    x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
    y=np.geomspace(start=0.1, stop=9, num=8),
    style="c0.2c",
    pen="1p",
)
fig.show()

produces

forecast_rmse

However, passing np.timedelta64 to the region parameter of basemap doesn't work. It gets interpreted as a datetime.

fig = pygmt.Figure()
fig.basemap(
    projection="X8c/5c",
    region=[np.timedelta64(0, "D"), np.timedelta64(8, "D"), 0, 10],
    frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
    x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
    y=np.geomspace(start=0.1, stop=9, num=8),
    style="c0.2c",
    pen="1p",
)
fig.show()

produces

forecast_rmse2

Not sure why timedelta64 gets cast to a datetime dtype in region. Looks like we might need to do some extra handling?

weiji14 commented 10 months ago

Not sure why timedelta64 gets cast to a datetime dtype in region. Looks like we might need to do some extra handling?

Figured out how we could use timedelta64 in region/-R at d5bedc2ac71a337f2b2952843778d2e8c78c29a5, but the code is not very nice. Best to handle that logic separately after #2884 is merged, as mentioned in https://github.com/GenericMappingTools/pygmt/pull/2884#discussion_r1429007513.


Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from https://github.com/GenericMappingTools/pygmt/pull/2884#discussion_r1428745027:

Apparently Python has a built-in datetime.timedelta objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.

import datetime

td = datetime.timedelta(hours=24)

print(np.timedelta64(td))
# numpy.timedelta64(86400000000,'us')
print(np.timedelta64(td).astype(int))
# 86400000000

Should we still support datetime.timedelta? Or only np.timedelta64?

seisman commented 10 months ago

Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from #2884 (comment):

Apparently Python has a built-in datetime.timedelta objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.

import datetime

td = datetime.timedelta(hours=24)

print(np.timedelta64(td))
# numpy.timedelta64(86400000000,'us')
print(np.timedelta64(td).astype(int))
# 86400000000

Should we still support datetime.timedelta? Or only np.timedelta64?

I guess not.

seisman commented 1 week ago

I've updated the top post (https://github.com/GenericMappingTools/pygmt/issues/2848#issue-2023698530) based on what I learned recently about NumPy dtypes. After PR #3566, there are still a few unsupported dtypes that can be supported.

  1. np.float16: Cast to np.float32 and warn about memory increase
  2. np.longdouble: Cast to np.float64 and warn about data loss
  3. np.bool: Cast to np.uint8 and warn about memory increase
  4. np.datetime64: Already supported, but unsure about the support of various date/time units
  5. np.timedelta64: Partially supported in #2884, but there are two remaining issues:
    • data are passed as 64-bit integers to the GMT C API, and we can't (at least we don't know how) tell GMT that the data is relative times.
    • Support timedelta/np.timedelta64 in the region parameter

For np.float16/np.longdouble/np.bool, currently, we raise an exception saying that the dtype is unsupported but we don't mention what to do. Instead of raising an exception and exiting, I still think we should cast them into recognized dtypes and raise a warning to let users know what PyGMT is doing and what they can do to suppress the error.