Open seisman opened 11 months ago
- For
np.float16
: maybe we have to cast the array intonp.float32
?- For
np.bool_
: maybe convert to 0 and 1 (e.g.,array.astype("int8")
Personally use float16
a lot, but if we're just going to cast the array to float32
, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8. Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?
For
np.timedelta
: Not sure about this yet.
This is an interesting one. I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:
Note though, the 1e14
scaling at the bottom right. I couldn't quite find a way to remove that...
I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert np.timedelta
into something GMT understands.
Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?
The current issue is, the error message is helpless and very confusing, so users don't know that np.float16
is not supported:
>>> import pygmt
>>> import numpy as np
>>> x = np.array([1, 2, 3, 4], dtype=np.float16)
>>>> pygmt.info(x)
File ~/OSS/gmt/pygmt/pygmt/clib/session.py:1275, in Session.virtualfile_from_vectors(self, *vectors)
1273 # Use put_vector for columns with numerical type data
1274 for col, array in enumerate(arrays[:columns]):
-> 1275 self.put_vector(dataset, column=col, vector=array)
1277 # Use put_strings for last column(s) with string type data
1278 # Have to use modifier "GMT_IS_DUPLICATE" to duplicate the strings
1279 string_arrays = arrays[columns:]
File ~/OSS/gmt/pygmt/pygmt/clib/session.py:892, in Session.put_vector(self, dataset, column, vector)
889 vector_pointer = (ctp.c_char_p * len(vector))()
890 if gmt_type == self["GMT_DATETIME"]:
891 vector_pointer[:] = np.char.encode(
--> 892 np.datetime_as_string(array_to_datetime(vector))
893 )
894 else:
895 vector_pointer[:] = np.char.encode(vector)
ValueError: Cannot convert a NumPy datetime value other than NaT with generic units
Personally use
float16
a lot, but if we're just going to cast the array tofloat32
, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8.
I don't think users should be warned for a memory usage increase. GMT does a lot of memory duplication internally and no one cares about it. Instead, we should warn users if we make any changes (e.g., casting types) to the input data.
Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?
After improving the error messages in PR #2856, I'm convinced that it's better to NOT support dtypes like np.float16
, np.bytes_
and np.bool_
.
Ok, shall we rescope this issue to just focus on supporting np.timedelta
then?
Here is a GMT CLI script showing how GMT deal with relative times:
gmt begin times png
gmt set TIME_UNIT d TIME_EPOCH 2023-01-01T00:00:00
gmt basemap -R0t/10t/0/1 -JX10c/5c -Bxa2D -Byaf
gmt plot -Sc0.2c -f0t,1y << EOF
1 0.2
2 0.5
3 0.8
EOF
gmt end show
It produces:
-f0t
tells GMT that the first column contains relative times. Here, 1
means 1 day (TIME_UNIT) since 2023-01-01T00:00:00 (TIME_EPOCH). Then, GMT converts the relative times to absolute times before plotting.
To support np.timedelta64
dtype in PyGMT, we need to:
TIME_EPOCH
. TIME_EPOCH
to the np.deltatime64
object.However, in the Python world, converting np.timedelta64
object to np.datetime64
object is easy and straightforward. Thus, I feel it makes little sense to support Thus, I feel it makes no sense to do the conversion internally in PyGMT.np.timedelta64
in PyGMT.
I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:
Note though, the
1e14
scaling at the bottom right. I couldn't quite find a way to remove that...I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert
np.timedelta
into something GMT understands.
For you case, I think you're expecting a relative-time axis.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.timedelta_range(start="1 day", periods=10)
>>> data
TimedeltaIndex([ '1 days', '2 days', '3 days', '4 days', '5 days',
'6 days', '7 days', '8 days', '9 days', '10 days'],
dtype='timedelta64[ns]', freq='D')
>>> data.to_numpy()
array([ 86400000000000, 172800000000000, 259200000000000, 345600000000000,
432000000000000, 518400000000000, 604800000000000, 691200000000000,
777600000000000, 864000000000000], dtype='timedelta64[ns]')
>>> data.to_numpy(dtype="timedelta64[D]")
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='timedelta64[D]')
I feel we should convert np.timedelta64
dtype to the most suitable int
type.
It's straightforward to support np.timedelta64
type. We just need to add np.timedelta64 : "GMT_LONG"
to https://github.com/GenericMappingTools/pygmt/blob/78dfcf13718f56a755cc16b03ac29029f26235ec/pygmt/clib/session.py#L75-L88
Then the following codes work:
>>> import pygmt
>>> import numpy as np
>>> data = np.arange(np.timedelta64(1, "D"), np.timedelta64(10, "D"))
>>> data
array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='timedelta64[D]')
>>> pygmt.info(data)
'<vector memory>: N = 9 <1/9>\n'
>>> data = data.astype("timedelta64[s]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <86400/777600>\n'
>>> data = data.astype("timedelta64[ns]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <8.64e+13/7.776e+14>\n'
Ok, started a PR for this at #2884, and managed to recreate my example plot above. Example code:
import numpy as np
import pygmt
fig = pygmt.Figure()
fig.basemap(
projection="X8c/5c",
region=[0, 8, 0, 10],
frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
y=np.geomspace(start=0.1, stop=9, num=8),
style="c0.2c",
pen="1p",
)
fig.show()
produces
However, passing np.timedelta64
to the region
parameter of basemap doesn't work. It gets interpreted as a datetime.
fig = pygmt.Figure()
fig.basemap(
projection="X8c/5c",
region=[np.timedelta64(0, "D"), np.timedelta64(8, "D"), 0, 10],
frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
y=np.geomspace(start=0.1, stop=9, num=8),
style="c0.2c",
pen="1p",
)
fig.show()
produces
Not sure why timedelta64
gets cast to a datetime
dtype in region. Looks like we might need to do some extra handling?
Not sure why
timedelta64
gets cast to adatetime
dtype in region. Looks like we might need to do some extra handling?
Figured out how we could use timedelta64
in region
/-R
at d5bedc2ac71a337f2b2952843778d2e8c78c29a5, but the code is not very nice. Best to handle that logic separately after #2884 is merged, as mentioned in https://github.com/GenericMappingTools/pygmt/pull/2884#discussion_r1429007513.
Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from https://github.com/GenericMappingTools/pygmt/pull/2884#discussion_r1428745027:
Apparently Python has a built-in
datetime.timedelta
objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.import datetime td = datetime.timedelta(hours=24) print(np.timedelta64(td)) # numpy.timedelta64(86400000000,'us') print(np.timedelta64(td).astype(int)) # 86400000000
Should we still support
datetime.timedelta
? Or onlynp.timedelta64
?
Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from #2884 (comment):
Apparently Python has a built-in
datetime.timedelta
objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.import datetime td = datetime.timedelta(hours=24) print(np.timedelta64(td)) # numpy.timedelta64(86400000000,'us') print(np.timedelta64(td).astype(int)) # 86400000000
Should we still support
datetime.timedelta
? Or onlynp.timedelta64
?
I guess not.
I've updated the top post (https://github.com/GenericMappingTools/pygmt/issues/2848#issue-2023698530) based on what I learned recently about NumPy dtypes. After PR #3566, there are still a few unsupported dtypes that can be supported.
np.float16
: Cast to np.float32
and warn about memory increasenp.longdouble
: Cast to np.float64
and warn about data lossnp.bool
: Cast to np.uint8
and warn about memory increasenp.datetime64
: Already supported, but unsure about the support of various date/time unitsnp.timedelta64
: Partially supported in #2884, but there are two remaining issues:
timedelta
/np.timedelta64
in the region
parameterFor np.float16
/np.longdouble
/np.bool
, currently, we raise an exception saying that the dtype is unsupported but we don't mention what to do. Instead of raising an exception and exiting, I still think we should cast them into recognized dtypes and raise a warning to let users know what PyGMT is doing and what they can do to suppress the error.
Before starting to support PyArrow arrays (#2800), I think we should first ensure that PyGMT has complete support for NumPy dtypes. Currently, the following NumPy dtypes are supported:
https://github.com/GenericMappingTools/pygmt/blob/e5ecee98e316da0ec9176286f5d1bfb5623bcbc7/pygmt/clib/session.py#L85-L102
Here is a simple way to list all the available NumPy dtypes (xref: https://numpy.org/doc/stable/reference/arrays.scalars.html):
numpy.int8
:GMT_CHAR
numpy.int16
:GMT_SHORT
numpy.int32
:GMT_INT
numpy.int64
:GMT_LONG
numpy.uint8
:GMT_UCHAR
numpy.uint16
:GMT_USHORT
numpy.uint32
:GMT_UINT
numpy.uint64
:GMT_ULONG
numpy.float16
numpy.float32
:GMT_FLOAT
numpy.float64
:GMT_DOUBLE
numpy.longlong
:GMT_LONG
in #3566numpy.ulonglong
:GMT_ULONG
in #3566numpy.longdouble
: alias for np.float96 or np.float128 (xref: https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.longdouble): GMT doesn't support complex floating point data types.numpy.complex64
: GMT doesn't support complex floating point data types.numpy.complex128
: GMT doesn't support complex floating point data types.numpy.clongdouble
numpy.bool
: For bytestring. Makes no sense in GMT.numpy.bytes_
numpy.str_
:GMT_TEXT
numpy.datetime64
:GMT_DATETIME
. But still need to check all time units (https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-units)numpy.timedelta64
:GMT_LONG
#2884numpy.object_
: Any Python objects. We should try to convert it to np.datetime64 or np.str_.: For creating a new structured or unstructured void scalar. Make no sense in GMT.numpy.void
As you can see, most NumPy types are already supported in PyGMT. It's clear that GMT and PyGMT can't support float128 and any complex floating-point types, but at least it's possible to support
np.float16
(np.half
),np.bool_
andnp.timedelta
.np.float16
: maybe we have to cast the array intonp.float32
?np.bool_
: maybe convert to 0 and 1 (e.g.,array.astype("int8")
np.timedelta
: Not sure about this yet.