I went through the updates and collected some changes that may be relevant to STARE.
Dependency Updates
    Name             Version
    ----             -------
    astropy          5.2.2
    cartopy          0.21.1
    gdal             3.6.3
    geopandas        0.12.2
    geopandas-base   0.12.2
    geos             3.11.2
    h5py             3.8.0
    hdf4             4.2.15
    hdf5             1.12.2
    hdfeos2          2.20
    matplotlib       3.7.1
    numpy            1.24.2
    pandas           2.0.0
    proj             9.1.1
    pygeos           0.14
    pyhdf            0.10.5
    pyproj           3.5.0
    pyshp            2.3.1
    pytest           7.3.1
    python           3.11.3
    shapely          2.0.1

STARE Installs

    pystare          0.8.12
    staremaster      0.0.4
    starepandas      0.6.6
Notes:
GeoPandas, PyGEOS and Shapely 2.0
GeoPandas has deprecated support for the PyGEOS backend in favor of Shapely 2.0 (which has merged with PyGEOS).
Control this with geopandas.options.use_pygeos = True/False, or by setting the environment variable USE_PYGEOS=1/0.
PyGEOS was merged with Shapely in December 2021 and has been released as part of Shapely 2.0.
Migrating to Shapely 2.0: This is a major release with a refactor of the internals with considerable performance improvements and with several breaking changes.
Geometry objects have become immutable.
In-place changes to coordinates are no longer allowed.
Assigning custom attributes is no longer allowed.
Multi-part geometries (MultiPoint, MultiLineString, MultiPolygon and GeometryCollection) are no longer list-like 'sequences' (length, iterable, indexable).
So for a MultiPoint object mp you can no longer use operations such as for part in mp:, mp[1], len(mp) or list(mp).
Instead, use the geoms property of mp. For example, for part in mp.geoms:, mp.geoms[1], len(mp.geoms) or list(mp.geoms).
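The new access pattern can be sketched as follows (assumes Shapely 2.0 is installed):

```python
from shapely.geometry import MultiPoint

mp = MultiPoint([(0, 0), (1, 1), (2, 2)])

# Shapely 2.0: access the parts through the .geoms property
parts = list(mp.geoms)   # list(mp) raises TypeError in 2.0
second = mp.geoms[1]     # mp[1] no longer works
n = len(mp.geoms)        # len(mp) no longer works
```

Code that iterates over multi-part geometries is the most common thing to break when moving STARE-related code to Shapely 2.0.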
Interoperability with NumPy
Shapely provides an interface to access coordinates as NumPy arrays.
For example, given line = LineString(...), use line_coords = np.array(line.coords). (Note that np.asarray(line) no longer yields coordinates in Shapely 2.0: the geometry array interface was removed, so geometries behave as scalars in NumPy arrays.)
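A minimal sketch, assuming Shapely 2.0 and NumPy:

```python
import numpy as np
from shapely.geometry import LineString

line = LineString([(0, 0), (1, 1), (2, 0)])

# Coordinates come out as an (N, 2) float array
line_coords = np.array(line.coords)
```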
Consistent creation of empty geometries
Shapely now consistently gives an empty geometry object of the correct type, instead of using an empty GeometryCollection as a generic empty geometry object.
Deprecated Functionality
The empty() method on a geometry object is deprecated.
The shapely.ops.cascaded_union function is deprecated. Use shapely.ops.unary_union instead.
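The replacement is a drop-in rename; a sketch assuming Shapely 2.0:

```python
from shapely.geometry import Point
from shapely.ops import unary_union

# Three overlapping discs; unary_union replaces the deprecated cascaded_union
discs = [Point(x, 0).buffer(0.75) for x in range(3)]
merged = unary_union(discs)
```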
Pandas 2.0 adds support for Apache Arrow as an optional backend (columnar memory format) as an alternative to NumPy.
Arrow allows for vast improvements when operating on string columns compared with NumPy.
Previously, columns with strings were cast as the object dtype, as required by NumPy.
Now, you can use dtype="string", dtype=pd.StringDtype() or .astype("string") to create a string-based column.
A PyArrow-backed column can be requested specifically by casting to or specifying a column's dtype as f"{dtype}[pyarrow]", e.g. "int64[pyarrow]" for an integer column.
Alternatively, a PyArrow dtype can be created with dtype = pandas.ArrowDtype(pyarrow.int64()).
Representation of "Missing values" (None)
Previously, pandas used NumPy NaN to represent missing values, but because NaN is an np.float64, any numeric column with missing values was converted to np.float64.
With Arrow, missing values can be represented with a Python None, which preserves the column's data type.
Index can now hold numpy numeric dtypes.
This allows operations to create indexes with lower bit sizes (e.g. 16-bit indexes).
Index set operations Index.union(), Index.intersection(), Index.difference(), and Index.symmetric_difference() now support sort=True.
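Both points can be illustrated briefly (assumes pandas 2.0 and NumPy):

```python
import numpy as np
import pandas as pd

# Lower-bit-width numeric indexes are now preserved instead of upcast to int64
idx = pd.Index([1, 2, 3], dtype=np.int16)

# Index set operations now accept sort=True
left = pd.Index([3, 1, 2])
right = pd.Index([2, 4])
union = left.union(right, sort=True)
```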
Copy-on-Write (CoW).
This is a way to deal with inconsistencies in pandas indexing operations.
A 'copy' of a DataFrame means that modifications to the parent or child DataFrame (the copy) are not shared.
A 'view' of a DataFrame means that modifications affect both the parent and child DataFrames.
Previously, some pandas operations returned a copy, while others returned a view.
This led to unwanted and difficult to detect side effects.
With CoW, a child DataFrame/Series always behaves as a view (i.e. no extra memory usage, a lazy copy) until either the parent or the child is modified, at which point the child is converted to a copy (deferred memory use).
This ensures that pandas DataFrames/Series can only be modified directly, rather than inheriting changes via a view dependency.
Thus pandas now issues warnings/errors for in-place updates through a view dependency (e.g. chained assignment).
Deferring the copy provides a significant performance improvement compared to copying eagerly.
Thus, accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object.
Copy-on-Write can be enabled through pd.set_option("mode.copy_on_write", True) or pd.options.mode.copy_on_write = True
The inplace and copy keywords will eventually be deprecated and then removed.
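A minimal CoW sketch, assuming pandas 2.0:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
s = df["a"]      # a lazy copy: no data duplicated yet
s.iloc[0] = 99   # the modification triggers the actual copy

# Under CoW the parent df is untouched by the write to s
```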
Non-nanosecond resolution in Timestamps.
date_range() and timedelta_range() now support a unit keyword ("s", "ms", "us", or "ns") to specify the desired resolution of the output index.
DatetimeIndex.as_unit() and TimedeltaIndex.as_unit() convert to different resolutions ("s", "ms", "us", or "ns").
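For example (assuming pandas 2.0):

```python
import pandas as pd

# Request second resolution directly from date_range
idx = pd.date_range("2020-01-01", periods=3, freq="D", unit="s")

# Convert an existing index to millisecond resolution
idx_ms = idx.as_unit("ms")
```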
DataFrame.to_json() now supports a mode keyword with supported inputs 'w' and 'a'.
Backwards incompatible API changes.
Construction with datetime64 or timedelta64 dtype with unsupported resolution.
Previously, a Series or DataFrame constructed with a "datetime64" or "timedelta64" dtype at an unsupported resolution (i.e. anything other than "ns", say dtype="datetime64[s]") was silently coerced to a nanosecond dtype (datetime64[ns]) anyway.
Now dtype="datetime64[s]" works as expected.
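A quick check of the new behavior (pandas 2.0):

```python
import pandas as pd

# pandas 2.0 keeps the requested resolution instead of coercing to [ns]
s = pd.Series([pd.Timestamp("2020-01-01")], dtype="datetime64[s]")
```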
UTC and fixed-offset timezones default to standard-library tzinfo objects
Previously, the default tzinfo object used to represent UTC was pytz.UTC.
Now pandas defaults to datetime.timezone.utc.
Similarly, for timezones representing fixed UTC offsets, pandas now uses datetime.timezone objects instead of pytz.FixedOffset objects.
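For illustration (assuming pandas 2.0):

```python
from datetime import timezone

import pandas as pd

# The tzinfo is now the standard-library object rather than pytz.UTC
ts = pd.Timestamp("2023-04-01 12:00", tz="UTC")
```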
Empty DataFrames/Series will now default to have a RangeIndex
In the past, to_datetime() guessed the format for each element independently.
Now parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).
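For example, the format inferred from the first element is applied to the rest (pandas 2.0):

```python
import pandas as pd

# "%d-%m-%Y" is inferred from "31-12-2019" (month 31 is impossible),
# so the second element parses day-first as 2020-02-01, not 2020-01-02
parsed = pd.to_datetime(["31-12-2019", "01-02-2020"])
```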
Pandas uses SQLAlchemy, which has also undergone a major update (version 2.0+). As a consequence, pandas SQL IO code written against the old SQLAlchemy syntax, particularly DataFrame.to_sql and pd.read_sql (via pd.read_sql_query and pd.read_sql_table), is no longer compatible with the new SQLAlchemy syntax.
The upgrade to SQLAlchemy 2.0+ syntax is not backwards compatible.
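One way to sidestep the SQLAlchemy syntax changes entirely is to pass a plain DBAPI connection, which pandas supports directly for sqlite3. A sketch; the table and column names (granules, sid) are illustrative only:

```python
import sqlite3

import pandas as pd

# A raw sqlite3 connection avoids SQLAlchemy altogether
con = sqlite3.connect(":memory:")
pd.DataFrame({"sid": [1, 2], "name": ["A", "B"]}).to_sql("granules", con, index=False)
out = pd.read_sql_query("SELECT sid, name FROM granules", con)
con.close()
```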
When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.
pip install "pandas[performance, aws]>=2.0.0"
The available extras, found in the installation guide, are [all, performance, computation, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test].
Pandas 2.0 also increases minimum versions for some optional dependencies, e.g. pytest (dev) 7.0.0, python-dateutil 2.8.2, matplotlib 3.6.1, xarray 0.21.0.