Pull Request Checklist:
[ ] This PR addresses an already opened issue (for bug fixes / features)
This PR fixes #xyz
[x] (If applicable) Documentation has been added / updated (for bug fixes / features).
[x] (If applicable) Tests have been added.
[x] This PR does not seem to break the templates.
[x] HISTORY.rst has been updated (with summary of main changes).
[x] Link to issue (:issue:number) and pull request (:pull:number) has been added.
What kind of change does this PR introduce?
The date_start and date_end columns are now cast to a datetime64[ms] dtype (not a Period).
Improvements to date_parser.
Rewrite of subset_file_coverage.
Removal of driving_institution as an official xscen column.
Pin of pandas >= 2.
Pandas 2 now supports datetime columns with s, ms and us resolutions, in addition to the old ns default. This allows storing dates from before 1677 and after 2262. However, this support is still partial, as many of the datetime manipulation methods still fail on "out of bounds" dates. These include pd.read_csv and pd.to_datetime, among others. Because of this limitation, I had to implement the parsing directly in the DataCatalog's init, using a solution proposed on stackoverflow.
Even with this strange workaround, opening simulation.json went from 3 s to 800 ms on my machine!
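For reviewers unfamiliar with the non-nanosecond dtypes, here is a minimal sketch of one way to get out-of-bounds dates into a datetime64[ms] column with pandas >= 2. It is not the code used in the DataCatalog's init: the helper name `parse_dates_ms` is hypothetical, and going through a NumPy array is only an assumption about how such a workaround can be written.

```python
import numpy as np
import pandas as pd


def parse_dates_ms(strings):
    """Parse ISO date strings into a millisecond-resolution datetime Series.

    Hypothetical helper for illustration only. NumPy parses the strings
    directly into datetime64[ms], and pandas >= 2 keeps that resolution
    instead of converting to nanoseconds.
    """
    arr = np.array(strings, dtype="datetime64[ms]")
    return pd.Series(arr)


# Dates outside the nanosecond range (roughly 1677 to 2262) are kept as-is.
dates = parse_dates_ms(["1000-01-01", "2000-06-15", "2500-12-31"])
print(dates.dtype)
print(dates.dt.year.tolist())
```

With pandas < 2 the same Series construction would try to convert to nanoseconds and fail on the first and last dates, which is part of why the pandas pin is needed.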
The change had repercussions in other parts of xscen, especially date_parser and subset_file_coverage. I adapted the former to output pd.Timestamp objects by default, and the latter to rely more on the Interval functionality that pandas already provides for datetime bounds.
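To illustrate the kind of Interval logic meant here, the sketch below computes how much of a requested period is covered by a set of files, using pd.Interval with pd.Timestamp bounds. It is not the actual subset_file_coverage implementation: the function name `coverage_fraction` and its arguments are hypothetical, and it assumes the files do not overlap each other.

```python
import pandas as pd


def coverage_fraction(file_bounds, period):
    """Return the fraction of `period` covered by the files.

    Hypothetical helper for illustration only.
    file_bounds: iterable of (start, end) date strings, assumed non-overlapping.
    period: (start, end) date strings of the requested period.
    """
    target = pd.Interval(
        pd.Timestamp(period[0]), pd.Timestamp(period[1]), closed="both"
    )
    covered = pd.Timedelta(0)
    for start, end in file_bounds:
        file_iv = pd.Interval(pd.Timestamp(start), pd.Timestamp(end), closed="both")
        if file_iv.overlaps(target):
            # Clip the file interval to the requested period.
            lo = max(file_iv.left, target.left)
            hi = min(file_iv.right, target.right)
            covered += hi - lo
    return covered / target.length


files = [("2000-01-01", "2004-12-31"), ("2005-01-01", "2009-12-31")]
full = coverage_fraction(files, ("2000-01-01", "2009-12-31"))
none = coverage_fraction(files, ("2020-01-01", "2029-12-31"))
print(full, none)
```

The one-day gap between the two files keeps the first result slightly below 1, so a real check would compare the fraction against a tolerance rather than requiring exactly 1.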
I also used this PR to remove driving_institution from the official columns, as discussed.
Does this PR introduce a breaking change?
The default output of date_parser has changed.
The default dtype of date_start and date_end has changed.
The driving_institution column has been removed.
Other information:
This required pinning pandas >= 2 and clisops >= 0.10. The latter pin allowed unpinning Python.