Open tomdstone opened 1 year ago
If you replace the now()
with today()
in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors
Just looking at this now. It all comes down to the fact that numpy arrays containing numpy.datetime64
cannot be wrapped as a Julia array yet. Therefore it's falling back on converting each element, and because numpy.datetime64
actually implements the buffer protocol with itemtype=1byte, that gets converted to a PyArray{UInt8}
as you're seeing.
So there's two things that could be fixed here:
datetime64
datetime64
to something more useful than PyArray{UInt8}
.(PS thanks for the report and the MWE)
Just stumbled across that problem and solved it this way:
function pytimestamp_to_datetime(t::PyArray)
DateTime(1970) + Second(reinterpret(Int64, t)[1] / 1000000000)
end
function pytimestamp_to_datetime(v::AbstractVector{<:PyArray})
pytimestamp_to_datetime.(v)
end
EDIT: This only holds for datetime64[ns], otherwise the factor is different. I just filed a tentative PR for a general solution
@cjdoris
If I eliminate .values
in line 61 of PyPandasDataFrame everything runs smoothly. Would that be an option or is there some performance issue or other issue that I am not aware of?
UPDATE: That doesn't work, columns will be vectors of Py objects
If you replace the
now()
withtoday()
in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors
@tomdstone Could you check if also your use case where you had difficulties with time is working with the PR in place? Otherwise, please share an MWE. I could try to cover your use case as well.
Just added support for timedelta and timedelta64 conversion to Dates.CompoundPeriod.
@hhaensel sorry, I won't have time to implement it due to how work is right now (we are in the middle of a move). I might follow up in a few months, but I definitely wouldn't hold your breath waiting for me. I appreciate the thought though!
Thanks for the reply. No need to hurry. I just filed the PR as I was in need of a solution for my use case.
@cjdoris I added automatic conversion from Periods/CompoundPeriods to numpy:timedelta64 as this supports nanoseconds as well. When I use pytimedeltatype, I can only go down to microseconds. Is that a good decision? It seems that pandas can convert the timedelta64 to a timestamp.timedelta that handles nanoseconds correctly. For conversion from timedelta to Julia I used Dates.CompoundPeriod in order to be type stable. Otherwise one could use Nanosecond, but that seemed a bit weired to me. I know you're working for version 1 to come out soon. I hope my proposal fits in there somehow.
Just want to share what's working now with the PR
julia> x = Py(Second(1))
Python: numpy.timedelta64(1,'s')
julia> pyconvert(Second, x)
1 second
julia> x = Py(Second(1) + Nanosecond(1))
Python: numpy.timedelta64(1000000001,'ns')
julia> y = pyconvert(Any, x)
1 second, 1 nanosecond
julia> typeof(y)
Dates.CompoundPeriod
julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100]);
julia> pdf = pytable(jdf)
Python:
x y
0 2023-07-04 11:29:27.781 0 days 00:00:01
1 2023-07-04 11:30:02.781 0 days 00:00:02
2 2023-07-04 11:40:17.781 0 days 00:00:03
3 2023-07-04 11:31:11.781 0 days 00:00:04
... 5 more lines ...
98 2023-07-04 11:37:21.781 0 days 00:01:39
99 2023-07-04 11:35:53.781 0 days 00:01:40
[100 rows x 2 columns]
julia> jdf2 = DataFrame(PyTable(pdf))
100×2 DataFrame
Row │ x y
│ DateTime Compound…
─────┼──────────────────────────────────────
1 │ 2023-07-04T11:29:27.781 1 second
2 │ 2023-07-04T11:30:02.781 2 seconds
3 │ 2023-07-04T11:40:17.781 3 seconds
⋮ │ ⋮ ⋮
99 │ 2023-07-04T11:37:21.781 99 seconds
100 │ 2023-07-04T11:35:53.781 100 seconds
95 rows omitted
This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.
@cjdoris I think this is still relevant
@cjdoris I think this is still relevant
I agree. Just stumbled on this when using pandas functionality to convert (legacy) xls data to xlsx. The dates became PyArray{UInt8}.
I'd be ready to adapt the my PR to the latest changes of PythonCall. Should I continue to work on it? Is the refactoring complete or should I still wait a bit?
Yep the refactor is done. Can you clarify what the PR will change?
I think I summarised everything quite nicely above (https://github.com/JuliaPy/PythonCall.jl/issues/293#issuecomment-1619897477)
The related PR is #334, where you commented that you are refactoring.
@cjdoris After refactoring I decided that it is easier to submit a new PR #509
When doing some work involving dataframes in python via PythonCall, it seems like
DataFrame(PyTable(p))
wherep
is a pandas data table converts the date and datetime columns into byte vectors. Is this issue related to the issue #265 with milliseconds vs microseconds, or due to a missing part of theDataFrame(::PyPandasDataFrame)
implementation?Here are a few minimal examples, in a conda environment with pandas.
This results in:
The same thing happens when initially defining
b
as a pandas dataframe, so the microsecond issue in #265 seems to not be the problem?