JuliaPy / PythonCall.jl

Python and Julia in harmony.
https://juliapy.github.io/PythonCall.jl/stable/
MIT License
736 stars 61 forks source link

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

Open tomdstone opened 1 year ago

tomdstone commented 1 year ago

When doing some work involving dataframes in python via PythonCall, it seems like DataFrame(PyTable(p)) where p is a pandas data table converts the date and datetime columns into byte vectors. Is this issue related to the issue #265 with milliseconds vs microseconds, or due to a missing part of the DataFrame(::PyPandasDataFrame) implementation?

Here are a few minimal examples, in a conda environment with pandas.

using PythonCall
using Dates
using DataFrames

a = DataFrame(x = [now()])  # julia dataframe
b = pytable(a)              # pandas dataframe
c = PyTable(b)              # PyPandasDataFrame
d = DataFrame(c)            # julia dataframe again

This results in:

julia> c
1×1 PyPandasDataFrame
                        x
0 2023-04-13 14:36:13.939

julia> d
1×1 DataFrame
 Row │ x
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc0, 0x62, 0x31, 0x8c, 0x…

The same thing happens when initially defining b as a pandas dataframe, so the microsecond issue in #265 seems to not be the problem?

julia> b = pd.DataFrame([[dt.datetime.now()]])
Python DataFrame:
                           0
0 2023-04-13 14:46:57.940077

julia> c = PyTable(b)
1×1 PyPandasDataFrame       
                           0
0 2023-04-13 14:46:57.940077

julia> d = DataFrame(c)
1×1 DataFrame
 Row │ 0
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc8, 0xf9, 0xa5, 0x7d, 0x…
tomdstone commented 1 year ago

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

cjdoris commented 1 year ago

Just looking at this now. It all comes down to the fact that numpy arrays containing numpy.datetime64 cannot be wrapped as a Julia array yet. Therefore it's falling back on converting each element, and because numpy.datetime64 actually implements the buffer protocol with itemtype=1byte, that gets converted to a PyArray{UInt8} as you're seeing.

So there's two things that could be fixed here:

cjdoris commented 1 year ago

(PS thanks for the report and the MWE)

hhaensel commented 1 year ago

Just stumbled across that problem and solved it this way:

function pytimestamp_to_datetime(t::PyArray)
    DateTime(1970) + Second(reinterpret(Int64, t)[1] / 1000000000)
end

function pytimestamp_to_datetime(v::AbstractVector{<:PyArray})
    pytimestamp_to_datetime.(v)
end

EDIT: This only holds for datetime64[ns], otherwise the factor is different. I just filed a tentative PR for a general solution

hhaensel commented 1 year ago

@cjdoris If I eliminate .values in line 61 of PyPandasDataFrame everything runs smoothly. Would that be an option or is there some performance issue or other issue that I am not aware of?

UPDATE: That doesn't work, columns will be vectors of Py objects

hhaensel commented 1 year ago

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

@tomdstone Could you check if also your use case where you had difficulties with time is working with the PR in place? Otherwise, please share an MWE. I could try to cover your use case as well.

hhaensel commented 1 year ago

Just added support for timedelta and timedelta64 conversion to Dates.CompoundPeriod.

tomdstone commented 1 year ago

@hhaensel sorry, I won't have time to implement it due to how work is right now (we are in the middle of a move). I might follow up in a few months, but I definitely wouldn't hold your breath waiting for me. I appreciate the thought though!

hhaensel commented 1 year ago

Thanks for the reply. No need to hurry. I just filed the PR as I was in need of a solution for my use case.

hhaensel commented 1 year ago

@cjdoris I added automatic conversion from Periods/CompoundPeriods to numpy:timedelta64 as this supports nanoseconds as well. When I use pytimedeltatype, I can only go down to microseconds. Is that a good decision? It seems that pandas can convert the timedelta64 to a timestamp.timedelta that handles nanoseconds correctly. For conversion from timedelta to Julia I used Dates.CompoundPeriod in order to be type stable. Otherwise one could use Nanosecond, but that seemed a bit weired to me. I know you're working for version 1 to come out soon. I hope my proposal fits in there somehow.

hhaensel commented 1 year ago

Just want to share what's working now with the PR

julia> x = Py(Second(1))
Python: numpy.timedelta64(1,'s')

julia> pyconvert(Second, x)
1 second

julia> x = Py(Second(1) + Nanosecond(1))
Python: numpy.timedelta64(1000000001,'ns')

julia> y = pyconvert(Any, x)
1 second, 1 nanosecond

julia> typeof(y)
Dates.CompoundPeriod

julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100]);
julia> pdf = pytable(jdf)
Python:
                         x               y
0  2023-07-04 11:29:27.781 0 days 00:00:01
1  2023-07-04 11:30:02.781 0 days 00:00:02
2  2023-07-04 11:40:17.781 0 days 00:00:03
3  2023-07-04 11:31:11.781 0 days 00:00:04
           ... 5 more lines ...
98 2023-07-04 11:37:21.781 0 days 00:01:39
99 2023-07-04 11:35:53.781 0 days 00:01:40

[100 rows x 2 columns]

julia> jdf2 = DataFrame(PyTable(pdf))
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Compound…   
─────┼──────────────────────────────────────
   1 │ 2023-07-04T11:29:27.781  1 second
   2 │ 2023-07-04T11:30:02.781  2 seconds
   3 │ 2023-07-04T11:40:17.781  3 seconds
  ⋮  │            ⋮                  ⋮
  99 │ 2023-07-04T11:37:21.781  99 seconds
 100 │ 2023-07-04T11:35:53.781  100 seconds
                             95 rows omitted
github-actions[bot] commented 11 months ago

This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.

hhaensel commented 11 months ago

@cjdoris I think this is still relevant

klwlevy commented 11 months ago

@cjdoris I think this is still relevant

I agree. Just stumbled on this when using pandas functionality to convert (legacy) xls data to xlsx. The dates became PyArray{UInt8}.

hhaensel commented 1 month ago

I'd be ready to adapt the my PR to the latest changes of PythonCall. Should I continue to work on it? Is the refactoring complete or should I still wait a bit?

cjdoris commented 1 month ago

Yep the refactor is done. Can you clarify what the PR will change?

hhaensel commented 1 month ago

I think I summarised everything quite nicely above (https://github.com/JuliaPy/PythonCall.jl/issues/293#issuecomment-1619897477)

The related PR is #334, where you commented that you are refactoring.

hhaensel commented 1 month ago

@cjdoris After refactoring I decided that it is easier to submit a new PR #509