JuliaPy / PythonCall.jl

Python and Julia in harmony.
https://juliapy.github.io/PythonCall.jl/stable/
MIT License
712 stars 61 forks

Pandas compatibility #501

Open MilesCranmer opened 1 month ago

MilesCranmer commented 1 month ago

Affects: PythonCall

Describe the bug

I have been trying to use pandas from PythonCall.jl and wanted to document a few calls that do not translate directly from Python to Julia. This might just mean we need a PythonPandas package to translate calls, but I wonder if there are any missing methods that could be implemented to fix things automatically.

First, the preamble for this:

using PythonCall

pd = pyimport("pandas")

Using syntax similar to Python's, I tried constructing a DataFrame from a Dict:

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

which results in the following dataframe:

julia> df
Python:
   0
0  b
1  a

i.e., it seems to have a single column named "0" and rows for a and b.

If I instead write this as a vector of pairs, I get:

julia> pd.DataFrame([
           "a" => [1, 2, 3],
           "b" => [4, 5, 6]
       ])
Python:
   0          1
0  a  [1, 2, 3]
1  b  [4, 5, 6]

I suppose this one makes sense.

I was able to get it working with the following syntax instead:

julia> df = pd.DataFrame([
            1   4
            2   5
            3   6
       ], columns=["a", "b"])
Python:
   a  b
0  1  4
1  2  5
2  3  6

So, selecting a single column works:

julia> df["a"]
Python:
0    1
1    2
2    3
Name: a, dtype: int64

but selecting multiple columns does not:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

I got around this by inserting a pylist call:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

mrkn commented 1 month ago

As described in the documentation, AbstractArray and AbstractDict values are implicitly converted to wrapper objects when they are passed to a Python call.

In the first case, you should use the pydict function to convert the Julia Dict to a Python dict.

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

As in the first case, the second case requires an explicit call to the pylist function.

MilesCranmer commented 1 month ago

Thanks, that makes sense! I didn’t see pydict.

So should this be closed or is there anything that can be done automatically?

cjdoris commented 1 month ago

The issue is that pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it's a juliacall.DictValue). The two options to make this work automatically are:

cjdoris commented 1 month ago

I think requiring pylist to do the indexing is a similar issue: pandas checks for list rather than the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
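
The two type checks described in these comments can be reproduced without pandas or juliacall. The sketch below is a minimal illustration in plain Python. FakeDictValue and FakeVectorValue are hypothetical stand-ins for juliacall.DictValue and juliacall.VectorValue; they are assumed only to implement the standard Mapping and Sequence interfaces and are not the real wrapper classes.

```python
from collections.abc import Mapping, Sequence


class FakeDictValue(Mapping):
    """Hypothetical stand-in for juliacall.DictValue: a mapping that is not a dict."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


class FakeVectorValue(Sequence):
    """Hypothetical stand-in for juliacall.VectorValue: a sequence that is not a list."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


columns = FakeDictValue({"a": [1, 2, 3], "b": [4, 5, 6]})
keys = FakeVectorValue(["a", "b"])

# Checks against the concrete built-in types reject the wrappers...
print(isinstance(columns, dict))       # False
print(isinstance(keys, list))          # False

# ...while checks against the abstract base classes accept them.
print(isinstance(columns, Mapping))    # True
print(isinstance(keys, Sequence))      # True

# Copying into built-in containers satisfies the stricter checks. This is
# assumed to be roughly what pydict and pylist achieve from the Julia side.
print(isinstance(dict(columns), dict))  # True
print(isinstance(list(keys), list))     # True
```

Any library that tests for dict or list specifically will treat such wrappers as opaque objects, which is consistent with the single "0" column and the "not callable" error reported earlier in the thread.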

MilesCranmer commented 1 month ago

I think the solutions on the pandas side sound like better options to me. I'm not sure if there are edge cases that prevent the checks from being more general... Like maybe some collections.abc.Sequence acting as a single key?
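
The edge case raised here does exist in plain Python: str, bytes, and tuple all count as collections.abc.Sequence, yet each is hashable and can serve as a single key. The sketch below shows this with the standard library only. looks_like_key_list is a hypothetical helper written for illustration, not pandas' actual logic.

```python
from collections.abc import Sequence

# All three are Sequences, but only the list is unambiguously "several keys".
for key in ["a", ("a", "b"), ["a", "b"]]:
    print(type(key).__name__, isinstance(key, Sequence))


def looks_like_key_list(key):
    """Hypothetical check: a Sequence that is not one of the built-in
    types commonly used as a single key (str, bytes, tuple)."""
    return isinstance(key, Sequence) and not isinstance(key, (str, bytes, tuple))


print(looks_like_key_list("a"))          # False: a string is one key
print(looks_like_key_list(("a", "b")))   # False: a tuple can be one key
print(looks_like_key_list(["a", "b"]))   # True: a list of keys
```

A broader check would therefore need explicit exclusions along these lines, which may be one reason a library would prefer testing for list directly.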

MilesCranmer commented 1 month ago

cross-posted here: https://github.com/pandas-dev/pandas/issues/58803