JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Column types seem to change after inner-join between two DataFrames #3442

Open KevinG1002 opened 1 month ago

KevinG1002 commented 1 month ago

Hello,

Thank you for putting this package together. It has helped a lot.

I am working with time series dataframes and I've noticed that when performing join operations on dataframes, the type associated with some of the columns seems to change.

Here's an example:

I have two dataframes, each with two columns. The first is a "date" column whose entries are Date values; the second is the value column of type Float64, which holds the values of the time series. In my example, I am looking to perform an inner join between quarterly GDP and quarterly metal usage, joining on the "date" column.

The inner-join statement I use is: X_df = innerjoin(metal_usage_df, global_gdp_df, on = :date)

The GDP dataframe looks like:

date (type Date)   GDP (type Float64)
2020-10-01         22024.5
2021-01-01         22600.2
2021-04-01         23292.4

and it is inner-joined with the metal-usage DF, which looks like:

date (type Date)   metal-usage (type Float64)
2020-10-01         222.6
2021-01-01         212.1
2021-04-01         239.5

However, when printing out the inner-join df that I get, the GDP column now has a different type:

date (type Date)   metal-usage (type Float64)   GDP (type Any)
2020-10-01         222.6                        22024.5
2021-01-01         212.1                        22600.2
2021-04-01         239.5                        23292.4

and this causes downstream issues for me. I was wondering what the root cause was for this and if there was a way for me to enforce column types during or before the inner-join operation?

Any help would be much appreciated!

bkamins commented 1 month ago

Could you share code that reproduces the problem? When I run your example on sample data, there are no such issues:

julia> metal_usage_df = DataFrame(date=1:3, metal=[1.5, 2.5, 3.5])
3×2 DataFrame
 Row │ date   metal
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     2      2.5
   3 │     3      3.5

julia> global_gdp_df = DataFrame(date=1:3, GDP=[21.5, 22.5, 23.5])
3×2 DataFrame
 Row │ date   GDP
     │ Int64  Float64
─────┼────────────────
   1 │     1     21.5
   2 │     2     22.5
   3 │     3     23.5

julia> X_df = innerjoin(metal_usage_df, global_gdp_df, on = :date)
3×3 DataFrame
 Row │ date   metal    GDP
     │ Int64  Float64  Float64
─────┼─────────────────────────
   1 │     1      1.5     21.5
   2 │     2      2.5     22.5
   3 │     3      3.5     23.5
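
On the question of enforcing column types before the join: the root cause is not established in this thread. One possibility, stated here purely as an assumption, is that the GDP column already had element type Any before the join (for example, because it was built from an untyped vector). The sketch below uses hypothetical data, with the column name `metal_usage` standing in for "metal-usage", to show how the element types could be inspected and narrowed before calling `innerjoin`.

```julia
using DataFrames, Dates

dates = Date.(["2020-10-01", "2021-01-01", "2021-04-01"])

# Hypothetical data modelled on the tables in the question.
metal_usage_df = DataFrame(date = dates, metal_usage = [222.6, 212.1, 239.5])

# ASSUMPTION: the GDP column is deliberately built as Vector{Any} to mimic
# a column whose element type was already Any before the join.
global_gdp_df = DataFrame(date = dates, GDP = Any[22024.5, 22600.2, 23292.4])

# Inspect the element type of every column before joining.
types_before = eltype.(eachcol(global_gdp_df))

# Narrow the column to the tightest element type its values allow ...
global_gdp_df.GDP = identity.(global_gdp_df.GDP)

# ... or request an explicit type (throws if a value cannot be converted).
global_gdp_df.GDP = convert(Vector{Float64}, global_gdp_df.GDP)

X_df = innerjoin(metal_usage_df, global_gdp_df, on = :date)

# Check the element type of the joined column.
joined_type = eltype(X_df.GDP)
```

If `types_before` shows Float64 for the GDP column on the real data, this explanation does not apply and a reproducible example would still be needed to find the cause.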