JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Formalize API for column vectors #2567

Open pdeffebach opened 3 years ago

pdeffebach commented 3 years ago

I feel like the question about data frames with distributed arrays comes up a lot. My impression is that we don't know for sure whether a Dagger array etc. can "just work" as a column in a DataFrame.

I think I might try to write a custom vector type, put it in a data frame, and see how many functions I can call on it before the column gets converted to a normal vector. Then we can assess to what extent DataFrames can support Dask-like operations just by changing the vector type.
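A minimal sketch of what such an experiment could look like. The TrackedVector type is hypothetical and exists only for illustration; it is not part of DataFrames.jl. The sketch assumes a DataFrames.jl 1.x release, where copycols=false asks the constructor to store the column without copying. The operations in the loop are arbitrary picks, and the printed types are what the experiment is meant to discover, so none are asserted here.

```julia
using DataFrames

# Hypothetical wrapper type, used only for this experiment; not part of DataFrames.jl.
struct TrackedVector{T} <: AbstractVector{T}
    data::Vector{T}
end

# Minimal AbstractArray interface: size, getindex, setindex!, IndexStyle.
Base.size(v::TrackedVector) = size(v.data)
Base.getindex(v::TrackedVector, i::Int) = v.data[i]
Base.setindex!(v::TrackedVector, x, i::Int) = (v.data[i] = x; v)
Base.IndexStyle(::Type{<:TrackedVector}) = IndexLinear()

v = TrackedVector([3, 1, 2])

# copycols=false asks DataFrame to store the vector as-is instead of copying it.
df = DataFrame(:x => v; copycols=false)
println(typeof(df.x))

# Base's generic `copy` falls back to `similar`, which returns a plain Array
# because TrackedVector does not define its own `similar` method.
println(typeof(copy(v)))

# Print the column type after a few common operations to see where the
# custom type survives and where it is replaced.
for (name, f) in [
        "row slice" => d -> d[1:2, :],
        "sort"      => d -> sort(d, :x),
        "filter"    => d -> filter(:x => >(1), d),
    ]
    println(name, " => ", typeof(f(df).x))
end
```

Defining similar for the wrapper would likely change several of these results, which is itself useful information about which interface methods matter.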

quinnj commented 3 years ago

This is a great idea; in particular, it would be great to document which functions/methods are expected to work along with how they're used in DataFrames in different operations. Happy to help with this effort.

bkamins commented 3 years ago

The first candidates to break are fast aggregations like combine(gdf, :x => sum). In general, whenever DataFrames.jl "internally" creates a column, it is likely to assume that it is a "standard" vector. Similarly, in many operations DataFrames.jl internally creates Vectors for processing data (see e.g. the GroupedDataFrame struct definition).
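For reference, a small sketch contrasting the two ways a grouped sum can be requested. It assumes DataFrames.jl 1.x, and that wrapping the reduction in an anonymous function bypasses the optimized aggregation path so that sum is called on each group's slice of the column. The column names g, x, and total are made up for the example.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 2], x = [10, 20, 30, 40])
gdf = groupby(df, :g)

# Passing `sum` directly lets DataFrames.jl use its optimized aggregation path.
fast = combine(gdf, :x => sum => :total)

# Wrapping the reduction in an anonymous function is assumed here to bypass
# that fast path, so `sum` is called on each group's slice of the column.
generic = combine(gdf, :x => (v -> sum(v)) => :total)

# Both routes should produce the same result on ordinary vectors.
println(fast == generic)
```

With a custom column type, comparing these two calls would be one way to tell whether a failure comes from the fast path or from the generic machinery.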

Having said that, I think it should be doable to add "distributed" support to DataFrames.jl in the long run. However, we would probably need some API that communicates to DataFrames.jl how the distribution is performed (if you have distributed vectors, you most likely want to process them in a way that takes this into account).
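To make the idea concrete, here is a purely hypothetical sketch of what such an API could look like. Neither column_chunks, ChunkedVector, nor chunked_sum exists in DataFrames.jl; they are invented for illustration, and ChunkedVector only imitates a distributed vector by storing its parts separately in local memory.

```julia
# Hypothetical trait: how a column exposes its partitions. Not a DataFrames.jl API.
# The fallback treats any ordinary vector as a single local chunk.
column_chunks(col::AbstractVector) = (col,)

# Hypothetical stand-in for a distributed vector: a list of separately stored parts.
struct ChunkedVector{T} <: AbstractVector{T}
    parts::Vector{Vector{T}}
end

Base.size(v::ChunkedVector) = (sum(length, v.parts; init=0),)

function Base.getindex(v::ChunkedVector, i::Int)
    @boundscheck checkbounds(v, i)
    # Walk the parts until the one containing global index i is found.
    for p in v.parts
        i <= length(p) && return p[i]
        i -= length(p)
    end
    throw(BoundsError(v, i))
end

column_chunks(v::ChunkedVector) = v.parts

# A reduction that works chunk by chunk instead of element by element.
chunked_sum(col) = sum(sum, column_chunks(col))

cv = ChunkedVector([[1, 2], [3, 4, 5]])
println(chunked_sum(cv))
println(chunked_sum([1, 2, 3]))
```

The design choice here is that ordinary vectors need no changes, since the fallback method covers them, while a distributed type opts in by adding one method.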

pdeffebach commented 3 years ago

Yeah I have no idea how distributed computing works, or threading for that matter. Still I will put this on the to-do list for winter break / procrastination from school.

nalimilan commented 3 years ago

Somewhat related is whether we preserve the container types of input columns: https://github.com/JuliaData/DataFrames.jl/issues/2569

I don't think DataFrames has very specific requirements for columns: apart from the issue of one-based indexing, which we should investigate if somebody cares, things should work as long as the AbstractArray interface is implemented. It probably won't be fast for distributed arrays, though, since we use for i in eachindex(col) loops a lot.
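A minimal sketch of the kind of loop described above, written only against the AbstractArray interface. The count_positive helper is hypothetical and not a DataFrames.jl function; it stands in for any column-wise operation that indexes one element at a time.

```julia
# Hypothetical helper, not a DataFrames.jl function: a column-wise operation
# written only against the AbstractArray interface.
function count_positive(col::AbstractVector)
    # Columns are assumed to be one-based, as noted in the comment above.
    firstindex(col) == 1 || throw(ArgumentError("column must use one-based indexing"))
    n = 0
    # One scalar `getindex` call per element: cheap for a local Vector, but
    # potentially a remote fetch per element for a distributed array.
    for i in eachindex(col)
        if col[i] > 0
            n += 1
        end
    end
    return n
end

println(count_positive([-1, 2, 3]))                 # Vector
println(count_positive(view([0, 5, -2, 7], 2:4)))   # SubArray
println(count_positive(1:4))                        # UnitRange
```

The same function accepts all three container types without changes, which is the sense in which things "should work"; the per-element access pattern is the part that would likely be slow for distributed storage.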