Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

large datasets - how to handle them? #141

Open nort3x opened 2 years ago

nort3x commented 2 years ago

I read the docs but couldn't really find anything clarifying this, although the title in the README file pretty much says "in-memory".

Suppose I have a 6 GB structured dataset. In pandas, for instance, I can specify chunksize; what is the equivalent idiomatic way in dataframe? (I see a readLines parameter in the read methods, but I'm not sure.)

koperagen commented 2 years ago

Hi! I don't think anybody has tried it before, so there is no idiomatic way to do it right now. What's the format of your dataset, CSV?

nort3x commented 2 years ago

part 1: description of the problem

In my case I'm using a Postgres database as my initial dataset. I first saved the result of a huge hierarchical query to disk, using json_agg and a recursive query to end up with a huge JSON file, which is written out with the COPY ... TO syntax.

I later even wrote this naive implementation, so that if I could find a way to stream the data I could parse it directly from a SQL ResultSet:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet

fun ResultSet.toDataFrame(): DataFrame<*> {
    // column names taken from the result set metadata
    val cols = with(metaData) {
        (1..columnCount).map { getColumnName(it) }
    }

    // materialize every row as a map of column name to value (SQL NULL becomes null)
    val rows = sequence {
        while (this@toDataFrame.next()) {
            yield(cols.associateWith { getObject(it) as Any? })
        }
    }.toList()

    // build one column per header from the collected rows
    return dataFrameOf(cols) { header ->
        rows.map { it[header] }
    }
}

// use case

val con: Connection = DriverManager.getConnection("jdbc:postgresql://blahblah/")
val df = con.createStatement().executeQuery("select * from huge").toDataFrame().also { println(it) }

If I use LIMIT and pagination I can produce a stream of dataframes, and with the same procedure I can also end up with the same kind of stream of dataframes from a file (see the sketch below).
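
A rough sketch of that pagination, assuming the toDataFrame() extension above; the table name, the ORDER BY column and the page size are placeholders, not a library API:

// Rough sketch, not a library API: paginate a large table with LIMIT/OFFSET and
// emit one DataFrame per page. Assumes the ResultSet.toDataFrame() extension above;
// the ORDER BY column ("id"), the table name and pageSize are placeholders.
import org.jetbrains.kotlinx.dataframe.DataFrame
import java.sql.Connection

fun Connection.readInPages(table: String, pageSize: Int = 100_000): Sequence<DataFrame<*>> = sequence {
    var offset = 0L
    while (true) {
        val page = createStatement()
            .executeQuery("SELECT * FROM $table ORDER BY id LIMIT $pageSize OFFSET $offset")
            .toDataFrame()
        if (page.rowsCount() == 0) break
        yield(page)
        offset += pageSize
    }
}

// usage: each page is a small DataFrame, so only one page is in memory at a time
// con.readInPages("huge").forEach { page -> println(page.rowsCount()) }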

part 2: so what's the problem?

For instance, with groupBy or any aggregate in mind: wouldn't processing multiple dataframes separately result in duplicates, since rows belonging to the same group can end up in different chunks?

So basically I'm investigating whether we can somehow use DataFrame for larger datasets.

koperagen commented 2 years ago

Hello again. I've found an interesting article on groupBy / aggregate for chunked dataframes in pandas: https://maxhalford.github.io/blog/pandas-streaming-groupby/. The general idea is that if you need a grouping by some characteristic, sorting the dataset by that characteristic lets you process the data in chunks and then merge the results. Let me know if you think it will work for you.
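
For decomposable aggregations like sums and counts, the merge step could look roughly like the sketch below; it only keeps running totals in memory. The "key" / "value" column names and the chunk source are placeholders, not a dataframe API. (The sorting trick from the article matters for aggregations that don't merge this easily, because sorting confines each group to adjacent chunks, so it can be finished as soon as the key changes.)

// Rough sketch, not a dataframe API: compute a per-key sum over a stream of chunks by
// keeping only running totals in memory and folding each chunk's rows into them.
// "key" and "value" are placeholder column names; chunks could come from readInPages above.
import org.jetbrains.kotlinx.dataframe.DataFrame

fun sumByKey(chunks: Sequence<DataFrame<*>>): Map<Any?, Double> {
    val totals = mutableMapOf<Any?, Double>()
    for (chunk in chunks) {
        chunk.rows().forEach { row ->
            val key = row["key"]
            val value = (row["value"] as Number).toDouble()
            totals[key] = (totals[key] ?: 0.0) + value
        }
    }
    return totals
}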

corlaez commented 1 year ago

I have the same concern about large datasets, but a different scenario and desired solution.

I am not that interested in manipulating the dataframe as a whole, but rather in reading row by row (step by step) in a lazy fashion (again, because my data lives in huge files, say 8 GB or 20 GB). The formats I am interested in are XML and CSV.

The solution would be something like a lazy dataframe that lets me read row by row without loading the whole file into memory.
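
As far as I know there is no such lazy dataframe yet, but a rough workaround is to stream the file yourself and materialize only one small DataFrame chunk at a time. A sketch with deliberately naive CSV parsing (header row, no quoted commas); processCsvInChunks, processChunk and chunkSize are made-up names, not a dataframe API:

// Rough workaround sketch, not a dataframe API: stream a large CSV and materialize only
// one small DataFrame chunk at a time. Assumes a plain comma-separated file with a header
// row and no quoted/escaped commas; processChunk and chunkSize are placeholders.
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import java.io.File

fun processCsvInChunks(file: File, chunkSize: Int = 50_000, processChunk: (DataFrame<*>) -> Unit) {
    file.useLines { lines ->
        val iterator = lines.iterator()
        if (!iterator.hasNext()) return@useLines
        val header = iterator.next().split(",")
        iterator.asSequence()
            .map { it.split(",") }
            .chunked(chunkSize)                 // lazily group rows into fixed-size chunks
            .forEach { rows ->
                // build a small DataFrame for this chunk only
                val chunk = dataFrameOf(header) { col ->
                    val idx = header.indexOf(col)
                    rows.map { it.getOrNull(idx) }
                }
                processChunk(chunk)
            }
    }
}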

nort3x commented 1 year ago

@corlaez Some operations by nature require a traversal of the whole dataset (like transformations, count, etc.), and some operations even require auxiliary storage (like grouping, distinct, ...). The answer I've found is that dataframe is for research and experimentation on a small subset of the data; for anything beyond that we should use a tool made and optimized to be used as a pipeline on big data (like Spark or Hadoop; I'm not qualified to give advice on this). As a result, I like dataframe as it is and I don't expect it to grow complex enough to support such use cases, although it would be very beneficial if the dataframe team decided to publish extra modules to interoperate with those solutions for a seamless experience.

zaleslaw commented 2 months ago

It was also asked here on Reddit: https://www.reddit.com/r/Kotlin/s/ctNLZEq2gI

Also, we should provide a documentation page for users describing the different options and ways to handle large CSV files and what can be fine-tuned.

A benchmark on some synthetic CSV is also worth adding.