Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
821 stars 58 forks

Alternative CSV reader #589

Open Jolanrensen opened 7 months ago

Jolanrensen commented 7 months ago

should be investigated: https://github.com/doyaaaaaken/kotlin-csv

koperagen commented 7 months ago

I tried FastCSV and want to use it on the JVM: its performance is several times better than the existing reader, and it beats pandas too. I assume you aim for KMP, so it's a different thing. Just a note to keep in mind.

devcrocod commented 7 months ago

Keep in mind that you can always write your own interface and hide the platform implementation later
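
As a rough illustration of that idea, here is a minimal sketch. All names in it (`CsvRowReader`, `NaiveCsvRowReader`, `defaultCsvRowReader`) are hypothetical and not part of DataFrame; the naive splitter is only a placeholder for a real backend such as FastCSV on the JVM.

```kotlin
// Hypothetical abstraction: callers depend only on this interface,
// so the parsing backend can differ per platform.
fun interface CsvRowReader {
    fun readRows(text: String): List<List<String>>
}

// Placeholder implementation using only the Kotlin stdlib.
// It does NOT handle quoting or escaping; a JVM build could swap in
// a FastCSV- or Deephaven-backed implementation behind the same interface.
class NaiveCsvRowReader(private val delimiter: Char = ',') : CsvRowReader {
    override fun readRows(text: String): List<List<String>> =
        text.lineSequence()
            .filter { it.isNotEmpty() }   // skip blank lines, e.g. after a trailing newline
            .map { it.split(delimiter) }
            .toList()
}

// Single entry point: the concrete class never leaks to callers.
fun defaultCsvRowReader(): CsvRowReader = NaiveCsvRowReader()
```

Calling `defaultCsvRowReader().readRows("a,b\n1,2\n")` returns two rows of two cells each. Replacing the body of `defaultCsvRowReader()` is then the only change needed to switch backends.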

Jolanrensen commented 1 month ago

I've been experimenting with different implementations to find the fastest one in combination with DataFrame.

Each test has two versions of the implementation:

We test:

Small CSV: 65.4 kB (ops/s: Higher score is better) image

(s/op: Lower score is better) image

Large CSV: 857.7 MB (ops/s: Higher score is better) image

(s/op: Lower score is better) image

Jolanrensen commented 1 month ago

I now added Deephaven-csv:

(s/op: Lower is better)

Benchmark                                    Mode  Cnt   Score    Error  Units
CsvBenchmark.apacheCsvReader                   ss   10   0.007 ±  0.003   s/op
CsvBenchmark.apacheCsvReaderSequential         ss   10   0.008 ±  0.003   s/op
CsvBenchmark.deephavenCsvReader                ss   10   0.009 ±  0.011   s/op
CsvBenchmark.fastCsvReader                     ss   10   0.004 ±  0.001   s/op
CsvBenchmark.fastCsvReaderSequential           ss   10   0.004 ±  0.002   s/op
CsvBenchmark.kotlinCsvReader                   ss   10   0.008 ±  0.001   s/op
CsvBenchmark.kotlinCsvReaderSequential         ss   10   0.007 ±  0.001   s/op
LargeCsvBenchmark.apacheCsvReader              ss    5  72.809 ± 16.879   s/op
LargeCsvBenchmark.apacheCsvReaderSequential    ss    5  46.433 ± 39.409   s/op
LargeCsvBenchmark.deephavenCsvReader           ss    5  16.640 ±  6.664   s/op
LargeCsvBenchmark.fastCsvReader                ss    5  59.848 ± 22.986   s/op
LargeCsvBenchmark.fastCsvReaderSequential      ss    5  40.747 ±  4.598   s/op
LargeCsvBenchmark.kotlinCsvReader              ss    5  80.383 ± 15.870   s/op
LargeCsvBenchmark.kotlinCsvReaderSequential    ss    5  68.547 ± 20.748   s/op
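
To put the large-file scores in perspective, the sketch below converts them to throughput and relative speed. The scores and the 857.7 MB file size are copied from this thread; the helper names are made up for illustration. The error margins are ignored, and with only 5 single-shot samples per reader they are large, so the ratios are rough indications at best.

```kotlin
// Size of the large CSV as stated in this thread.
const val LARGE_CSV_MB = 857.7

// Mean s/op scores copied from the LargeCsvBenchmark rows above (error margins ignored).
val largeCsvSeconds = mapOf(
    "apacheCsvReader" to 72.809,
    "apacheCsvReaderSequential" to 46.433,
    "deephavenCsvReader" to 16.640,
    "fastCsvReader" to 59.848,
    "fastCsvReaderSequential" to 40.747,
    "kotlinCsvReader" to 80.383,
    "kotlinCsvReaderSequential" to 68.547,
)

// Throughput in MB/s for a given mean time per read.
fun throughputMbPerSec(secondsPerOp: Double): Double = LARGE_CSV_MB / secondsPerOp

// How many times faster `candidate` is than `baseline`, by ratio of mean scores.
fun speedup(baseline: String, candidate: String): Double =
    largeCsvSeconds.getValue(baseline) / largeCsvSeconds.getValue(candidate)

// Prints one summary line per reader, fastest first.
fun printLargeCsvSummary() {
    largeCsvSeconds.entries.sortedBy { it.value }.forEach { (name, seconds) ->
        println("$name: ${"%.1f".format(throughputMbPerSec(seconds))} MB/s")
    }
}
```

By these means, `speedup("apacheCsvReader", "deephavenCsvReader")` comes out at roughly 4.4, and the Deephaven reader processes a little over 50 MB/s.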

Note: The deephaven integration might not be optimal yet:

Jolanrensen commented 2 weeks ago

Combining Deephaven with https://github.com/Kotlin/dataframe/pull/712 is very promising. Reading the large CSV on the ColumnDataHolder branch with a properly set-up Deephaven reader yields the following results: image image

Doing the same on the master branch yields: image image

Both in terms of memory and performance, there's something to gain from using Deephaven and primitive arrays, at least when it comes to reading CSVs :)
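
For context on why primitive arrays can help: on the JVM, a `List<Int?>` holds references to boxed `Integer` objects, while an `IntArray` stores the values contiguously. The sketch below shows one possible way to keep that layout and still represent nulls, using a separate mask. It is an assumption for illustration only (`IntColumnSketch` is a made-up name) and does not reflect how the ColumnDataHolder branch actually works.

```kotlin
// Hypothetical column holder: values live in a primitive IntArray,
// and a parallel BooleanArray marks which positions are null.
class IntColumnSketch private constructor(
    private val values: IntArray,
    private val isNull: BooleanArray,
) {
    val size: Int get() = values.size

    // Boxing happens only at read time, and only for the element requested.
    operator fun get(index: Int): Int? = if (isNull[index]) null else values[index]

    companion object {
        // Builds the primitive storage from a nullable list; null slots store 0 as filler.
        fun of(data: List<Int?>): IntColumnSketch {
            val values = IntArray(data.size) { data[it] ?: 0 }
            val mask = BooleanArray(data.size) { data[it] == null }
            return IntColumnSketch(values, mask)
        }
    }
}
```

With `val column = IntColumnSketch.of(listOf(1, null, 3))`, `column[1]` is `null` and `column[2]` is `3`, while the backing storage stays a plain `IntArray`.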

Jolanrensen commented 1 week ago

Deephaven with normal arraylists (that support nulls this time) and new parsers:

image