Open Jolanrensen opened 7 months ago
I tried FastCSV and would like to use it on the JVM for its performance, which is several times better than the existing one and beats pandas too. I assume you are aiming for KMP, so it's a different thing. Just a note to keep in mind.
Keep in mind that you can always write your own interface and hide the platform implementation later
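A minimal sketch of that idea, written in Java with only the standard library. All names here (`CsvRowSource`, `NaiveCsvRowSource`, `CsvFacade`) are hypothetical and not part of DataFrame or any CSV library; the naive comma-splitting backend is only a placeholder for where a real backend such as FastCSV or Deephaven-csv could be plugged in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical facade: callers depend on this interface only.
interface CsvRowSource {
    void forEachRow(Reader input, Consumer<List<String>> onRow) throws IOException;
}

// Placeholder backend (assumption): naive comma split, no quoting or escaping.
// A JVM-only or multiplatform backend could replace it without touching callers.
final class NaiveCsvRowSource implements CsvRowSource {
    @Override
    public void forEachRow(Reader input, Consumer<List<String>> onRow) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) continue; // skip blank lines
            // limit -1 keeps trailing empty cells such as "3," -> ["3", ""]
            onRow.accept(Arrays.asList(line.split(",", -1)));
        }
    }
}

final class CsvFacade {
    // The single place where the backend is chosen.
    static CsvRowSource defaultSource() {
        return new NaiveCsvRowSource();
    }

    // Convenience helper: reads every row of an in-memory CSV string.
    static List<List<String>> readAll(String csv) {
        List<List<String>> rows = new ArrayList<>();
        try {
            defaultSource().forEachRow(new StringReader(csv), rows::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Swapping the backend then only means changing `defaultSource()`; code that calls `CsvFacade.readAll` or `forEachRow` stays the same.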
I've been experimenting with different implementations to find the fastest one in combination with DataFrame.
Each test has two versions of the implementation:
List<SomeCsvRowClass>
, saving memory in the long run :)
We test:
Small CSV: 65.4 kB (ops/s: Higher score is better)
(s/op: Lower score is better)
Large CSV: 857.7 MB (ops/s: Higher score is better)
(s/op: Lower score is better)
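The description of the two versions per implementation is incomplete above, so the following is only a guess at the distinction, based on the `...Reader` and `...ReaderSequential` benchmark names: one variant first collects every row into a list (the `List<SomeCsvRowClass>` mentioned above), the other appends each row to the columns as it arrives and never keeps the row list. The class and method names are hypothetical, and `Iterator<String[]>` stands in for whatever row type a given CSV library produces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class TwoReadStrategies {
    // Stand-in for SomeCsvRowClass: one fully parsed row.
    static final class Row {
        final String[] cells;
        Row(String[] cells) { this.cells = cells; }
    }

    // Variant 1 (assumed): materialize all rows, then transpose into columns.
    // Rows and columns are both alive at the same time during the transpose.
    static List<List<String>> viaRowList(Iterator<String[]> parsedRows, int columnCount) {
        List<Row> rows = new ArrayList<>();
        while (parsedRows.hasNext()) {
            rows.add(new Row(parsedRows.next()));
        }
        List<List<String>> columns = emptyColumns(columnCount);
        for (Row row : rows) {
            for (int c = 0; c < columnCount; c++) {
                columns.get(c).add(row.cells[c]);
            }
        }
        return columns;
    }

    // Variant 2 (assumed): append each row's cells to the columns as it is read.
    // No intermediate row list is kept, so each row can be collected right away.
    static List<List<String>> sequential(Iterator<String[]> parsedRows, int columnCount) {
        List<List<String>> columns = emptyColumns(columnCount);
        while (parsedRows.hasNext()) {
            String[] cells = parsedRows.next();
            for (int c = 0; c < columnCount; c++) {
                columns.get(c).add(cells[c]);
            }
        }
        return columns;
    }

    private static List<List<String>> emptyColumns(int columnCount) {
        List<List<String>> columns = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            columns.add(new ArrayList<>());
        }
        return columns;
    }
}
```

Both variants produce the same columns; the difference is only in peak memory, which is where the "saving memory in the long run" remark would apply.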
I now added Deephaven-csv:
(s/op: Lower is better)
Benchmark                                    Mode  Cnt   Score    Error  Units
CsvBenchmark.apacheCsvReader                 ss     10   0.007 ±  0.003  s/op
CsvBenchmark.apacheCsvReaderSequential       ss     10   0.008 ±  0.003  s/op
CsvBenchmark.deephavenCsvReader              ss     10   0.009 ±  0.011  s/op
CsvBenchmark.fastCsvReader                   ss     10   0.004 ±  0.001  s/op
CsvBenchmark.fastCsvReaderSequential         ss     10   0.004 ±  0.002  s/op
CsvBenchmark.kotlinCsvReader                 ss     10   0.008 ±  0.001  s/op
CsvBenchmark.kotlinCsvReaderSequential       ss     10   0.007 ±  0.001  s/op
LargeCsvBenchmark.apacheCsvReader            ss      5  72.809 ± 16.879  s/op
LargeCsvBenchmark.apacheCsvReaderSequential  ss      5  46.433 ± 39.409  s/op
LargeCsvBenchmark.deephavenCsvReader         ss      5  16.640 ±  6.664  s/op
LargeCsvBenchmark.fastCsvReader              ss      5  59.848 ± 22.986  s/op
LargeCsvBenchmark.fastCsvReaderSequential    ss      5  40.747 ±  4.598  s/op
LargeCsvBenchmark.kotlinCsvReader            ss      5  80.383 ± 15.870  s/op
LargeCsvBenchmark.kotlinCsvReaderSequential  ss      5  68.547 ± 20.748  s/op
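To make the large-CSV rows easier to compare, here is a small sketch that expresses each score relative to the Deephaven result and checks whether the reported intervals overlap. The numbers are copied from the table above; treating the Error column as a symmetric half-width around the score is an assumption of this sketch, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LargeCsvScores {
    // One benchmark result in s/op: score plus/minus error.
    static final class Score {
        final double mean;
        final double error;
        Score(double mean, double error) { this.mean = mean; this.error = error; }
        double low() { return mean - error; }
        double high() { return mean + error; }
        // True when the two [low, high] intervals share at least one point.
        boolean overlaps(Score other) {
            return low() <= other.high() && other.low() <= high();
        }
    }

    // LargeCsvBenchmark rows, copied from the table above.
    static final Map<String, Score> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("apacheCsvReader", new Score(72.809, 16.879));
        TABLE.put("apacheCsvReaderSequential", new Score(46.433, 39.409));
        TABLE.put("deephavenCsvReader", new Score(16.640, 6.664));
        TABLE.put("fastCsvReader", new Score(59.848, 22.986));
        TABLE.put("fastCsvReaderSequential", new Score(40.747, 4.598));
        TABLE.put("kotlinCsvReader", new Score(80.383, 15.870));
        TABLE.put("kotlinCsvReaderSequential", new Score(68.547, 20.748));
    }

    // How many times slower `name` is than `baseline`, by mean score.
    static double slowdownVs(String name, String baseline) {
        return TABLE.get(name).mean / TABLE.get(baseline).mean;
    }

    public static void main(String[] args) {
        Score base = TABLE.get("deephavenCsvReader");
        for (Map.Entry<String, Score> e : TABLE.entrySet()) {
            System.out.printf("%-28s %5.2fx  overlapsDeephaven=%b%n",
                    e.getKey(), e.getValue().mean / base.mean, e.getValue().overlaps(base));
        }
    }
}
```

By mean score, the next-fastest reader (`fastCsvReaderSequential`) comes out at roughly 2.4 times the Deephaven time. Under the interval assumption, only `apacheCsvReaderSequential` overlaps with Deephaven, and that is because its error (± 39.409) is almost as large as its score, so that row says little either way.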
Note: The Deephaven integration might not be optimal yet:
Combining Deephaven with https://github.com/Kotlin/dataframe/pull/712 is very promising. Reading the large CSV on the ColumnDataHolder branch with Deephaven reading set up properly yields the following results:
Doing the same on the master branch yields:
Both in terms of memory and performance, there's something to gain from using Deephaven and primitive arrays, at least when it comes to reading CSVs :)
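As a rough illustration of why primitive arrays can help, here is a sketch of a growable `double` column that stores values in a `double[]` and tracks nulls in a separate `BitSet`. This is not the ColumnDataHolder design from the linked PR, whose internals are not shown here; `DoubleColumnBuilder` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical column builder: primitive storage plus a null mask.
final class DoubleColumnBuilder {
    private double[] values = new double[16];
    private final BitSet nulls = new BitSet();
    private int size = 0;

    void add(double value) {
        ensureCapacity();
        values[size++] = value;
    }

    void addNull() {
        ensureCapacity();
        nulls.set(size);      // remember that this slot holds no value
        values[size++] = 0.0; // placeholder, never read as real data
    }

    private void ensureCapacity() {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // amortized doubling
        }
    }

    int size() { return size; }

    boolean isNull(int index) {
        checkIndex(index);
        return nulls.get(index);
    }

    // Callers are expected to check isNull(index) first.
    double get(int index) {
        checkIndex(index);
        return values[index];
    }

    // Compact primitive copy; null slots contain the 0.0 placeholder.
    double[] toArray() {
        return Arrays.copyOf(values, size);
    }

    // Boxed equivalent: every non-null element becomes its own Double object.
    List<Double> toBoxedList() {
        List<Double> boxed = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            boxed.add(nulls.get(i) ? null : Double.valueOf(values[i]));
        }
        return boxed;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

The primitive array holds the values inline, while the boxed list holds a reference per element plus a separate heap object for each non-null value, which is the kind of overhead the memory comparison above is about.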
Deephaven with normal arraylists (that support nulls this time) and new parsers:
kotlin-csv should be investigated: https://github.com/doyaaaaaken/kotlin-csv