Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
Reading CSV with custom nullString impossible #921

Open Jolanrensen opened 1 month ago

Jolanrensen commented 1 month ago

Umbrella'd under https://github.com/Kotlin/dataframe/issues/827

Reported on slack: https://kotlinlang.slack.com/archives/C4W52CFEZ/p1728885330465379

The CSV https://kotlinlang.slack.com/files/U16CM33AB/F07R98VJ7AT/msleep.csv contains several columns of Double values and "NA"s, representing null. This causes some curious cases:

| Expected | Actual |
| --- | --- |
| DataFrame.readCSV() should recognise that "NA" means null and parse the column as Double?. | The column brainwt is parsed as BigDecimal because "3e-04" is not recognised as a Double and "NA" is not handled well. |
| DataFrame.readCSV() with "NA" in nullStrings should help it recognise "NA" as null. | "NA" is recognised as null, but the result is still BigDecimal?. |
| "NA" in nullStrings together with colTypes = "brainwt" to ColType.Double should work for sure. | "java.lang.IllegalStateException: Couldn't parse 'NA' into type kotlin.Double". Apparently, giving a colType grabs the Double parser directly and does not take nullStrings into account. On top of that, a null result is assumed to mean that parsing failed. The workaround is to give ColType.String and call parse or convert manually afterwards. |
| parse() and convert().toDouble() should behave the same. | parse() uses NumberFormat with a locale and doesn't recognise "3e-04". convert() uses Double.parseDouble() without a locale and can parse it. |
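
The behaviour expected in the colTypes case above can be sketched in plain Kotlin. This is only an illustration of the intended semantics, not the library's implementation: `parseNullableDouble` is a hypothetical helper, and the sample values are made up apart from "NA" and "3e-04". The point is that a null string is checked first and yields a legitimate null, while only a genuinely unparsable value counts as a failure.

```kotlin
// Hypothetical helper, not part of the DataFrame API.
// Null strings are handled before the Double parser runs, so a null result
// coming from "NA" is not mistaken for a parsing failure.
fun parseNullableDouble(raw: String, nullStrings: Set<String> = setOf("NA")): Double? {
    if (raw in nullStrings) return null
    return raw.toDoubleOrNull()
        ?: error("Couldn't parse '$raw' into type kotlin.Double")
}

fun main() {
    // Illustrative column values; only "NA" and "3e-04" come from the report.
    val brainwt = listOf("NA", "0.0155", "3e-04")
    println(brainwt.map { parseNullableDouble(it) }) // prints [null, 0.0155, 3.0E-4]
}
```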

Most of the issues here are solved by the new CSV implementation under the umbrella issue: https://github.com/Kotlin/dataframe/issues/827. The case for "3e-04" requires a different Double parser, which is solved by https://github.com/Kotlin/dataframe/pull/935.
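
The difference between the two Double parsers can be explored with the JDK alone. The sketch below assumes that a locale-aware parser accepts a value only when NumberFormat consumes the whole string; `parseWithNumberFormat` is a hypothetical helper written for this comparison, not the library's code. To my understanding, NumberFormat expects an uppercase exponent separator by default, so it is likely to stop after the "3", but that may depend on the JDK version, so no output is asserted for it here.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: accept the NumberFormat result only if the entire
// input was consumed, otherwise treat the value as unparsable.
fun parseWithNumberFormat(raw: String, locale: Locale = Locale.US): Double? {
    val position = ParsePosition(0)
    val number = NumberFormat.getInstance(locale).parse(raw, position)
    return if (number != null && position.index == raw.length) number.toDouble() else null
}

fun main() {
    val raw = "3e-04"
    // Locale-aware parsing; the result depends on how the JDK treats a lowercase exponent.
    println(parseWithNumberFormat(raw))
    // String.toDouble() delegates to Double.parseDouble() and accepts the lowercase exponent.
    println(raw.toDouble()) // prints 3.0E-4
}
```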