DataFrame.readCSV() should be able to recognise that "NA" means null and parse the column as Double?
The column brainwt is parsed as BigDecimal because it doesn't recognize "3e-04" as a Double and doesn't handle "NA" well.
DataFrame.readCSV("NA" in nullStrings) should help it recognize "NA" as null.
It recognizes "NA" as null, but the result is still BigDecimal?
"NA" in nullStrings and colTypes = "brainwt" to ColType.Double should work for sure
Fails with "java.lang.IllegalStateException: Couldn't parse 'NA' into type kotlin.Double". Apparently, giving a colType grabs the Double parser directly and does not take nullStrings into account. Plus, if the result is null, it's assumed the parsing failed. We need to give ColType.String and call parse or convert manually afterwards.
parse() and convert().toDouble() should behave the same
parse() uses NumberFormat with a locale and doesn't recognize "3e-04". convert() uses Double.parseDouble() without a locale and can parse it.
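The difference between the two parsing routes can be reproduced with the JDK alone. The sketch below is not the library's code: parseWithNumberFormat and parseOrNull are hypothetical helpers, and the full-consumption check is an assumption about how a strict locale-aware parser would typically be written. It only illustrates why a NumberFormat-based parser can reject "3e-04" while a Double.parseDouble()-style parser accepts it, and how a null string such as "NA" can be handled before parsing.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: locale-aware parsing that rejects partially consumed input.
fun parseWithNumberFormat(text: String, locale: Locale = Locale.US): Double? {
    val position = ParsePosition(0)
    val number = NumberFormat.getInstance(locale).parse(text, position)
    // Accept the result only if the whole string was consumed.
    return if (number == null || position.index != text.length) null else number.toDouble()
}

// Hypothetical helper: locale-independent parsing that maps null strings to null first.
fun parseOrNull(text: String, nullStrings: Set<String> = setOf("NA")): Double? =
    if (text in nullStrings) null else text.toDoubleOrNull()

fun main() {
    // Per the report, the lowercase 'e' is not consumed as an exponent, so this is rejected.
    println(parseWithNumberFormat("3e-04"))
    // toDoubleOrNull() accepts the same notation as Double.parseDouble().
    println(parseOrNull("3e-04"))
    // "NA" is treated as a missing value, not as a parse failure.
    println(parseOrNull("NA"))
}
```

Note that parseOrNull returns null both for "NA" and for unparseable input; a real reader would need to keep those two cases apart, which is exactly the confusion described for colTypes above.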
Umbrella'd under https://github.com/Kotlin/dataframe/issues/827
Reported on slack: https://kotlinlang.slack.com/archives/C4W52CFEZ/p1728885330465379
The CSV https://kotlinlang.slack.com/files/U16CM33AB/F07R98VJ7AT/msleep.csv contains several columns of Double values and "NA"s, representing null. This causes the curious cases listed above.

Most of the issues here are solved by the new CSV implementation under the umbrella issue: https://github.com/Kotlin/dataframe/issues/827. The case for "3e-04" requires a different Double parser, which is solved by https://github.com/Kotlin/dataframe/pull/935.