Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Casting strings to double using `with { it.toDouble()}` and `toDouble()` gives different results #568

Open devcrocod opened 10 months ago

devcrocod commented 10 months ago

Reproduce

  1. Take the ramen dataset: https://www.kaggle.com/code/sujan97/complete-analysis-of-ramen-ratings/input
  2. Read it into a DataFrame:
    val df = DataFrame.readCSV("ramen-ratings.csv").renameToCamelCase()
  3. Convert the stars column to a double type:
    df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble()

Expected

df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() }

result:

   review#          brand                                  variety style country stars topTen
 0    2580      New Touch                 T's Restaurant Tantanmen   Cup   Japan  3,75   null
 1    2579       Just Way Noodles Spicy Hot Sesame Spicy Hot Se...  Pack  Taiwan  1,00   null
 2    2578         Nissin            Cup Noodles Chicken Vegetable   Cup     USA  2,25   null
 3    2577        Wei Lih            GGE Ramen Snack Tomato Flavor  Pack  Taiwan  2,75   null
 4    2576 Ching's Secret                          Singapore Curry  Pack   India  3,75   null

Actual

   review#          brand                                  variety style country stars topTen
 0    2580      New Touch                 T's Restaurant Tantanmen   Cup   Japan 375,0   null
 1    2579       Just Way Noodles Spicy Hot Sesame Spicy Hot Se...  Pack  Taiwan   1,0   null
 2    2578         Nissin            Cup Noodles Chicken Vegetable   Cup     USA 225,0   null
 3    2577        Wei Lih            GGE Ramen Snack Tomato Flavor  Pack  Taiwan 275,0   null
 4    2576 Ching's Secret                          Singapore Curry  Pack   India 375,0   null

Version and Environment

Name: kotlin-jupyter-kernel, Version: 0.11.0.385

dataframe version: 0.12.1

zaleslaw commented 10 months ago

Thanks @devcrocod

Jolanrensen commented 10 months ago

I'm sorry, I cannot reproduce it directly. It returns the same result for me.

It might be a locale thing (as I see your Doubles have "," instead of "."). Convert relies on parse to parse Strings. It defaults to your system locale, so on your machine it interprets "," as the decimal separator and "." as the thousands separator.

This may be different from the default String.toDouble() function from the stdlib you call the other time. I feel like this is intended behavior, though a bit unfortunate in this example.
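To make the difference concrete, here is a minimal sketch that uses only the JDK's `java.text.NumberFormat`. It is not the DataFrame parser itself, only an illustration of how a locale-aware parser can read the same string differently from the stdlib `String.toDouble()`. The helper `parseWithLocale` is a hypothetical name, and `Locale.GERMANY` is picked only as an example of a locale that uses "," for decimals; the reporter's actual locale is not stated in this thread.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: parse a numeric string with the JDK's locale-aware NumberFormat.
fun parseWithLocale(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

fun main() {
    val raw = "3.75"

    // Kotlin stdlib: locale-independent, "." is always the decimal separator.
    println(raw.toDouble())                        // 3.75

    // US locale: "." is the decimal separator, so the value is unchanged.
    println(parseWithLocale(raw, Locale.US))       // 3.75

    // German locale: "." is the grouping separator and is skipped while parsing.
    println(parseWithLocale(raw, Locale.GERMANY))  // 375.0
}
```

The last line matches the shape of the "Actual" output above, where 3.75 turned into 375.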

Since you're trying to parse a String I'd recommend using parse as you can define extra ParserOptions, such as a Locale.
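A rough sketch of what that could look like, assuming the `parse(vararg columns, options = ...)` overload and the `ParserOptions(locale = ...)` parameter from the DataFrame documentation are available in your version. The three-row frame is made-up sample data standing in for the filtered ramen dataset, so treat this as an illustration rather than a verified fix for this issue.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.ParserOptions
import org.jetbrains.kotlinx.dataframe.api.dataFrameOf
import org.jetbrains.kotlinx.dataframe.api.parse
import org.jetbrains.kotlinx.dataframe.api.print
import java.util.Locale

fun main() {
    // Made-up sample values in the same "." decimal format as the CSV.
    val df = dataFrameOf("stars")("3.75", "1", "2.25")

    // Pin the locale explicitly instead of relying on the system default,
    // so that "." is read as the decimal separator on every machine.
    val parsed = df.parse("stars", options = ParserOptions(locale = Locale.US))

    parsed.print()
}
```

Passing the locale per call keeps the choice visible at the call site. The documentation also describes a global `DataFrame.parser.locale` setting if the same locale should apply everywhere.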

devcrocod commented 10 months ago

Yes, this is a problem specifically with the locale. But I expect both expressions to give the same result: df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() } and df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble().

Because in my opinion, toDouble() is just a shortcut for with.

Jolanrensen commented 10 months ago

> Yes, this is a problem specifically with the locale. But I expect both expressions to give the same result: df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() } and df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble().
>
> Because in my opinion, toDouble() is just a shortcut for with.

I know, it should, but I'd argue our solution is "better" as it takes locale into account. It's the stdlib toDouble() function that should change, but that's not something we can do.