Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
821 stars 58 forks source link

Replace Klaxon with kotlinx-serialization #603

Closed devcrocod closed 3 months ago

devcrocod commented 7 months ago

This is the first step in migrating from klaxon to kotlinx-serialization #312

This PR should be accepted after #574, after which I will resolve the conflicts.

Jolanrensen commented 7 months ago

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

zaleslaw commented 7 months ago

@devcrocod you wrote that this is a first step, what could be the next step?

devcrocod commented 7 months ago

@devcrocod you wrote that this is a first step, what could be the next step?

Check related issue

devcrocod commented 7 months ago

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

By default, klaxon translates all floating-point numbers into Double. kotlinx-serialization can work with float. So, the question here is not what DataFrame can do, but what klaxon could do. It was possible to support Float in JSON parsing, but then it would have been necessary to additionally change the logic in JSON parsing. I decided not to do this because these subsequent parsers should ultimately change, and because it maximally preserves the previous behavior of the library.

Jolanrensen commented 7 months ago

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

By default, klaxon translates all floating-point numbers into Double. kotlinx-serialization can work with float. So, the question here is not what DataFrame can do, but what klaxon could do. It was possible to support Float in JSON parsing, but then it would have been necessary to additionally change the logic in JSON parsing. I decided not to do this because these subsequent parsers should ultimately change, and because it maximally preserves the previous behavior of the library.

I think the previous behavior of the library was not intended, but rather a limit imposed on the library by Klaxon. It seems odd to keep this broken behavior if both Kotlinx-serialization ánd DataFrame support Floats to convert them to Doubles. So yes, this might require updating some tests and logic, but it's ultimately for the best I believe.

devcrocod commented 7 months ago

I agree that such behavior is not entirely correct, as it also carries computational errors when converting float to double. But I will repeat that this behavior will be changed in the future when serializers for dataframe objects are written.

Moreover, in the current version, there is a question: During serialization, it's obvious how it should look:

1.1f -> 1.1 (float)
1.1 -> 1.100000 (double)

But during deserialization, we look at each element, and the logic is as follows: if the number fits within 32, then it's a float; if it's more than 32, then it's a double. And it turns out that our list can easily contain both Float and Double, and instead of the output type being Double, we get Number.

Jolanrensen commented 3 months ago

@devcrocod Looks like you introduced 4 failing tests in org.jetbrains.dataframe.ksp.DataFrameSymbolProcessorTest

devcrocod commented 3 months ago

I made some minor corrections and a bit of refactoring. I also fixed issues with the plugin tests. The problem was that we set a rateLimit (10000) on the incoming input stream. When reading, we check if the selected format is appropriate; if not, we reset the input stream. The issue was that the JSON format was being predicted among the first, and kotlinx-serialization reads all bytes from the incoming stream, ignoring our set limit. Consequently, when we perform a reset, it always results in an error because more than the limit was read.

devcrocod commented 3 months ago

This is strange, I got errors with DataFrameBlackBoxCodegenTestGenerated after today's merge

Jolanrensen commented 3 months ago

This is strange, I got errors with DataFrameBlackBoxCodegenTestGenerated after today's merge

It's because https://github.com/Kotlin/dataframe/pull/729 got merged. Looks like the test failed because your API change changed the FIR representation in the plugin. This will probably happen a lot in the future but is actually solvable and just a warning really :)

Simply:

devcrocod commented 3 months ago

I did as you said, and locally everything worked for me. But as you can see, the tests on TC still failed:

org.opentest4j.AssertionFailedError: Actual data differs from file content: toDataFrame_dsl.fir.ir.txt

Could these files be platform or environment dependent? In any case, this approach of regenerating files for tests doesn't look as perfect decision

Jolanrensen commented 3 months ago

I did as you said, and locally everything worked for me. But as you can see, the tests on TC still failed:

org.opentest4j.AssertionFailedError: Actual data differs from file content: toDataFrame_dsl.fir.ir.txt

Could these files be platform or environment dependent? In any case, this approach of regenerating files for tests doesn't look as perfect decision

Try setting your project sdk + gradle's jvm to JDK 11, while we figure out how to make that the default for all contributors :)

Edit: actually, I fixed it in https://github.com/Kotlin/dataframe/pull/736, I'll update your branch from master and we should be good to go :)