Closed sonowz closed 6 months ago
I have looked at this issue once more and tried to understand how likely it is in a real-life setting.
Why would you use multi-threading in a Flink program yourself, manually?
We usually rely on the Flink execution environment to schedule all operators of the job graph and their respective tasks.
Don't you just let Flink schedule the job graph, which would derive TypeInformation as many times as needed and so create its own TypeSerializer for each task, even if all tasks run in the same TaskManager? I mean that the ParSeq case above is an artificial example, even though it reveals the problem.
No, I'm afraid that this bug affects Flink-controlled internals. It actually happened in my app, which is just a normal Flink app.
I'll try to give an example of such a typical app:
source.map[SomeDto](...)
.setParallelism(32)
When this part derives TypeInformation, all tasks always refer to the same TypeInformation instance because it's cached.
Therefore, all tasks call the createSerializer() method of the same instance, and they all get the same thread-unsafe TypeSerializer instance (this behavior can be confirmed with a Java debugger). Since tasks typically run in parallel, almost every Flink app has a chance of running into data inconsistencies.
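To make the sharing concrete, here is a minimal sketch that does not use Flink at all. `ScratchSerializer`, `SharingTypeInfo` and `SharedInstanceDemo` are hypothetical stand-ins I made up for illustration, not classes from Flink or this library; the only point is that a cached object whose `createSerializer()` returns a stored instance gives every task the same object.

```scala
// Hypothetical stand-ins for illustration; these are not Flink or flink-scala-api classes.
final class ScratchSerializer {
  // Mutable scratch space reused on every call, which is what makes sharing unsafe.
  private val scratch = new Array[Any](1)

  def roundTrip(value: Int): Int = {
    scratch(0) = value
    scratch(0).asInstanceOf[Int]
  }
}

// Mimics a type information object that hands out the one serializer it was built with.
final class SharingTypeInfo(serializer: ScratchSerializer) {
  def createSerializer(): ScratchSerializer = serializer
}

object SharedInstanceDemo {
  // A single cached instance, standing in for a cached TypeInformation.
  val cached = new SharingTypeInfo(new ScratchSerializer)

  // Each parallel task asks the cached instance for a serializer.
  def serializersFor(parallelism: Int): Seq[ScratchSerializer] =
    Seq.fill(parallelism)(cached.createSerializer())

  // ScratchSerializer does not override equals, so distinct compares by reference.
  def distinctCount(parallelism: Int): Int =
    serializersFor(parallelism).distinct.size
}
```

With a parallelism of 32, `SharedInstanceDemo.distinctCount(32)` reports a single distinct instance, which is the situation described above.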
Recently I ran a Flink app in a 48-core environment and experienced data inconsistency, and I can confirm that after fixing the code the inconsistency no longer happens. I hope this gets fixed soon!
Fixed by #113 and #114
Hi, I stumbled upon a race condition in an app using this library.
Description
As far as I know, the TypeInformation class is okay to use as a singleton, whereas TypeSerializer isn't. The JavaDoc of the TypeSerializer class notes that its methods are not necessarily thread safe.
However, the TypeInformation classes in the library just pass the TypeSerializer instance along, resulting in the same instance being used in multiple threads: https://github.com/flink-extended/flink-scala-api/blob/892bd718b4fb0f9c43d815095ce47ddab965b196/src/main/scala/org/apache/flinkx/api/typeinfo/ProductTypeInformation.scala#L18
Therefore, this can lead to data inconsistency when used with a thread-unsafe TypeSerializer such as CaseClassSerializer. (It has a mutable variable used during deserialization.)
Steps to reproduce
This example code shows that the data inconsistency can happen when run in a multicore environment:
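The hazard can also be illustrated deterministically, without threads or Flink. The sketch below is not the ParSeq reproduction and not the library's CaseClassSerializer; `BufferingDeserializer` and `InterleavingDemo` are hypothetical names. It simply replays by hand one interleaving that two parallel tasks could produce when they share a deserializer with a mutable field buffer.

```scala
// Hypothetical model of a deserializer that keeps a mutable field buffer between steps.
final class BufferingDeserializer {
  private val buffer = new Array[Any](2)

  // Step 1: read the record's fields into the shared buffer.
  def readFields(id: Int, name: String): Unit = {
    buffer(0) = id
    buffer(1) = name
  }

  // Step 2: build the record from whatever the buffer currently holds.
  def build(): (Int, String) =
    (buffer(0).asInstanceOf[Int], buffer(1).asInstanceOf[String])
}

object InterleavingDemo {
  // Replays one possible interleaving of two tasks sharing a single instance.
  def interleaved(): ((Int, String), (Int, String)) = {
    val shared = new BufferingDeserializer
    shared.readFields(1, "a") // task A reads its record
    shared.readFields(2, "b") // task B overwrites the buffer before A builds
    val recordA = shared.build() // task A now sees task B's fields
    val recordB = shared.build()
    (recordA, recordB)
  }

  // The same steps with one instance per task keep the records apart.
  def isolated(): ((Int, String), (Int, String)) = {
    val forA = new BufferingDeserializer
    val forB = new BufferingDeserializer
    forA.readFields(1, "a")
    forB.readFields(2, "b")
    (forA.build(), forB.build())
  }
}
```

In the interleaved run, task A ends up with task B's record; with separate instances each task gets its own record back. In a real job the interleaving depends on scheduling, so the corruption shows up only intermittently.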
Suggested fix
Looking at the createSerializer() method implementation of PojoTypeInfo in flink-core, it creates a new instance of TypeSerializer.
In the above example, if the TypeSerializer instantiation is modified to do the same, the data inconsistency does not happen anymore.
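As a rough sketch of that idea (again with hypothetical stand-in classes, not the actual change in the pull request), the type information can keep a prototype and return a fresh copy on every `createSerializer()` call. The `duplicate()` method here is modeled on the method of the same name on Flink's TypeSerializer, which exists for exactly this purpose; everything else is an assumption made for illustration.

```scala
// Hypothetical stand-ins for illustration; not the classes touched by the actual fix.
final class CopyableSerializer {
  // Mutable scratch space, so each thread needs its own instance.
  private val scratch = new Array[Any](1)

  def roundTrip(value: Int): Int = {
    scratch(0) = value
    scratch(0).asInstanceOf[Int]
  }

  // Modeled on TypeSerializer.duplicate(): a stateful serializer returns a fresh copy.
  def duplicate(): CopyableSerializer = new CopyableSerializer
}

// The type information may still be cached; it just never hands out its prototype.
final class DuplicatingTypeInfo(prototype: CopyableSerializer) {
  def createSerializer(): CopyableSerializer = prototype.duplicate()
}

object FreshInstanceDemo {
  val cached = new DuplicatingTypeInfo(new CopyableSerializer)

  // CopyableSerializer does not override equals, so distinct compares by reference.
  def distinctCount(parallelism: Int): Int =
    Seq.fill(parallelism)(cached.createSerializer()).distinct.size
}
```

Here `FreshInstanceDemo.distinctCount(32)` yields 32 distinct instances, one per task, so no mutable state is shared even though the type information itself remains a singleton.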
I'll submit a pull request to fix the issue. Feel free to ask if you have question about the issue or the fix.