flink-extended / flink-remote-shuffle

Remote Shuffle Service for Flink
Apache License 2.0

data request should implement crc check strategy #46

Closed lichaojacobs closed 2 years ago

lichaojacobs commented 2 years ago

Motivation

I found that this version may not implement a CRC check when sending data. This creates a risk of data inconsistency when byte reversed.

Changes

Test

wsry commented 2 years ago

@lichaojacobs Thanks for opening the issue. Could you explain more about that? What do you mean by "byte reversed", and how can that happen?

lichaojacobs commented 2 years ago

Sorry for my typo; what I mean is bit flipping. What if something goes wrong in the underlying system and causes a bit to flip from 0 to 1? This would cause data inconsistency. Below is an explanation of bit flipping:

Bitflips are events that cause individual bits stored in an electronic device to flip, turning a 0 to a 1 or vice versa. Cosmic radiation and fluctuations in power or temperature are the most common naturally occurring causes.

wsry commented 2 years ago

@lichaojacobs I think we should treat different devices differently. For devices like disk and network, the probability of bit flips is higher and we should verify data correctness. Actually, we already do that for disk, and for network, TCP does it, so we need to do nothing. For memory and CPU, I am not sure whether they already have some data correctness verification, but I think bit flips there are extremely rare; otherwise, any data in memory would be unreliable, for example, a false bool could suddenly change to true. A checksum would not help there either, because the data can suffer a bit flip before we finish calculating the checksum. What do you think? Are there any other cases I missed that can also easily cause bit flips?
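
For readers following the thread, here is a minimal sketch of the kind of check being discussed, assuming a plain byte buffer and the JDK's `java.util.zip.CRC32`. The class `CrcSketch` and its methods `checksum` and `verify` are hypothetical names chosen for illustration; this is not how flink-remote-shuffle verifies its data.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not the flink-remote-shuffle implementation.
public class CrcSketch {

    // Compute a CRC-32 checksum over the whole buffer.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Recompute the checksum on the reading side and compare it with
    // the value that was stored or sent alongside the data.
    static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }

    public static void main(String[] args) {
        byte[] buffer = "shuffle data".getBytes(StandardCharsets.UTF_8);
        long sent = checksum(buffer);

        // Simulate a single bit flip that happens AFTER the checksum was computed.
        buffer[3] ^= 0x01;

        System.out.println("buffer still valid? " + verify(buffer, sent));
    }
}
```

A check like this only catches a flip that occurs after `checksum` has run. If the bit flips in memory before the checksum is calculated, the checksum is computed over already-corrupted data and `verify` still passes, which is the limitation raised in the comment above.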

lichaojacobs commented 2 years ago

OK, I will close this issue for now, and maybe we can reopen it if bit flips actually happen.