criteo / biggraphite

Simple Scalable Time Series Database
Apache License 2.0

Schema version mismatch #556

Closed (zerosoul13 closed this issue 3 years ago)

zerosoul13 commented 3 years ago

Hello team,

Our DBA team has run into an issue a couple of times where a schema version mismatch is detected. When this happens, the only option we have is to do a rolling restart of the cluster.

I've seen mentions of you running a BigGraphite cluster with a good number of Cassandra nodes (our setup has 16 nodes), so I wanted to ask whether you have come across this issue before. I don't have many more details because the issue has only happened a couple of times, and we haven't been able to do a full event correlation to clearly point to any other possible causes.

Any comments are appreciated.

geobeau commented 3 years ago

Our biggest cluster has 80 servers split across two datacenters (40 servers each, with cross-DC replication).

We regularly see the message "Schema version mismatch detected", even on non-BigGraphite clusters, I think. We have found that it doesn't really have any impact. Do you see one?

Maybe you can get more data by running nodetool describecluster.
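To make that check repeatable, here is a minimal Python sketch that reads the "Schema versions" section of nodetool describecluster output and reports whether live nodes disagree. The sample text, IP addresses, and function names below are illustrative assumptions, not output from any real cluster, and the exact layout can vary between Cassandra versions.

```python
import re
from typing import Dict, List

# Illustrative sample only (made-up addresses); real output comes from
# running `nodetool describecluster` on a Cassandra node.
SAMPLE_OUTPUT = """\
Cluster Information:
    Name: Example Cluster
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]
        e84b6a60-24cf-30ca-9b58-452d92911703: [10.0.0.3]
"""

# Matches lines such as "<version-or-UNREACHABLE>: [host1, host2]".
VERSION_LINE = re.compile(r"^([\w-]+):\s*\[(.*)\]$")


def parse_schema_versions(text: str) -> Dict[str, List[str]]:
    """Map each schema version (or UNREACHABLE) to the hosts reporting it."""
    versions: Dict[str, List[str]] = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Schema versions:"):
            in_section = True
            continue
        if not in_section or not stripped:
            continue
        match = VERSION_LINE.match(stripped)
        if match is None:
            break  # a different section started; stop parsing
        hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
        versions[match.group(1)] = hosts
    return versions


def has_schema_mismatch(versions: Dict[str, List[str]]) -> bool:
    """True when reachable nodes report more than one schema version."""
    live = [v for v in versions if v != "UNREACHABLE"]
    return len(live) > 1


if __name__ == "__main__":
    parsed = parse_schema_versions(SAMPLE_OUTPUT)
    for version, hosts in parsed.items():
        print(version, hosts)
    print("mismatch:", has_schema_mismatch(parsed))
```

In practice the text could be captured with subprocess.run(["nodetool", "describecluster"], capture_output=True, text=True) and logged whenever the mismatch message appears, which would help with the event correlation mentioned above.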

zerosoul13 commented 3 years ago

In our case, I recall not being able to do much with BigGraphite data until a Cassandra rolling restart was done. Digging through chat history, I found the following error message:

biggraphite.drivers._utils.Error: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'write_type': 'SIMPLE'}

geobeau commented 3 years ago

Do you have a replication factor of at least 2?

zerosoul13 commented 3 years ago

I've checked with our DBA team and we use RF=2.
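For context on why the replication factor matters here, the arithmetic can be sketched as follows. This is a simplified single-datacenter model with hypothetical helper names, not BigGraphite or driver code: it computes how many replica acknowledgements a consistency level needs and how many replicas can be unavailable before a write fails.

```python
def required_responses(consistency: str, rf: int) -> int:
    """Replica acknowledgements needed for a write (single-DC simplification)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if consistency in fixed:
        needed = fixed[consistency]
    elif consistency == "QUORUM":
        needed = rf // 2 + 1
    elif consistency == "ALL":
        needed = rf
    else:
        raise ValueError(f"unsupported consistency level: {consistency}")
    if needed > rf:
        raise ValueError(f"{consistency} cannot be satisfied with RF={rf}")
    return needed


def tolerated_replica_failures(consistency: str, rf: int) -> int:
    """How many replicas may fail to respond before the write times out."""
    return rf - required_responses(consistency, rf)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(f"RF={rf}, ONE tolerates",
              tolerated_replica_failures("ONE", rf), "unresponsive replica(s)")
```

Under this model, a write at consistency ONE with RF=2 tolerates one unresponsive replica, so the error above (required_responses 1, received_responses 0) would suggest that neither replica acknowledged the write within the timeout.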