gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Stroom creating 000s of empty shards when it can't access the file system #4151

Closed at055612 closed 6 months ago

at055612 commented 6 months ago

If there is a failure of the file system where the index shards reside, the following happens in stroom.index.impl.IndexShardWriterCacheImpl#getWriterByShardKey:

It tries to open an existing shard. If there is no matching shard record in the DB, or there is one but opening its shard writer throws an exception, a null writer is returned. When a null writer is returned, it attempts to create a new shard record in the DB and then open a writer for that new shard. If opening the writer on the new shard also fails (likely if there is a file system problem), the new shard is marked corrupt in the DB.

Subsequent threads will do the same thing, resulting in many empty shards being created and marked corrupt.

It needs to handle the failure conditions better, e.g. potentially by not trying to create a new shard if it errors when opening one that is known to exist.
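A minimal sketch of that suggestion, not Stroom's actual code: ShardWriterSketch, Shard and WriterOpener are hypothetical stand-ins for the real cache, shard record and file system access, and the in-memory list stands in for the DB. The idea shown is to distinguish "no shard record exists" (creating one is legitimate) from "a shard record exists but its writer could not be opened" (propagate the error and create nothing). Opening the writer before registering a new record is a further assumption made here so that a failed open leaves no empty shard behind.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model; none of these types are Stroom's real classes.
class ShardWriterSketch {

    /** Hypothetical shard record, as it might be held in the DB. */
    static final class Shard {
        final long id;
        final String key;

        Shard(long id, String key) {
            this.id = id;
            this.key = key;
        }
    }

    /** Hypothetical abstraction over the file system operation that can fail. */
    interface WriterOpener {
        String open(Shard shard) throws IOException;
    }

    private final List<Shard> db = new ArrayList<>(); // stands in for the shard table
    private final WriterOpener opener;
    private long nextId = 1;

    ShardWriterSketch(WriterOpener opener) {
        this.opener = opener;
    }

    synchronized int shardCount() {
        return db.size();
    }

    synchronized void addExistingShard(String key) {
        db.add(new Shard(nextId++, key));
    }

    synchronized String getWriterByShardKey(String key) throws IOException {
        Optional<Shard> existing = db.stream()
                .filter(s -> s.key.equals(key))
                .findFirst();

        if (existing.isPresent()) {
            // A shard known to exist failed to open: let the exception propagate
            // instead of falling through to "create a new shard".
            return opener.open(existing.get());
        }

        // No record at all, so creating a shard is legitimate. Open the writer
        // first; if that throws, no record is added and no empty shard is left.
        Shard created = new Shard(nextId++, key);
        String writer = opener.open(created);
        db.add(created);
        return writer;
    }
}
```

With an opener that always throws, repeated calls for an existing key fail each time but the shard count stays at one, rather than growing with every attempt. A real fix would also need to decide how callers should back off or retry, which this sketch does not address.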