gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0
435 stars 53 forks source link

Add unique key (STRM_ID, STRM_ATR_KEY_ID) to STRM_ATR_VAL table #518

Open at055612 opened 6 years ago

at055612 commented 6 years ago

Currently the table will allow multiple attribute values for the same stream and key. The existing index on STRM_ID could be replaced by this unique key as STRM_ID if STRM_ID is kept at the front of the compound index.

Worth running select STRM_ID, STRM_ATR_KEY_ID, count(*) from STRM_ATR_VAL group by 1,2 having count(*) > 1; in existing environments to check if any duplicates have been stored.

Need to be cautious here as STRM_ATR_VAL is a MASSIVE table so dropping and creating an index will have big implications in terms of additional index storage and migration time.

burnalting commented 6 years ago

What is the expected outcome from executing select STRM_ID, STRM_ATR_KEY_ID, count(*) from STRM_ATR_VAL group by 1,2 having count(*) > 0;?

Are you looking for count(*)'s > 1?

In one small Stroom instance, select STRM_ID, STRM_ATR_KEY_ID, count(*) from STRM_ATR_VAL group by 1,2 having count(*) > 0; returns 1092812 rows and select STRM_ID, STRM_ATR_KEY_ID, count(*) from STRM_ATR_VAL group by 1,2 having count(*) > 1; return 0.

at055612 commented 6 years ago

sorry, that > 0 was my mistake. I was just testing the query locally and set it to zero to ensure it returned what I though. It should indeed be > 1. Thanks for checking.

burnalting commented 6 years ago

Fine.

I also tested some other larger installations with > 1 and they also returned 0.