gchq / stroom-docs

Documentation for Stroom and associated projects
Apache License 2.0

Add details about data store file types #62

Open stroomdev66 opened 2 years ago

stroomdev66 commented 2 years ago

Related to this section of the user guide: https://gchq.github.io/stroom-docs/hugo-docsy/docs/user-guide/concepts/streams/

A stream is either a single piece of data or several pieces that are joined together for the sake of efficient storage and processing.

Files ending *.mf.dat are manifest files. Each should be a plain text file that you can open, and it describes the high-level attributes of the stream as a whole rather than its individual entries.

All other files are either block GZIP data (*.bgz) or an index (*.bdy.dat and *.seg.dat).

BGZ files are a series of GZIP chunks of data appended together.
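As a rough illustration of "GZIP chunks appended together", the sketch below builds a small multi-member GZIP byte string with Python's standard `gzip` module and reads it back. It is not Stroom's writer; the sample data and variable names are made up, and a real BGZ file may carry additional framing that is not described here.

```python
import gzip

# Three pieces of data, each compressed as its own independent GZIP member.
chunks = [b"event one\n", b"event two\n", b"event three\n"]
members = [gzip.compress(chunk) for chunk in chunks]

# Appending the members back to back yields one valid multi-member GZIP payload.
bgz_like = b"".join(members)

# gzip.decompress handles multi-member data: it decodes each member in turn
# and returns the concatenated output.
restored = gzip.decompress(bgz_like)
print(restored.decode())
```

Because each member is self-contained, any single chunk can also be decompressed on its own once its start and end offsets are known.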

The index files are a series of byte offsets stored as Java long values (8 bytes per number) that tell Stroom where the split points between the GZIP chunks are.
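A minimal sketch of decoding such an index, assuming the longs are written in big-endian order (the default for Java's `DataOutputStream`, but not confirmed here for Stroom) and that the file contains nothing but the offsets. The function name `read_offsets` and the sample values are hypothetical.

```python
import struct
from typing import List


def read_offsets(raw: bytes) -> List[int]:
    """Decode a byte string of consecutive 8-byte signed integers.

    Assumption: big-endian byte order, no header or trailer.
    """
    if len(raw) % 8 != 0:
        raise ValueError("index length is not a multiple of 8 bytes")
    count = len(raw) // 8
    # ">q" is a big-endian signed 64-bit integer, the same width as a Java long.
    return list(struct.unpack(f">{count}q", raw))


# Hypothetical index content holding three offsets.
sample_index = struct.pack(">3q", 0, 1024, 4096)
offsets = read_offsets(sample_index)
print(offsets)
```

For a real file you would pass the result of `open(path, "rb").read()` instead of `sample_index`.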

You will only see the *.seg.dat index files alongside processed data that is configured to segment its output. Segmenting the output means that an index is written which lets the system seek to a specific event without decompressing the whole stream. Instead it decompresses only the appropriate chunk and reads the event straight from that byte position.
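The idea of seeking via an index can be sketched as follows. This is a simplified model rather than Stroom's actual layout. It assumes one event per GZIP chunk and an index holding the start offset of each chunk within the compressed data as big-endian 8-byte values. The helper `read_chunk` and the sample events are hypothetical.

```python
import gzip
import struct

events = [b"<event id='1'/>\n", b"<event id='2'/>\n", b"<event id='3'/>\n"]

# Write side: append each compressed chunk and record where it starts.
data = bytearray()
index = bytearray()
for event in events:
    index += struct.pack(">q", len(data))  # start offset of this chunk
    data += gzip.compress(event)


def read_chunk(data: bytes, index: bytes, n: int) -> bytes:
    """Decompress only chunk n, using the offsets to find its boundaries."""
    offsets = struct.unpack(f">{len(index) // 8}q", index)
    start = offsets[n]
    # The chunk ends where the next one starts, or at end of data for the last.
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return gzip.decompress(data[start:end])


second = read_chunk(bytes(data), bytes(index), 1)
print(second.decode())
```

Only the bytes between two offsets are decompressed, so the cost of reading one event does not grow with the size of the whole stream.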

In addition to these file types, the extension contains a further part that indicates the type of data stored in the BGZ file. These are as follows:

RAW_EVENTS, "revt"
RAW_REFERENCE, "rref"
EVENTS, "evt"
REFERENCE, "ref"
TEST_EVENTS, "tevt"
TEST_REFERENCE, "tref"
META, "meta"
ERROR, "err"
CONTEXT, "ctx"
DETECTIONS, "dtxn"
RECORDS, "rec"

Some of these may not exist at all, as we moved away from creating an extension for each stream type. Some were also experimental.
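For anyone scripting against a data store directory, the table above can be expressed as a lookup. This is only a sketch. The file name layout (dot-separated parts after a base name) and the example names are assumptions, and the function returns the first matching part, so it does not distinguish a file whose name carries more than one of these codes.

```python
from typing import Optional

# Extension part -> stream type name, copied from the list above.
STREAM_TYPE_BY_EXTENSION = {
    "revt": "RAW_EVENTS",
    "rref": "RAW_REFERENCE",
    "evt": "EVENTS",
    "ref": "REFERENCE",
    "tevt": "TEST_EVENTS",
    "tref": "TEST_REFERENCE",
    "meta": "META",
    "err": "ERROR",
    "ctx": "CONTEXT",
    "dtxn": "DETECTIONS",
    "rec": "RECORDS",
}


def stream_type_of(filename: str) -> Optional[str]:
    """Return the stream type for the first recognised extension part, if any."""
    # Skip the first component, which is assumed to be the base name.
    for part in filename.split(".")[1:]:
        if part in STREAM_TYPE_BY_EXTENSION:
            return STREAM_TYPE_BY_EXTENSION[part]
    return None


# Hypothetical file names, for illustration only.
print(stream_type_of("example.revt.bgz"))
print(stream_type_of("example.bgz"))
```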

at055612 commented 2 years ago

This is the page in question https://gchq.github.io/stroom-docs/7.0/docs/user-guide/concepts/streams/