gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Use a vocabulary for the ref data fastinfoset serialiser #1945

Open at055612 opened 3 years ago

at055612 commented 3 years ago

Many of the XML values in the ref data store will contain much of the same content, typically event-logging XML element/attribute names. The serialised size of the values could be reduced if we used a vocabulary.

We could parse the event-logging XSD to add all element and attribute names to the vocabulary. The question is how we would store the vocabulary. It can't really be generated on the fly in stroom, as the stroom instance may not have the event-logging schema. We may need some optional process that processes some ref data and generates a vocab from it, storing it (or an ordered list of the names, so it can be regenerated on boot) in an LMDB db. Thought would be needed on how any change to that vocab is handled, e.g. as the event-logging schema evolves or if different ref data is used. Each ref stream definition may need to be associated with a particular vocab instance.
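A minimal sketch of the "parse the XSD for names" idea, using only the JDK DOM parser. It collects the `name` attribute of every `xs:element` and `xs:attribute` declaration into an ordered set. `XsdNameCollector` and `collectNames` are hypothetical names for illustration, not anything in stroom or FastInfoset, and turning the resulting names into an actual FastInfoset vocabulary is not shown.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: gathers declared element/attribute names from an XSD.
final class XsdNameCollector {

    private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Returns names in a stable order: all element names (document order),
    // then all attribute names (document order), with duplicates removed.
    static Set<String> collectNames(final String xsd) {
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            final Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xsd)));
            final Set<String> names = new LinkedHashSet<>();
            addNames(doc.getElementsByTagNameNS(XSD_NS, "element"), names);
            addNames(doc.getElementsByTagNameNS(XSD_NS, "attribute"), names);
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unable to parse XSD", e);
        }
    }

    private static void addNames(final NodeList nodes, final Set<String> names) {
        for (int i = 0; i < nodes.getLength(); i++) {
            // Declarations that use ref="..." have no name attribute, so skip them.
            final String name = ((Element) nodes.item(i)).getAttribute("name");
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
    }
}
```

The ordered set matters if the list of names is what gets persisted, since regenerating the vocab on boot would need the same ordering each time.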

The degree of benefit (reduced disk+memory usage) of using a vocab would need to be tested to see if it is worth the added complexity.

at055612 commented 1 year ago

Some useful stuff here about using a vocab https://stackoverflow.com/questions/4563431/has-anyone-else-seen-the-java-xml-fastinfoset-library-corrupt-text

Potentially we could use com.sun.xml.fastinfoset.tools.VocabularyGenerator to capture the sax events of the ref data filter and do something like this:

  1. init an empty generator
  2. process one event.
  3. get the serialiser+parser vocab from the generator
  4. give the serialiser vocab to the FIS serialiser for the next event.
  5. repeat

What is not clear is whether you can build up the vocab in this way, as the vocab for each event is slightly different because it evolves as we move along. As the serialiser for event 1 is only given the vocab for event 1 after processing event 1, can it parse event 1 if given the parser vocab that includes event 1 vocab up front?

This all needs fleshing out in a simple JUnit test.

at055612 commented 1 year ago

See the test stroom.pipeline.refdata.TestFastInfosetVocab

This shows roughly how to generate a vocab for some sax events, use it to serialise, then add to it with another set of sax events to serialise something else. For 424 bytes of XML as a string, this serialises to 229 bytes with no vocab. Add in the vocab (with char data) and it goes down to 50 bytes; without the char data it is 118 bytes.

Given that the ref store already de-dups values, we could probably NOT add char data chunks to the vocab. Char data would make the vocab very big and would mean it changes frequently.

Hopefully we could use a single vocab for the whole store (or if we do #3219, one vocab per feed specific store) and just keep adding to it. Any time it changes it would need to be written to LMDB to ensure that any value in the store that was serialised with that version of the vocab can be deserialised. It probably only needs to be written to LMDB on completion of any load.
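To illustrate why "just keep adding to it" should keep old values readable, here is a toy model of an append-only vocab. It is NOT the FastInfoset API: `AppendOnlyVocab` is a hypothetical class that maps names to indexes, and it assumes names are only ever appended, never removed or reordered. Under that assumption, anything encoded against an earlier version of the vocab still decodes against any later version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names get the next free index and never move afterwards.
final class AppendOnlyVocab {

    private final List<String> names = new ArrayList<>();
    private final Map<String, Integer> indexByName = new HashMap<>();

    // Returns the index for a name, appending the name if it is new.
    int indexOf(final String name) {
        final Integer existing = indexByName.get(name);
        if (existing != null) {
            return existing;
        }
        final int index = names.size();
        names.add(name);
        indexByName.put(name, index);
        return index;
    }

    // Replaces each name with its index, growing the vocab as needed.
    int[] encode(final List<String> elementNames) {
        final int[] encoded = new int[elementNames.size()];
        for (int i = 0; i < encoded.length; i++) {
            encoded[i] = indexOf(elementNames.get(i));
        }
        return encoded;
    }

    // Decoding only reads existing indexes, so later additions cannot break it.
    List<String> decode(final int[] encoded) {
        final List<String> decoded = new ArrayList<>(encoded.length);
        for (final int index : encoded) {
            decoded.add(names.get(index));
        }
        return decoded;
    }

    int size() {
        return names.size();
    }
}
```

In this model the persisted form is just the ordered list of names, so writing it out at the end of a load is enough as long as no value encoded during that load is read by another process before the write completes.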

To give us backwards compatibility for values with no vocab we may need to create a new impl of RefDataValue so we can distinguish values that need to be de-serialised with a vocab and those that don't.

The vocab will have to be read into memory from LMDB, then shared by all lookup/load threads that need it. Mutation of it will probably need to be done optimistically, e.g. by adding to a local copy of the vocab then trying to CAS it with the shared one, but this would rely on being able to replay sax events to add to a newer local copy if the CAS fails.
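A rough sketch of the optimistic update, assuming the shared vocab can be represented as an immutable ordered list of names held in an `AtomicReference`. `SharedVocab` and `addAll` are hypothetical names, and the list of names seen stands in for the sax events that would need replaying. On a lost race the loop re-reads the newer snapshot and replays the same names against it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for a vocab shared between load/lookup threads.
final class SharedVocab {

    // Each snapshot is immutable, so readers never see a half-updated vocab.
    private final AtomicReference<List<String>> current =
            new AtomicReference<>(List.of());

    List<String> snapshot() {
        return current.get();
    }

    // Appends any unseen names and returns the snapshot that contains them all.
    List<String> addAll(final List<String> namesSeen) {
        while (true) {
            final List<String> base = current.get();
            final List<String> next = new ArrayList<>(base);
            for (final String name : namesSeen) {
                if (!next.contains(name)) {
                    next.add(name);
                }
            }
            if (next.size() == base.size()) {
                return base; // Nothing new, so no swap is needed.
            }
            final List<String> frozen = List.copyOf(next);
            if (current.compareAndSet(base, frozen)) {
                return frozen;
            }
            // Another thread swapped first: loop and replay against its snapshot.
        }
    }
}
```

The linear `contains` check is fine for a sketch. A real vocab of any size would want an index map alongside the list.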

Need to figure out how we can add to the vocab as we get each sax event for the ref data value, then immediately make use of it to serialise that sax event. Potentially we can keep updating the FI serialiser with the latest vocab for each sax event.

at055612 commented 11 months ago

The lib https://mvnrepository.com/artifact/com.sun.xml.fastinfoset/FastInfosetUtilities has com.sun.xml.analysis.frequency.SchemaProcessor that appears to be able to generate a vocab from a schema.

The Vocabulary is just some sets of strings, so this could be serialised to JSON or similar for storing in a table in the ref store.