[Logs + Metrics UI] Gather requirements for a test data generator

weltenwort commented 4 years ago

Since the creation of test data was identified as a major pain during the last team retrospectives, let's gather some requirements a solution would have to fulfill to cover most of our test data needs. The goal is to arrive at a set of requirements that can be used to implement a test data generation API and tools.

In particular, I'd be interested in answers to following questions:

What would be common kinds of values we need? (dates, numbers, keywords, text)
What statistical parameters of these values do we need to control?
Which number of documents would be required for some expected test scenarios?

I'd love to hear as much input as possible on the above aspects in the comments so I can amend the requirements I started out with.

Requirements

declarative schema: The test data specification should be declarative so it is decoupled from the specific implementation and can be version controlled and diffed easily.
reproducible test data generation: The test data are based on an explicitly provided seed and lead to identical results given an identical declaration and seed in repeated calls.
stable under schema evolution: When the schema evolves over time, additions, removals, and changes in some parts of the schema don't unnecessarily change the generated data in other parts.
data preview: The test data and mappings can be previewed during test development.
versioned data generation algorithms: The algorithms for generating the individual pieces of data are versioned such that the tool can evolve while keeping old schemata valid and the generated data stable.
created on-the-fly: The test data are generated at runtime without being under version control themselves.
create ES documents during tests: The test data are indexed into ES as part of the test setup process (i.e. in before()).
create ES mappings during tests: The schema has sufficient details to generate ES mappings which are sent to ES before data are indexed.
symmetric creation and cleanup: The test data and mappings can be cleanly removed from ES during test teardown (i.e. in after()).

API Proposals

Schema

A possible way of specifying both the mapping and the statistical properties of the generated documents would be to use a superset of the index creation body as the schema. It could be enhanced with some data-generator meta keys (__data_generator in the following example), which would be interpreted and stripped out during data generation:

{
  "__data_generator": {
    "type": "singleIndex",
    "default_seed": "foo",
    "samples": 10000,
    "index_name": "filebeat-8.0.0-generated"
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date",
        "__data_generator": {
          "type": "randomUniform",
          "min": 1581452864420,
          "max": 1581539264420
        }
      },
      "event": {
        "type": "object",
        "properties": {
          "dataset": {
            "type": "keyword",
            "__data_generator": {
              "type": "randomPickOne",
              "values": ["word1", "word2", "word3", "word4"]
            }
          }
        }
      },
      "message": {
        "type": "text",
        "__data_generator": {
          "type": "randomPattern",
          "patterns": [
            {
              "pattern": "{{ method }} {{ url }} {{ statusCode }}",
              "values": {
                "method": {
                  "type": "randomPickOne",
                  "values": ["GET", "POST"]
                },
                "url": {
                  "type": "randomUrl"
                },
                "statusCode": {
                  "type": "randomPickOne",
                  "values": [200, 404]
                }
              }
            }
          ]
        }
      }
    }
  }
}

Test API

import exampleSchema from './example_schema.json';

const seed = "seedForThisTestSuite";

export default function createExampleTestSuite({ getService }: FtrProviderContext) {
  const dataGenerator = getService('dataGenerator');

  describe('my test suite', () => {
    before(async () => {
      await dataGenerator.generateData(exampleSchema, seed);
    });

    after(async () => {
      await dataGenerator.removeData(exampleSchema);
    })

    it('behaves as expected', async () => {
      // ...
    });
  });
});

elasticmachine commented 4 years ago

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

elasticmachine commented 10 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

weltenwort commented 9 months ago

IMO we could close this now that we can generate logs using synthtrace.

elastic / kibana