elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Text Structure][ML] Improve multi-line start pattern recognition when no timestamps are present #79708

Open jgowdyelastic opened 3 years ago

jgowdyelastic commented 3 years ago

Two example files were supplied by a user:

  1. poc-noprob-310.csv
  2. poc-repro-311.csv

The find_structure endpoint will only produce a multiline_start_pattern when the first line of a multi-line document includes a field which it assumes is a date field. In the examples provided, it appears the very long (300+ character) number is treated as a timestamp, and so a multiline_start_pattern is produced. Without this multiline_start_pattern the file upload plugin in Kibana cannot correctly parse the file, as it will treat the newline character in col2 as the end of the record.
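For reference, the behaviour can be reproduced directly against the endpoint. A minimal sketch, assuming a local cluster on localhost:9200 and a hypothetical CSV where col2 is a quoted string containing a newline (the Content-Type header follows the pattern in the docs examples):

```python
import requests  # assumes the `requests` package is installed

# Hypothetical CSV: col2 is a quoted string containing a newline.
csv_data = 'col1,col2,col3\n123,"first\nsecond",2021-10-22\n'

resp = requests.post(
    "http://localhost:9200/_text_structure/find_structure",
    headers={"Content-Type": "application/json"},
    data=csv_data.encode("utf-8"),
)
structure = resp.json()
# multiline_start_pattern is only present when a suitable field was found.
print(structure.get("multiline_start_pattern"))
```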

Is it possible to produce a multiline_start_pattern when a date field does not exist on the first line of a multi-line message?

elasticmachine commented 3 years ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 3 years ago

It's not completely true to say that there needs to be a timestamp present, but there does need to be a field that always appears on the first line of each CSV record. The current logic is described here (and the code is underneath this comment):

https://github.com/elastic/elasticsearch/blob/68817d7ca29c264b3ea3f766737d81e2ebb4028c/x-pack/plugin/text-structure/src/main/java/org/elasticsearch/xpack/textstructure/structurefinder/DelimitedTextStructureFinder.java#L739-L750
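To make that logic concrete, here is a rough illustration in Python (not the actual Java) of the kind of start pattern that gets built: skip the columns before the detected field, then require that field's pattern. The real implementation in DelimitedTextStructureFinder handles quoting and delimiters far more carefully.

```python
import re

def build_start_pattern(column_index, field_regex):
    # Skip `column_index` complete unquoted fields, then optionally an
    # opening quote, then the detected field (e.g. a timestamp).
    skip_fields = r'(?:[^,"\n]*,){%d}' % column_index
    return "^" + skip_fields + '"?' + field_regex

# e.g. a date detected in the third column (index 2)
pattern = build_start_pattern(2, r"\d{4}-\d{2}-\d{2}")
record = 'id1,alice,2021-10-22,"multi\nline"'
assert re.match(pattern, record)        # first line of a record matches
assert not re.match(pattern, 'line"')   # a continuation line does not
```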

The next thought is of course to say this is crazy, just run the whole file through a CSV parser. However, the problem arises because of the way Elasticsearch and its data shippers have always worked: first splitting files into messages, then parsing those messages individually. Because of this approach, "just run the whole file through a CSV parser" is not nearly as simple as it sounds within the Elastic ecosystem.

The best way to accurately determine where one CSV record ends and the next starts is to use a proper CSV parser, not regexes. This was recognised in Logstash nearly 7 years ago - see https://github.com/elastic/logstash/issues/2088#issuecomment-63423115. But the Filebeat/ingest pipeline combination still relies on regexes to split the file into messages, then runs a CSV parser on each message separately. Filebeat does now have a built-in CSV parser - see https://github.com/elastic/beats/pull/11753 - but this still only parses an individual field (which could be a full message) after the file has been split into messages.
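To illustrate the parser-versus-regex point with a standalone Python sketch (not Elastic code): newline-based splitting cuts a quoted multiline field apart, while a real CSV parser tracks quoting and keeps each record whole.

```python
import csv
import io

sample = 'id,comment\n1,"first\nrecord"\n2,"second"\n'

# Naive newline splitting cuts the quoted field into two "messages":
print(sample.splitlines())
# ['id,comment', '1,"first', 'record"', '2,"second"']

# A proper CSV parser tracks quoting, so each record stays intact:
for row in csv.reader(io.StringIO(sample)):
    print(row)
# ['id', 'comment']  ['1', 'first\nrecord']  ['2', 'second']
```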

This means that the text structure finder has a very difficult task if the first field in each CSV record is a string field that could potentially be multiline. It needs to find something common in the parts of that field that always appear on the first line, which can be used to detect the first line. Whatever regex is chosen must not then match any other line.

The workaround is to rearrange the column order so that a number, boolean, or the detected date field comes before the first string field that might be multiline.
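As a sketch of that workaround (hypothetical column names, plain Python): reordering so the id and date precede the multiline text field gives the structure finder something stable on the first line of every record.

```python
import csv
import io

# Hypothetical input: the first column is free text that can contain newlines.
src = 'comment,id,created\n"multi\nline note",42,2021-10-22\n'

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(src)):
    writer.writerow(row[1:] + row[:1])  # move id and created to the front
print(out.getvalue())
# id,created,comment
# 42,2021-10-22,"multi
# line note"
```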

If no such field can be found, then we could potentially try to do something with the first string field, looking for commonality across all messages that never appears on follow-on lines of messages. Before doing this it would be nice to have a concrete example of a real file showing that this approach would be useful, rather than the made-up examples attached.
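One rough way such commonality could be searched for (an illustrative sketch, not a proposal for the actual implementation): take the longest common prefix of every message's first line and accept it only if no continuation line also starts with it.

```python
import os
import re

def common_first_line_prefix(messages):
    """Return an anchored regex for the longest common prefix of every
    message's first line, or None if a continuation line also starts
    with that prefix (which would make the pattern ambiguous)."""
    first_lines = [m.split("\n", 1)[0] for m in messages]
    prefix = os.path.commonprefix(first_lines)
    if not prefix:
        return None
    continuations = [line for m in messages for line in m.split("\n")[1:]]
    if any(line.startswith(prefix) for line in continuations):
        return None
    return "^" + re.escape(prefix)

msgs = ['"EVT-100",a,"x\ny"', '"EVT-200",b,"z"']
print(common_first_line_prefix(msgs))  # anchored, escaped prefix, e.g. ^"EVT\-
```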

The other potential change would be in Filebeat/Elastic Agent: if that process could use a CSV parser to split the file into messages instead of just using regexes, then the text structure finder could generate a config specifying to do that. That would be a case of opening an enhancement request against Filebeat/Elastic Agent, waiting for it to be implemented, and then progressing this issue.

droberts195 commented 2 years ago

#85066 should help with this.