elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Multi-line start patterns for CSV from the text structure endpoint are fragile #92798

Open droberts195 opened 1 year ago

droberts195 commented 1 year ago

When analysing CSV where some of the records are multi-line, the text structure endpoint needs to create a regular expression that will match the first line of each CSV record.

This seems silly, as CSV is a well-defined format where it's clear where each record ends and the next begins. However, there's an underlying problem here in the way the Elastic stack ingests data:

If you look at how both Logstash and Beats split files into messages, they use a regular expression to detect lines that are either the first or last line in a message. This is how the lines of multi-line messages get grouped.
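To make the grouping concrete, here is a minimal Python sketch of the idea. It is not the actual Logstash or Beats code: the `group_lines` helper and the `^\d+,` pattern are assumptions chosen for illustration, with the pattern standing in for "a numeric first column".

```python
import re


def group_lines(lines, start_pattern):
    """Group physical lines into messages.

    A line that matches start_pattern begins a new message; any other
    line is appended to the message currently being built.
    """
    start = re.compile(start_pattern)
    messages = []
    for line in lines:
        if start.match(line) or not messages:
            messages.append(line)
        else:
            messages[-1] += "\n" + line
    return messages


# Two CSV records, the first of which spans two physical lines.
lines = [
    '1,"first line',
    'second line",ok',
    '2,"single line",ok',
]

# Hypothetical start pattern: the record begins with a numeric column.
messages = group_lines(lines, r"^\d+,")
print(len(messages))  # 2
```

Note that this only works because the continuation line does not itself look like the start of a record, which is exactly the fragility this issue is about.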

So, the text structure endpoint needs to find a field near the beginning of each CSV record that's simple enough to match with a regular expression.

The way this is done has been improved over the years, first in #51737 and then in #85066. But it's still not perfect.

The requirement is that the CSV has a field that is boolean, numeric, date or low-cardinality keyword, and that this field comes before every field that contains a newline in any record in the sample provided. "Low cardinality" basically means 5 or fewer distinct values. However, there is an additional restriction: none of these values may appear in any other field on the lines, because then the regular expression could match those other fields instead.
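As a rough illustration of that rule, the following Python sketch looks for a usable column in a parsed sample. It is a simplification and not the endpoint's implementation: `find_anchor_column` and the helpers are hypothetical names, date detection is left out, and "appears in another field" is interpreted here as a substring match.

```python
import csv
import io

MAX_KEYWORD_VALUES = 5  # the "low cardinality" threshold described above


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_boolean(value):
    return value.lower() in ("true", "false")


def find_anchor_column(rows):
    """Return the index of the first column usable to anchor a start
    pattern, or None if no column qualifies. rows excludes the header."""
    num_cols = len(rows[0])
    # Only columns before the first newline-containing column are eligible.
    first_multiline = next(
        (c for c in range(num_cols) if any("\n" in row[c] for row in rows)),
        num_cols,
    )
    for col in range(first_multiline):
        values = {row[col] for row in rows}
        if all(is_number(v) for v in values) or all(is_boolean(v) for v in values):
            return col
        if len(values) <= MAX_KEYWORD_VALUES:
            others = {row[c] for row in rows for c in range(num_cols) if c != col}
            # Assumption: a keyword value is rejected if it occurs as a
            # substring of any other field in the sample.
            if not any(v in other for v in values for other in others):
                return col
    return None


def parse(csv_text):
    """Parse CSV text and drop the header row."""
    return list(csv.reader(io.StringIO(csv_text, newline="")))[1:]


good_rows = parse('id,message,level\n1,"line one\nline two",INFO\n2,"single",WARN\n')
bad_rows = parse('message,id,level\n"line one\nline two",1,INFO\n"single",2,WARN\n')

print(find_anchor_column(good_rows))  # 0
print(find_anchor_column(bad_rows))   # None
```

The second sample has the multi-line field in the first column, so no earlier column exists to anchor on.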

This means that it's easy to create a CSV file that the text structure endpoint cannot process, simply by putting a field with newlines near the beginning of the records.

The current workaround is to rearrange the column order so that the first column on each line is either numeric or the primary date field of the file.
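For anyone needing to apply the workaround to an existing file, a small sketch using Python's standard `csv` module is below. The `move_column_first` helper and the column names in the example are made up for illustration; it assumes the file has a header row and fits in memory.

```python
import csv
import io


def move_column_first(csv_text, column_name):
    """Rewrite CSV text so that column_name becomes the first column.

    Quoted fields that contain newlines are preserved as-is.
    """
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    idx = rows[0].index(column_name)  # raises ValueError if the column is missing
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[idx]] + row[:idx] + row[idx + 1:])
    return out.getvalue()


original = 'message,id\n"line one\nline two",1\nsingle,2\n'
reordered = move_column_first(original, "id")
print(reordered)
```

After reordering, every record starts with the numeric `id` column, so a start pattern based on that column can be generated.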

elasticsearchmachine commented 1 year ago

Pinging @elastic/ml-core (Team:ML)