[ML] Multi-line start patterns for CSV from the text structure endpoint are fragile

When analysing CSV where some of the records are multi-line, the text structure endpoint needs to create a regular expression that will match the first line of each CSV record.

This seems silly, as CSV is a well-defined format where it's clear where each record ends and the next begins. However, there's an underlying problem here in the way the Elastic Stack ingests data:

1. First, files are split into "messages" - these can be single-line or multi-line
2. Then messages are converted into JSON documents
3. Then these JSON documents are sent to Elasticsearch, where further processing is performed

This processing chain means that it's necessary to split a CSV file into separate "messages" in step 1 before finally parsing the individual CSV records using a proper CSV parser in step 3.
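To illustrate the gap between the two views of the same file, here is a minimal Python sketch. The sample data and function names are invented for illustration, and Python's `csv` module stands in for "a proper CSV parser"; it is not what the Elastic Stack uses.

```python
import csv
import io

# Hypothetical sample: a header plus two records, the first of which
# has a quoted field containing a newline.
SAMPLE = (
    'time,level,message\n'
    '2023-01-10T09:00:00Z,INFO,"started\nwith two lines"\n'
    '2023-01-10T09:00:01Z,WARN,"single line"\n'
)


def physical_lines(text):
    """What a line-oriented shipper sees before any CSV parsing."""
    return text.splitlines()


def csv_records(text):
    """What a real CSV parser sees: quoted newlines stay inside a field."""
    return list(csv.reader(io.StringIO(text)))


print(len(physical_lines(SAMPLE)))  # number of physical lines
print(len(csv_records(SAMPLE)))     # number of CSV records (incl. header)
```

The line-oriented view has one more entry than the record view, which is exactly the difference the message-splitting step has to reconstruct without a CSV parser.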

If you look at how both Logstash and Beats split files into messages, they use a regular expression to detect lines that are either the first or last line in a message. This is how the lines of multi-line messages get grouped.
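As a rough sketch of that grouping (not the actual Logstash or Beats code; the function name, sample lines and pattern are assumptions made for illustration), a "first line" pattern can be applied like this:

```python
import re


def group_messages(lines, start_pattern):
    """Group physical lines into messages.

    A line matching start_pattern begins a new message; any other line
    is appended to the message currently being built.
    """
    start = re.compile(start_pattern)
    messages = []
    for line in lines:
        if start.match(line) or not messages:
            messages.append(line)
        else:
            messages[-1] += "\n" + line
    return messages


# Hypothetical input: one two-line record followed by a one-line record.
LINES = [
    '2023-01-10T09:00:00Z,INFO,"started',
    'with two lines"',
    '2023-01-10T09:00:01Z,WARN,"single line"',
]
# Hypothetical start pattern: an ISO 8601 UTC timestamp as the first field.
START = r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z,'

for message in group_messages(LINES, START):
    print(repr(message))
```

The approach only works if no continuation line happens to match the pattern, which is why the choice of field to match on matters so much.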

So, the text structure endpoint needs to find a field near the beginning of each CSV record that's simple enough to match with a regular expression.

The way this is done has been improved over the years, first in #51737 and then in #85066. But it's still not perfect.

The requirement is that each CSV record has a field that is either boolean, numeric, date or low-cardinality keyword, and that this field comes before any field that contains a newline in any record in the sample provided. The definition of "low cardinality" is basically 5 or fewer distinct values. However, there is an additional restriction that these values cannot appear in any other field of the records, because then the regular expression could match those other fields instead.
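The requirement can be approximated in a few lines of Python. This is a deliberately simplified sketch, not the text structure endpoint's implementation: the function names are invented, date detection is omitted, rows are assumed to be already parsed and of equal length, and "appears in another field" is modelled as a plain substring test.

```python
def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_boolean(value):
    return value.lower() in ("true", "false")


def find_start_column(rows, max_keywords=5):
    """Return the index of a column usable for a start pattern, or None.

    rows: non-empty list of parsed data records (no header), all the
    same length. Date fields are not handled in this sketch.
    """
    num_cols = len(rows[0])
    # First column that contains a newline in any sampled record.
    first_multiline = next(
        (c for c in range(num_cols) if any("\n" in r[c] for r in rows)),
        num_cols,
    )
    # Only columns before the first multi-line column are candidates.
    for col in range(first_multiline):
        values = {r[col] for r in rows}
        if all(is_number(v) for v in values) or all(is_boolean(v) for v in values):
            return col
        if len(values) <= max_keywords:
            others = [r[c] for r in rows for c in range(num_cols) if c != col]
            # Reject keywords that also occur inside any other field.
            if not any(v in other for v in values for other in others):
                return col
    return None
```

With this sketch, a sample whose first column contains a newline yields `None` immediately, because there are no candidate columns before it.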

This means that it's easy to create a CSV file that the text structure endpoint cannot process, simply by putting a field with newlines near the beginning of the records.

The current workaround is to rearrange the column order so that the first column on each line is either numeric or the primary date field of the file.
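One way to apply this workaround before uploading a file is sketched below in Python. The function name is hypothetical, and the sketch assumes the file has a header row and fits in memory.

```python
import csv
import io


def move_column_first(text, column_name):
    """Rewrite CSV text so that column_name becomes the first column.

    Assumes the first record is a header row containing column_name.
    """
    rows = list(csv.reader(io.StringIO(text)))
    idx = rows[0].index(column_name)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Put the chosen field first and keep the others in their order.
        writer.writerow([row[idx]] + row[:idx] + row[idx + 1:])
    return out.getvalue()


# Hypothetical input where the multi-line field comes first.
ORIGINAL = 'message,time\n"a\nb",2023-01-10T09:00:00Z\n'
print(move_column_first(ORIGINAL, "time"))
```

Because the rewrite goes through a CSV parser and writer, fields containing newlines are re-quoted rather than split.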

elastic/elasticsearch #92798