Open travis-cook-sfdc opened 1 year ago
I spent a little bit more time with this and now understand why 4️⃣ was an issue.
public class FileTest {
public static void matches(Path path, String glob){
PathMatcher matcher = FileSystems.getDefault().getPathMatcher(glob);
System.out.println(matcher.matches(path));
}
public static void main(String[] args) throws IOException {
Path path = Paths.get("s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet");
System.out.println(path.toString());
matches(path, "regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
matches(path, "regex:^s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
}
}
FileTest.main(new String[] {})
Returns
s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet
false
true
Because Pinot regex matches on the Java Path object using getPathMatcher
, and java path's convert //
to /
, it's critical that the regex matches that are sent for ingestion are aware of that fact.
I think it would be useful to clean up the documentation significantly.
According to the docs,
includeFileNamePattern
andexcludeFileNamePattern
are documented like:A few issues here:
1️⃣ The example of
regex:^..(avro)$
does not actually work. When running a job with this pattern, you'll get an error like thisI'm assuming this because of the templating that was introduced in #5341 (also not documented) , but job spec's appear to have special handling for both
$
, which needs to be escaped:\$
, and backslashes which are automatically escaped to\\
2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think this is an impossible regex to use
.*\.parquet$
because it's not clear how to get the single backslash character.\
turns into\\
and\\
stays as\\
. This issue can be worked around by using character classes and writing.*[.]parquet$
, but it feels wrong.3️⃣ What flavor of regex is actually being used here?
regextester.com
linked in the documentation only supports PCRE and Javascript regex. However, I suspect this really java regex, which has different syntax. Given the code uses PathMatcher, it's java regex. Pinot should link to a regex tester that will be accurate4️⃣ Can you provide some examples of the absolute path I should be matching to? I've submitted an ingestion job spec that has
includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$
I have an s3 file with the following name at the path:
s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet
According to regex101.com, this is a match using Java8 syntax: https://regex101.com/r/9ZKOhm/1
It's unclear to me what I'm doing wrong that's causing this pattern to not match.