apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.42k stars 1.27k forks source link

includeFileNamePattern not working as expected #8805

Closed diogobaeder closed 2 years ago

diogobaeder commented 2 years ago

Hi folks,

This seems to be a bug to me, so I'm opening an issue here right away (instead of bringing this to Slack first).

What I want to do is to be able to run a batch job that uses just one file as the input; Suppose that I have a directory, /some/dir, and then I have a segment data file there, foo.json. So I have inputDirURI set to /some/dir. But then, if I set includeFileNamePattern to any of:

neither of them work.

The only way I can ever get this to work is by using glob:**/foo.json, which is counter-intuitive and probably inefficient - I really only need to process one file per batch in my case. Sure, it works, but I think this has some room for improvement, right? Or maybe there should at least be an alternative like file:foo.json?

Thanks!

Jackie-Jiang commented 2 years ago

cc @xiangfu0

xiangfu0 commented 2 years ago

If I remember correctly, this match works on the entire absolute path for input files, not the relative path. That's why I used **/ to match the prefix of the full path.

kkrugler commented 2 years ago

@xiangfu0 is correct, you need to use glob:**/foo.json

See https://www.digitalocean.com/community/tools/glob?comments=true&glob=%2A%2A%2F%2A.json&matches=false&tests=%2Fsome%2Fdir%2Ffoo.json for a useful site to debug glob patterns.

kkrugler commented 2 years ago

@diogobaeder - why do you think glob:**/foo.json is inefficient? You are restricting the match space to what's at your inputDirURI.

diogobaeder commented 2 years ago

Ahhh, got it! I didn't know it was for the absolute path - I think the combination of inputDirURI and includeFileNamePattern makes this a bit confusing, as there's no mention to the fact in the docs ( https://docs.pinot.apache.org/configuration-reference/job-specification#top-level-spec ).

Can we change the docs there, then, and put a note that the glob and regex both apply to the absolute path, and are not taken relative to inputDirURI? This will probably avoid the same confusion with other users.

diogobaeder commented 2 years ago

@kkrugler because it might have to traverse through all matching file paths according to the glob, and then filter in the ones that match inputDirURI, no? I mean, this is just speculation really, so I might be totally wrong as I don't know the implementation, but it's what my sense tells me the code is probably doing. Anyway, even if this is what happens, it's probably not a big deal (unless there's a case of traversing directories with too many paths, which depending on the implementation might be a bit slow).

Jackie-Jiang commented 2 years ago

@diogobaeder Pinot will create a PathMatcher from the given file name pattern, then iterate over all the files under the provided data dir and match it with the PathMatcher. So the traverse will be just over the inputDirURI. Good suggestion on adding a note to the doc to avoid the confusion. Do you want to help contribute that? You may submit a PR to this repo: https://github.com/pinot-contrib/pinot-docs

diogobaeder commented 2 years ago

Ah, nice! Sure, I'll try to do that change ASAP, thanks guys!

diogobaeder commented 2 years ago

There you go, folks: https://github.com/pinot-contrib/pinot-docs/pull/101 - sorry I forgot to link to GitHub issues (and couldn't find out how to do this).

Jackie-Jiang commented 2 years ago

Thanks @diogobaeder Merged the change to the documentation repo

diogobaeder commented 2 years ago

Thanks @Jackie-Jiang :-)