Closed diogobaeder closed 2 years ago
cc @xiangfu0
If I remember correctly, this match works on the entire absolute path for input files, not the relative path.
That's why I used **/
to match the prefix of the full path.
@xiangfu0 is correct, you need to use glob:**/foo.json
See https://www.digitalocean.com/community/tools/glob?comments=true&glob=%2A%2A%2F%2A.json&matches=false&tests=%2Fsome%2Fdir%2Ffoo.json for a useful site to debug glob patterns.
@diogobaeder - why do you think glob:**/foo.json
is inefficient? You are restricting the match space to what's at your inputDirURI
.
Ahhh, got it! I didn't know it was for the absolute path - I think the combination of inputDirURI
and includeFileNamePattern
makes this a bit confusing, as there's no mention to the fact in the docs ( https://docs.pinot.apache.org/configuration-reference/job-specification#top-level-spec ).
Can we change the docs there, then, and put a note that the glob and regex both apply to the absolute path, and are not taken relative to inputDirURI
? This will probably avoid the same confusion with other users.
@kkrugler because it might have to traverse through all matching file paths according to the glob, and then filter in the ones that match inputDirURI
, no? I mean, this is just speculation really, so I might be totally wrong as I don't know the implementation, but it's what my sense tells me the code is probably doing. Anyway, even if this is what happens, it's probably not a big deal (unless there's a case of traversing directories with too many paths, which depending on the implementation might be a bit slow).
@diogobaeder Pinot will create a PathMatcher
from the given file name pattern, then iterate over all the files under the provided data dir and match it with the PathMatcher
. So the traverse will be just over the inputDirURI
.
Good suggestion on adding a note to the doc to avoid the confusion. Do you want to help contribute that? You may submit a PR to this repo: https://github.com/pinot-contrib/pinot-docs
Ah, nice! Sure, I'll try to do that change ASAP, thanks guys!
There you go, folks: https://github.com/pinot-contrib/pinot-docs/pull/101 - sorry I forgot to link to GitHub issues (and couldn't find out how to do this).
Thanks @diogobaeder Merged the change to the documentation repo
Thanks @Jackie-Jiang :-)
Hi folks,
This seems to be a bug to me, so I'm opening an issue here right away (instead of bringing this to Slack first).
What I want to do is to be able to run a batch job that uses just one file as the input; Suppose that I have a directory,
/some/dir
, and then I have a segment data file there,foo.json
. So I haveinputDirURI
set to/some/dir
. But then, if I setincludeFileNamePattern
to any of:regex:foo.json
regex:foo\.json
glob:foo.json
glob:*/foo.json
neither of them work.
The only way I can ever get this to work is by using
glob:**/foo.json
, which is counter-intuitive and probably inefficient - I really only need to process one file per batch in my case. Sure, it works, but I think this has some room for improvement, right? Or maybe there should at least be an alternative likefile:foo.json
?Thanks!