brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

"zar find" not returning results as of zq commit 22a5eca #1125

Closed philrz closed 4 years ago

philrz commented 4 years ago

While making updates to the zar README, I noticed that one of the zar find command lines is no longer returning output. The change in behavior starts with commit 22a5eca, which was associated with #1110.

Going back to commit 90dddfc when this last worked and executing the relevant commands from the README:

$ zq zng/*.gz | zar import -s 25MB -
$ zar index :ip
file:///Users/phil/logs/20180324/1521912990.158766.zng: creating index file:///Users/phil/logs/20180324/1521912990.158766.zng.zar/zdx-type-ip
file:///Users/phil/logs/20180324/1521912507.399929.zng: creating index file:///Users/phil/logs/20180324/1521912507.399929.zng.zar/zdx-type-ip
file:///Users/phil/logs/20180324/1521912075.114273.zng: creating index file:///Users/phil/logs/20180324/1521912075.114273.zng.zar/zdx-type-ip
file:///Users/phil/logs/20180324/1521911772.980384.zng: creating index file:///Users/phil/logs/20180324/1521911772.980384.zng.zar/zdx-type-ip

$ zar find -z -x zdx-type-ip 10.47.21.138 | zq -t -
#zfile=string
#0:record[key:ip,count:uint64,_log:zfile]
0:[10.47.21.138;10;/Users/phil/logs/20180324/1521912507.399929.zng;]
0:[10.47.21.138;3;/Users/phil/logs/20180324/1521912075.114273.zng;]
0:[10.47.21.138;1;/Users/phil/logs/20180324/1521911772.980384.zng;]

However, advancing to commit 22a5eca and repeating the same:

$ zq zng/*.gz | zar import -s 25MB -
$ zar index :ip
file:///Users/phil/logs/20180324/1521912990.158766.zng: creating index file:///Users/phil/logs/20180324/1521912990.158766.zng.zar/zdx-type-ip.zng
file:///Users/phil/logs/20180324/1521912507.399929.zng: creating index file:///Users/phil/logs/20180324/1521912507.399929.zng.zar/zdx-type-ip.zng
file:///Users/phil/logs/20180324/1521912075.114273.zng: creating index file:///Users/phil/logs/20180324/1521912075.114273.zng.zar/zdx-type-ip.zng
file:///Users/phil/logs/20180324/1521911772.980384.zng: creating index file:///Users/phil/logs/20180324/1521911772.980384.zng.zar/zdx-type-ip.zng

$ zar find -z -x zdx-type-ip 10.47.21.138 | zq -t -
item does not exist

mccanne commented 4 years ago

The problem here is that I changed the naming convention for microindex files, since each microindex is now a single file. Before, the name "foo" referred to the microindex comprising foo.zng, foo.1.zng, etc. Now you just say "foo.zng".

So if you run it this way, it should work:

$ zar find -z -x zdx-type-ip.zng 10.47.21.138 | zq -t -
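To illustrate the renaming, here is a minimal sketch in Python of how an old-style name could be mapped to the new single-file name. The helper `microindex_file_name` is hypothetical and not part of zar; it only assumes, based on the output above, that the new convention is the old base name plus a ".zng" suffix.

```python
def microindex_file_name(name: str, suffix: str = ".zng") -> str:
    """Map a microindex name to its single-file form.

    Hypothetical helper, not part of zar. Assumes the new convention is
    simply the old base name with ".zng" appended.
    """
    # Names that already carry the suffix are returned unchanged.
    if name.endswith(suffix):
        return name
    return name + suffix


# Old-style name as used in the README before commit 22a5eca.
print(microindex_file_name("zdx-type-ip"))      # zdx-type-ip.zng
# New-style name passes through untouched.
print(microindex_file_name("zdx-type-ip.zng"))  # zdx-type-ip.zng
```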

This brings up a question I have about the UX of the CLI commands here. When I first prototyped things, the idea was that you could say

zar index :type

Then say

zar find :type=value

and zar handled the creation of the microindex file names for you.

But with -x, now you have to specify the microindex. So in the "index" pass, zar is implicitly generating the file name from the indexing rule, but in the "find" path, you have to explicitly specify it.

This doesn't feel right to me. I think either the file name should be generated in both cases, or it should be explicitly specified in both cases.
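One way to picture the "generated in both cases" option is a single naming function shared by the index and find paths. The Python sketch below is only an illustration: `index_file_for_rule` and `parse_find_pattern` are hypothetical names, the "zdx-type-" prefix and ".zng" suffix are taken from the `zar index :ip` output earlier in this issue, and only type rules of the form ":type" are handled.

```python
def index_file_for_rule(rule: str) -> str:
    """Derive the microindex file name from a type rule such as ":ip".

    Hypothetical helper. The "zdx-type-<type>.zng" pattern is inferred from
    the `zar index :ip` output in this issue, not from zar's source.
    """
    if not rule.startswith(":") or len(rule) == 1:
        raise ValueError(f"unsupported index rule: {rule!r}")
    return f"zdx-type-{rule[1:]}.zng"


def parse_find_pattern(pattern: str) -> tuple[str, str]:
    """Split a ":type=value" pattern into (microindex file name, value)."""
    rule, sep, value = pattern.partition("=")
    if not sep or not value:
        raise ValueError(f"expected :type=value, got {pattern!r}")
    # The find path reuses the same naming function as the index path,
    # so the user never has to spell out the file name.
    return index_file_for_rule(rule), value


# Index path: the rule alone determines the file name.
print(index_file_for_rule(":ip"))
# Find path: the same rule plus a value to look up.
print(parse_find_pattern(":ip=10.47.21.138"))
```

Because both paths call the same function, a future change to the naming convention would not require users to change the arguments they pass.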

philrz commented 4 years ago

Thanks for the clarification @mccanne. I now know how to change the README, so I'll close this one as "invalid". We can open a separate issue about what you described in your closing statement, as I agree it's a little odd at the moment. I also expect the error messages could be improved. FWIW I completely misinterpreted the "item does not exist" error, thinking it meant that the microindex files had been searched but the value was not found.
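As a rough sketch of what a less ambiguous error could look like, the Python below separates "the named microindex is missing" from "the microindex was searched and the value is absent". Everything here is hypothetical: `MicroindexNotFoundError` and `find_key` are illustrative names, and the in-memory dict stands in for real microindex files.

```python
class MicroindexNotFoundError(LookupError):
    """Raised when the named microindex itself cannot be found."""

    def __init__(self, name: str, available: list[str]):
        self.name = name
        self.available = available
        listing = ", ".join(sorted(available)) or "none"
        super().__init__(f"microindex {name!r} not found; available: {listing}")


def find_key(indexes: dict[str, dict[str, int]], index_name: str, key: str):
    """Look up key in the named microindex.

    Returns the stored count, or None when the index exists but does not
    contain the key. A missing index raises instead, so the two cases
    cannot be confused.
    """
    if index_name not in indexes:
        raise MicroindexNotFoundError(index_name, list(indexes))
    return indexes[index_name].get(key)


# In-memory stand-in for one log's microindexes (hypothetical data shape).
indexes = {"zdx-type-ip.zng": {"10.47.21.138": 10}}

print(find_key(indexes, "zdx-type-ip.zng", "10.47.21.138"))  # 10
print(find_key(indexes, "zdx-type-ip.zng", "192.168.0.1"))   # None

try:
    find_key(indexes, "zdx-type-ip", "10.47.21.138")
except MicroindexNotFoundError as err:
    print(err)
```

With this split, asking for the old-style name reports that the index file is missing and lists the names that do exist, rather than suggesting an empty search result.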

mccanne commented 4 years ago

Also, this speaks to the larger problem of how meta-information about what has been indexed is stored. Previously, I was encoding it in the name of the microindex, but I see it now lives in zar.json. At some point, we'll have to segment parts of zar.json on a finer basis, e.g., per-day.