brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

zar index leaves empty files upon indexing error #987

Closed henridf closed 4 years ago

henridf commented 4 years ago

This was described in https://github.com/brimsec/zq/issues/815#issuecomment-651423326. Splitting out into its own bug:

$ zar import multitype.tzng 

$ zar index orig_h
/Users/phil/logs/20091119/1258594908.85978.zng: creating index /Users/phil/logs/20091119/1258594908.85978.zng.zar/zdx-field-orig_h
type of orig_h field changed from string to ip

$ tree -s $ZAR_ROOT
/Users/phil/logs
├── [        128]  20091119
│   ├── [         93]  1258594908.85978.zng
│   └── [         96]  1258594908.85978.zng.zar
│       └── [          0]  zdx-field-orig_h.zng
└── [        197]  zar.json

2 directories, 3 files

$ zar find orig_h=192.168.2.1
/Users/phil/logs/20091119/1258594908.85978.zng.zar/zdx-field-orig_h: /Users/phil/logs/20091119/1258594908.85978.zng.zar/zdx-field-orig_h: cannnot read zdx header

mccanne commented 4 years ago

This raises good questions. If we index and get an error, the file should not be left behind. But if we index a chunk, and there were no instances of the key, then I think we want a microindex file with an "empty" index trailer so the microindex pruner can skip a search for said key. In this case, I think an empty trailer wants to include an empty sections[] array and a TypeNull for the key type since we don't know the type of an unencountered key.
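
To make the "empty trailer" idea concrete, here is a minimal Go sketch. It is not taken from the zar source: the `Trailer` type, its field names, and the `EmptyTrailer` and `CanSkip` helpers are hypothetical, and the key type is modeled as a plain string rather than a real type value. It only illustrates how a trailer with a null key type and an empty sections array could tell a pruner that the key never appeared in the chunk.

```go
package main

import "fmt"

// Trailer is a hypothetical stand-in for a microindex trailer.
// The field names are illustrative, not the actual zar/zdx layout.
type Trailer struct {
	KeyType  string  // "null" when the key was never encountered
	Sections []int64 // section sizes; empty when nothing was indexed
}

// EmptyTrailer describes a chunk that contained no instances of the key.
// The key type is "null" because the type of an unseen key is unknown.
func EmptyTrailer() Trailer {
	return Trailer{KeyType: "null", Sections: []int64{}}
}

// CanSkip reports whether a pruner could skip searching this microindex
// because the trailer records that the key never appeared.
func CanSkip(t Trailer) bool {
	return t.KeyType == "null" && len(t.Sections) == 0
}

func main() {
	fmt.Println(CanSkip(EmptyTrailer()))
	fmt.Println(CanSkip(Trailer{KeyType: "ip", Sections: []int64{4096}}))
}
```

The point of the design is that "no matches in this chunk" becomes a valid, readable index state, distinct from a zero-byte file that cannot be parsed at all.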

#1110 doesn't handle this. So after that is merged, I will put up a PR for this change, which should also handle the issue at hand here (i.e., not leaving behind microindexes when indexing fails).

philrz commented 4 years ago

Verified in zar commit b7023d5.

Repeating the steps, at this point we can see there's no longer an empty file left behind:

$ tree -s $ZAR_ROOT
/Users/phil/logs
├── [        128]  20091119
│   ├── [        116]  1258594909.85978.zng
│   └── [         64]  1258594909.85978.zng.zar
└── [        248]  zar.json

2 directories, 2 files

That means if the user marched ahead and tried to query against the index anyway, they'd now get a different error message:

$ zar find orig_h=192.168.2.1
item does not exist

This error message is problematic in its own way, but this has already been identified and is tracked in #1137.

Thanks @mccanne!