brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

occasional file i/o "access is denied" errors during pcap ingest in windows CI with brim #613

Closed mikesbrown closed 4 years ago

mikesbrown commented 4 years ago

Several tests for Brim / zq integration involve uploading a pcap through Brim and then doing something with it. In one run, there were i/o errors that prevented ingest from succeeding.

One was

CreateFile
C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\
scoped_dir596_1117112332\\upload596_235124938\\
sample.pcap.brim\\.tmp.ingest\\capture_loss.log:
Access is denied.

The other was

rename C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\
scoped_dir940_1959449935\\upload940_1015111360\\
sample.pcap.brim\\all.bzng.tmp 

C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\
scoped_dir940_1959449935\\upload940_1015111360\\
sample.pcap.brim\\all.bzng: Access is denied

The bug is filed here as the error is happening in zqd and being propagated to Brim.

The file being ingested is sample.pcap from brim.git.

The logs of the run are available here.

brim was at https://github.com/brimsec/brim/commit/1890b07c91cf63f8d64fdb5c764229046ee6bc8e . zq was at tag v0.11.1. Note that windows integration testing is quite new; we don't have many past data points. But so far this is the only known instance.

mikesbrown commented 4 years ago

Happened here https://github.com/brimsec/brim/runs/637035303?check_suite_focus=true . Confirmed with screen shot. The zq hash is e7b866d which is an experimental branch. No changes were made to zq itself on this branch. The parent hash from master is e40934a

alfred-landrum commented 4 years ago

thanks for pointer @mikesbrown - I did pull the archive for that test run. The logs have very little data to help with the issue, I may need to add some to help triage.

mikesbrown commented 4 years ago

https://github.com/brimsec/brim/actions/runs/95611919

Looks like this can now manifest with a user-facing message like this:

Detail: D:\var\tmp\brimsec\itest\histogram.test.js\1\data\spaces\sample.pcap.brim\.tmp.ingest\conn.log: line 1: bad format

In histogram.test.js/1/logs/zqd-core.log I still see the same story as before:

{
  "level": "warn",
  "ts": 1588625148.0924168,
  "msg": "error",
  "request_id": "4",
  "status": 500,
  "error": "CreateFile D:\\var\\tmp\\brimsec\\itest\\histogram.test.js\\1\\data\\spaces\\sample.pcap.brim\\.tmp.ingest\\capture_loss.log: Access is denied."
}

This was zq hash 335d284.

One thing I noticed is that histogram.test.js/1/data doesn't even exist. All the other data dirs, for passing tests, do.

alfred-landrum commented 4 years ago

I believe this is because the rename operation occurs while a search is active on the space. At the filesystem level, it means the rename (MoveEx in the win32 api) occurs while there is an open handle that was not called with FILE_SHARE_DELETE. It is not possible to specify that flag with the stdlib os.Open call. (There has been quite a lot of discussion about how to handle this in the Go community; see https://github.com/golang/go/issues/32088 and https://github.com/golang/go/issues/34681).

The current "best" way to get Posix like semantics for opens not interfering with renames is to essentially copy all of the code from stdlib involved in file opens, with the added ability to specify FILE_SHARE_DELETE. You can see an example of that in moby/moby (basis of Docker): https://github.com/moby/moby/commit/a5f237c2b54177ffe45cf371461db1892e092452

Alternatively, we could protect against this above the filesystem, with reader/writer protection against overwriting the all.zng file. @mattnibs is currently working on revamping the storage interface to spaces, which will centralize access to the file, where we could implement the protection scheme.

However, since this will likely not be the last time we expect Windows file access to have these semantics, I lean towards taking the copy-all-os-Open approach, and making sure that all file opens in zq go through that code going forward.

alfred-landrum commented 4 years ago

We've seen another instance of this after https://github.com/brimsec/zq/pull/741 was committed.

alfred-landrum commented 4 years ago

Just an update: I've been trying to reproduce this with a brim/zq branch with more logging information, and I can't get to repro on that test branch. Since the new mingw Zeek artifact for windows, it also seems like the frequency of occurrence has gone down, but not decreased. Here's a recent example of it: https://github.com/brimsec/brim/runs/721517441?check_suite_focus=true

mikesbrown commented 4 years ago

Happened here https://github.com/brimsec/brim/runs/740498559?check_suite_focus=true

alfred-landrum commented 4 years ago

Just FYI: this bug repro'd very sporadically, so even though I think #925 addresses it, it's hard to be sure.