brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

zar import and zar zq S3 support #881

Closed: alfred-landrum closed this issue 4 years ago

alfred-landrum commented 4 years ago

As a first step toward zar S3 support, make zar import and zar zq work with an S3 data storage location.

philrz commented 4 years ago

Verified in zq version 59b4bcc.

Note that when used with S3, ZAR_ROOT currently still needs to point at a local directory, which stores the metadata file. The -data option is used during zar import to specify the S3 destination for the data files. As shown below, subsequent calls to zar zq find the data in the bucket thanks to the entries in the metadata file.

$ export ZAR_ROOT="$(pwd)"

$ ls -l
[no output... current local directory starts out empty]

$ aws s3 ls s3://zq-881
[no output... S3 bucket starts out empty]

$ zq ~/work/zq-sample-data/zng/*.gz | zar import -data s3://zq-881 -s 25MB -

$ ls -l
total 8
-rw-------  1 phil  staff  995 Jul 20 19:31 zar.json

$ cat zar.json | jq .
{
  "version": 0,
  "data_path": "s3://zq-881",
  "log_size_threshold": 25000000,
  "data_sort_direction": false,
  "spans": [
    {
      "span": {
        "ts": {
          "sec": 1521912792,
          "ns": 331503000
        },
        "dur": {
          "sec": 197,
          "ns": 827263001
        }
      },
      "log_id": "20180324/1521912990.158766.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521912549,
          "ns": 366782000
        },
        "dur": {
          "sec": 242,
          "ns": 962024001
        }
      },
      "log_id": "20180324/1521912792.328806.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521912335,
          "ns": 728195000
        },
        "dur": {
          "sec": 213,
          "ns": 638203001
        }
      },
      "log_id": "20180324/1521912549.366398.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521912152,
          "ns": 519494000
        },
        "dur": {
          "sec": 183,
          "ns": 208346001
        }
      },
      "log_id": "20180324/1521912335.72784.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521911975,
          "ns": 778000000
        },
        "dur": {
          "sec": 176,
          "ns": 740493001
        }
      },
      "log_id": "20180324/1521912152.518493.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521911841,
          "ns": 543641000
        },
        "dur": {
          "sec": 134,
          "ns": 233828001
        }
      },
      "log_id": "20180324/1521911975.777469.zng"
    },
    {
      "span": {
        "ts": {
          "sec": 1521911720,
          "ns": 600725000
        },
        "dur": {
          "sec": 120,
          "ns": 942916001
        }
      },
      "log_id": "20180324/1521911841.543641.zng"
    }
  ],
  "indexes": {}
}
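
One pattern worth noting in the metadata above: in every span shown, ts plus dur, truncated to microseconds, matches the timestamp in log_id (with trailing zeros dropped, as in 1521912335.72784). The Python sketch below checks that observation against two of the spans. The rule is inferred from this output only, not from the zar source, and expected_log_stem is a hypothetical helper name.

```python
NS_PER_SEC = 1_000_000_000


def expected_log_stem(span: dict) -> str:
    """Return 'sec.usec' for the end of a span (ts + dur), truncated to microseconds.

    Assumption: this naming rule is inferred from the zar.json output above;
    it is not taken from the zar source code.
    """
    start_ns = span["ts"]["sec"] * NS_PER_SEC + span["ts"]["ns"]
    dur_ns = span["dur"]["sec"] * NS_PER_SEC + span["dur"]["ns"]
    sec, usec = divmod((start_ns + dur_ns) // 1_000, 1_000_000)
    # Trailing zeros in the fraction are dropped, e.g. 727840 -> "72784".
    # Falling back to "0" for a whole second is a guess; no such case appears above.
    frac = f"{usec:06d}".rstrip("0") or "0"
    return f"{sec}.{frac}"


# Two entries copied from the "spans" array above.
entries = [
    {"span": {"ts": {"sec": 1521912792, "ns": 331503000},
              "dur": {"sec": 197, "ns": 827263001}},
     "log_id": "20180324/1521912990.158766.zng"},
    {"span": {"ts": {"sec": 1521912152, "ns": 519494000},
              "dur": {"sec": 183, "ns": 208346001}},
     "log_id": "20180324/1521912335.72784.zng"},
]

for entry in entries:
    # Strip the date directory and the .zng suffix before comparing.
    stem = entry["log_id"].split("/")[-1][:-len(".zng")]
    print(stem, expected_log_stem(entry["span"]) == stem)
```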

$ aws s3 ls --recursive s3://zq-881
2020-07-20 19:31:07   23094074 20180324/1521911841.543641.zng
2020-07-20 19:31:01   25476221 20180324/1521911975.777469.zng
2020-07-20 19:30:58   25483926 20180324/1521912152.518493.zng
2020-07-20 19:30:52   25453122 20180324/1521912335.72784.zng
2020-07-20 19:30:48   25499352 20180324/1521912549.366398.zng
2020-07-20 19:30:42   25478311 20180324/1521912792.328806.zng
2020-07-20 19:30:34   25483642 20180324/1521912990.158766.zng

$ zar zq "count()" _ | zq -f text "sum(count)" -
1462078
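
The agreement between zar.json and the bucket contents can also be checked mechanically. The Python sketch below parses the aws s3 ls --recursive output shown above and compares the object keys with the log_id values from the metadata. The function name parse_s3_listing is hypothetical, and the four-column layout (date, time, size, key) is assumed from this listing.

```python
def parse_s3_listing(text: str) -> dict:
    """Map object key -> size in bytes from `aws s3 ls --recursive` output."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed columns: date, time, size, key (a key may contain spaces).
        _date, _time, size, key = line.split(None, 3)
        sizes[key] = int(size)
    return sizes


# Listing copied from the transcript above.
LISTING = """\
2020-07-20 19:31:07   23094074 20180324/1521911841.543641.zng
2020-07-20 19:31:01   25476221 20180324/1521911975.777469.zng
2020-07-20 19:30:58   25483926 20180324/1521912152.518493.zng
2020-07-20 19:30:52   25453122 20180324/1521912335.72784.zng
2020-07-20 19:30:48   25499352 20180324/1521912549.366398.zng
2020-07-20 19:30:42   25478311 20180324/1521912792.328806.zng
2020-07-20 19:30:34   25483642 20180324/1521912990.158766.zng
"""

# log_id values copied from the "spans" array in zar.json.
LOG_IDS = [
    "20180324/1521912990.158766.zng",
    "20180324/1521912792.328806.zng",
    "20180324/1521912549.366398.zng",
    "20180324/1521912335.72784.zng",
    "20180324/1521912152.518493.zng",
    "20180324/1521911975.777469.zng",
    "20180324/1521911841.543641.zng",
]

sizes = parse_s3_listing(LISTING)
missing = [log_id for log_id in LOG_IDS if log_id not in sizes]
extra = [key for key in sizes if key not in LOG_IDS]
print("missing from bucket:", missing)
print("not in metadata:", extra)
print("objects:", len(sizes), "total bytes:", sum(sizes.values()))
```

For the listing above, this reports seven objects totaling 175968648 bytes, with nothing missing on either side.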

A couple loose ends for possible consideration:

  1. (Now tracked in #1030) At first I mistakenly thought I should set ZAR_ROOT to the s3:// URL. zar import accepted this and populated the bucket, but wrote the metadata to a local path beginning with a directory named s3:. This still basically works on my Mac, but it is not the intended usage, and pathnames containing colons are invalid on Windows, so this feels like a bug.
$ export ZAR_ROOT="s3://zq-881"

$ ls -l
[no output... current local directory starts out empty]

$ aws s3 ls s3://zq-881
[no output... S3 bucket starts out empty]

$ zq ~/work/zq-sample-data/zng/*.gz | zar import -s 25MB -

$ tree -s
.
└── [         96]  s3:
    └── [         96]  zq-881
        └── [        995]  zar.json

2 directories, 1 file

$ aws s3 ls --recursive s3://zq-881
2020-07-20 19:42:52   23094074 20180324/1521911841.543641.zng
2020-07-20 19:42:45   25476221 20180324/1521911975.777469.zng
2020-07-20 19:42:40   25483926 20180324/1521912152.518493.zng
2020-07-20 19:42:35   25453122 20180324/1521912335.72784.zng
2020-07-20 19:42:30   25499352 20180324/1521912549.366398.zng
2020-07-20 19:42:26   25478311 20180324/1521912792.328806.zng
2020-07-20 19:42:17   25483642 20180324/1521912990.158766.zng

$ zar zq "count()" _ | zq -f text "sum(count)" -
1462078

  2. (Now tracked in #1031) While zar help import mentions the -data option, it does not cover how the option is used with S3, which contributed to my stumbling into the mistake I just mentioned.
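
The stray s3: directory in the first item is what one would expect if the ZAR_ROOT value is treated as a plain filesystem path. The Python sketch below illustrates this with pathlib and shows one possible guard that rejects an s3:// root up front. check_zar_root is a hypothetical helper, not part of zar, and the guard is only one way the problem tracked in #1030 might be handled.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Treated as a POSIX path, the URL splits into the directories seen in `tree`.
print(PurePosixPath("s3://zq-881").parts)


def check_zar_root(root: str) -> str:
    """Return root unchanged if it looks local; raise if it is an s3:// URL.

    Hypothetical guard: per the comment above, ZAR_ROOT must be a local
    directory and the S3 destination belongs in the -data option.
    """
    if urlparse(root).scheme == "s3":
        raise ValueError(
            f"ZAR_ROOT={root!r} is an S3 URL; use a local directory "
            "and pass the bucket with -data"
        )
    return root


print(check_zar_root("/tmp/zar-root"))
try:
    check_zar_root("s3://zq-881")
except ValueError as err:
    print("rejected:", err)
```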

But otherwise, it's working. Thanks @mattnibs!

mattnibs commented 4 years ago

Example of using zar import to send data to an S3 bucket:

$ zar import -R ./root -data s3://bucket/key ./mydata.zng