brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

better memory management ux for zar import #974

Open mikesbrown opened 4 years ago

mikesbrown commented 4 years ago

I'm on a rather large system testing zar import with a 25G conn log. The system has about 60G of RAM, but zar import isn't using much of it. Worse, /tmp is on a small root partition, and the spilling done during sorting fills that partition until it runs out of disk space, so the import can't complete. I have all this RAM but no way to use it. It doesn't seem like we should start spilling so soon when there's all this memory available to be used.

Some thoughts on improvement here:

philrz commented 4 years ago

Sounds like it might also be handy if our tools could do the equivalent of a df on whatever path we're about to use as the tempdir, see how much free space is there, and warn/exit if there's inadequate free space. There'd still be exposure if we started out seeing enough space and then some other process started filling it up after the fact. But it would hopefully rule out some of the most severe problems like the one described here, and be an opportunity to call the user's attention to an option to set the spillable disk root, should we choose to add one.
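
To make that concrete, here's a minimal sketch (not actual zar/zq code) of what such a pre-flight check could look like in Go on Linux, statting the temp directory the way df does; the 25 GiB requirement and the warn-only behavior are placeholder assumptions:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// checkSpillSpace stats dir the way df does and reports an error if fewer
// than need bytes are available there.
func checkSpillSpace(dir string, need uint64) error {
	var st syscall.Statfs_t
	if err := syscall.Statfs(dir, &st); err != nil {
		return err
	}
	avail := st.Bavail * uint64(st.Bsize) // bytes available to unprivileged users
	if avail < need {
		return fmt.Errorf("only %d bytes free under %s; sort spill may fail", avail, dir)
	}
	return nil
}

func main() {
	// Hypothetical requirement of 25 GiB of spill space in the temp dir.
	if err := checkSpillSpace(os.TempDir(), 25<<30); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
}

As noted above, a check like this is only a snapshot; another process could still fill the partition after the check passes.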

philrz commented 4 years ago

Since zq also does spill-to-disk sorting, the exposure described here is not unique to zar. I don't know the innards of the code, but I assume they may share code in this regard.

Just to give an example of user experience in the Brim/zq context when this happens, here's a scenario I just created for repro. The setup was similar to what's described in https://github.com/brimsec/zq/issues/1039 with the 35 GB pcap, though in this case I upped the VM's memory to 8 GB to get around that OOM problem. I then created a 400 MB ramdisk and used it as my /tmp directory (mount -t tmpfs -o size=400m tmpfs /tmp). That constrained zq's access to temporary storage for things like the Zeek logs as they're being generated from the pcap (e.g. /tmp/zqd-pcap-ingest-202947003) and spilled-to-disk data while sorting those logs to make ZNG (e.g. /tmp/zq-sort-323602114). I dragged the pcap into Brim and watched as these temp directories appeared/grew:

# cd /tmp
# while true; do   du -sh zqd-*; du -sh zq-sort-*; du -sh . 2>/dev/null; sleep 5; done

At the moment when the first spill-to-disk sort kicked in, the pcap-ingest temp directory was just under 300 MB in size, so the spill-to-disk data was big enough to exceed what was available. This is what I experienced as a user in the app:

[screenshot: error message displayed in the Brim app]

Conclusion: At least we're good about accurately reporting the problem.

nwt commented 4 years ago
  • User option to make the memory usage different relative to the default from above

It's worth mentioning that zq has -sortmem for this.

  • User option to set the spillable disk root, overriding the TempDir("") to be the only choice

https://pkg.go.dev/os?tab=doc#TempDir describes how to change the directory for temporary files using an environment variable.
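
As a tiny illustration (assuming a Unix-like system, where os.TempDir falls back to /tmp when $TMPDIR is unset; the /mnt/bigdisk path is hypothetical), the environment variable steers where temp and spill directories get created:

package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println(os.TempDir())           // "/tmp" when $TMPDIR is unset (Unix)
	os.Setenv("TMPDIR", "/mnt/bigdisk") // point at a larger volume
	fmt.Println(os.TempDir())           // now "/mnt/bigdisk"
	// Temp directories created relative to os.TempDir (e.g. the zq-sort-* spill
	// dirs seen earlier in this thread) would then land on the larger volume.
}

In shell terms, launching the tools with TMPDIR pointed at a roomier volume accomplishes the same thing without any code change.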

alfred-landrum commented 4 years ago

#931 is a related issue that tracks memory accounting & availability for groupby, which is another user of the sort.RunManager machinery to spill to disk.
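
For readers unfamiliar with that machinery, here's a heavily simplified, generic sketch of the spill phase of an external sort, not the actual sort.RunManager code: records accumulate in memory up to a budget, each full buffer is sorted and written to a temp file as a run, and the runs are later merged (the merge step is omitted here). The 64 MiB budget is an arbitrary placeholder:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

const memBudget = 64 << 20 // hypothetical 64 MiB in-memory budget

// spillRun sorts the buffered lines and writes them as a "run" to a temp
// file under os.TempDir(), returning the run's path.
func spillRun(lines []string) (string, error) {
	sort.Strings(lines)
	f, err := os.CreateTemp("", "sort-spill-")
	if err != nil {
		return "", err
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	for _, l := range lines {
		fmt.Fprintln(w, l)
	}
	return f.Name(), w.Flush()
}

func main() {
	var buf []string
	used := 0
	var runs []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		buf = append(buf, line)
		used += len(line)
		if used >= memBudget { // budget reached: spill a sorted run to disk
			run, err := spillRun(buf)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
			runs = append(runs, run)
			buf, used = buf[:0], 0
		}
	}
	// A real implementation would also handle the final partial buffer and
	// then k-way merge all the runs to produce the sorted output.
	fmt.Printf("spilled %d run(s) under %s\n", len(runs), os.TempDir())
}

This is why the memory budget and the temp directory location interact the way this issue describes: a smaller budget means more, smaller runs land under os.TempDir().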

marktwallace commented 4 years ago

I have a reproducible case of this problem, using an a1.large EC2 instance that has no SSD drive. For converting a 264 MB S3 object, I got the following timing:

[ec2-user@ip-172-31-23-217 ~]$ time ~/go/bin/zar import -R s3://zqd-demo-1/mark/zeek-logs/dns s3://brim-sampledata/wrccdc/zeek-logs/dns.log.gz

real    18m51.029s
user    3m39.511s
sys     0m33.529s

The real time is much longer than the CPU time. I'm thinking of adding a -sortmem flag (Noah's idea above) to see if it improves that way.

marktwallace commented 4 years ago

OK, I tried this on another EC2 instance type, c5d.large, which has an SSD and is amd64 arch. The real time was not any better, but it did take less CPU time:

[ec2-user@ip-172-31-41-236 zq]$ time ~/go/bin/zar import -R s3://zqd-demo-1/mark/zeek-logs/dns2 s3://brim-sampledata/wrccdc/zeek-logs/dns.log.gz

real    19m18.920s
user    2m1.841s
sys     0m15.223s

[ec2-user@ip-172-31-41-236 zq]$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
devtmpfs         1878544       0   1878544   0% /dev
tmpfs            1896796       0   1896796   0% /dev/shm
tmpfs            1896796     392   1896404   1% /run
tmpfs            1896796       0   1896796   0% /sys/fs/cgroup
/dev/nvme0n1p1   8376300 2588736   5787564  31% /
tmpfs             379360       0    379360   0% /run/user/1000

I included the 'df' output to show that the instance has an SSD at /dev/nvme0n1p1, and it appears that the SSD was used.

I expected this run to be faster based on the presence of the SSD for sort spill.

philrz commented 4 years ago

In the time since PR #1203 was merged, I've done some black box testing with an increased -sortmem. A summary of the detailed results below:

  1. I'd heard it said that the value of the -sortmem parameter could not be assumed to behave like a hard ceiling on the amount of memory consumed, and I'd been told of observations that the actual memory used by the process might be some multiple of that value. The results below show a multiple of almost exactly 3x.

  2. In the spirit of the opening text of this issue, I was taking on the persona of a user who assumed that some of the excess memory could surely be put to use in the interest of improving performance and reducing disk I/O. I was testing with a Google Cloud n1-standard-16 instance, which has 60 GB of memory. I also provisioned it with one local SSD to hold test data and to use as $TMPDIR for spill space. The results below, however, indicated no improvement from using an increased value of -sortmem 10000.


Here are the detailed results from a run with zar commit 2608a46. With the -sortmem 10000 setting, the operation completed in a little over 95 minutes, with a flat-ish memory peak period and an absolute peak at 30.8 GiB, indicating roughly a 3x fudge factor relative to my requested 10,000 MiB limit.

# time (zq *.log.gz | zar -sortmem 10000 import -s 25MB -)

real    95m40.528s
user    173m3.706s
sys     14m23.569s

[chart: zar memory usage over time, peaking at 30.8 GiB]

The chart of relative CPU usage between zq and zar shows an interesting pattern, apparently due to the spill-to-disk sorting happening in 10 GB increments.

[chart: relative CPU usage of zq and zar during the -sortmem 10000 run]

# while true; do   du -sh zq-sort-*;   sleep 60; done
4.0K    zq-sort-183956291
4.0K    zq-sort-183956291
7.2G    zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
10G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
20G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
30G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291
43G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
50G zq-sort-183956291
40G zq-sort-183956291
40G zq-sort-183956291

For the comparative baseline without an increased -sortmem, we observe the low/flat memory curve and also see validation of something @marktwallace had observed: it actually took slightly less time and consumed slightly less CPU when using less memory and relying on spill-to-disk in these smaller increments.

# time (zq *.log.gz | zar import -s 25MB -)

real    92m59.811s
user    160m50.622s
sys     13m34.865s

[chart: zar memory usage over time for the baseline run (low/flat curve)]

[chart: relative CPU usage of zq and zar during the baseline run]

# while true; do   du -sh zq-sort-*;   sleep 60; done
132M    zq-sort-646119493
657M    zq-sort-646119493
1.2G    zq-sort-646119493
1.8G    zq-sort-646119493
2.4G    zq-sort-646119493
3.0G    zq-sort-646119493
3.6G    zq-sort-646119493
4.4G    zq-sort-646119493
5.0G    zq-sort-646119493
5.7G    zq-sort-646119493
6.2G    zq-sort-646119493
6.8G    zq-sort-646119493
7.5G    zq-sort-646119493
8.4G    zq-sort-646119493
8.7G    zq-sort-646119493
9.4G    zq-sort-646119493
9.9G    zq-sort-646119493
11G zq-sort-646119493
11G zq-sort-646119493
12G zq-sort-646119493
12G zq-sort-646119493
13G zq-sort-646119493
14G zq-sort-646119493
14G zq-sort-646119493
15G zq-sort-646119493
16G zq-sort-646119493
16G zq-sort-646119493
17G zq-sort-646119493
18G zq-sort-646119493
19G zq-sort-646119493
19G zq-sort-646119493
20G zq-sort-646119493
20G zq-sort-646119493
21G zq-sort-646119493
21G zq-sort-646119493
22G zq-sort-646119493
22G zq-sort-646119493
23G zq-sort-646119493
24G zq-sort-646119493
24G zq-sort-646119493
25G zq-sort-646119493
25G zq-sort-646119493
25G zq-sort-646119493
26G zq-sort-646119493
26G zq-sort-646119493
27G zq-sort-646119493
28G zq-sort-646119493
28G zq-sort-646119493
29G zq-sort-646119493
29G zq-sort-646119493
30G zq-sort-646119493
30G zq-sort-646119493
31G zq-sort-646119493
32G zq-sort-646119493
32G zq-sort-646119493
33G zq-sort-646119493
34G zq-sort-646119493
34G zq-sort-646119493
35G zq-sort-646119493
35G zq-sort-646119493
36G zq-sort-646119493
36G zq-sort-646119493
37G zq-sort-646119493
37G zq-sort-646119493
38G zq-sort-646119493
38G zq-sort-646119493
39G zq-sort-646119493
40G zq-sort-646119493
40G zq-sort-646119493
41G zq-sort-646119493
42G zq-sort-646119493
42G zq-sort-646119493
43G zq-sort-646119493
44G zq-sort-646119493
44G zq-sort-646119493
45G zq-sort-646119493
46G zq-sort-646119493
47G zq-sort-646119493
47G zq-sort-646119493
48G zq-sort-646119493
48G zq-sort-646119493
49G zq-sort-646119493
49G zq-sort-646119493
50G zq-sort-646119493
50G zq-sort-646119493
48G zq-sort-646119493
39G zq-sort-646119493
34G zq-sort-646119493
29G zq-sort-646119493
23G zq-sort-646119493
17G zq-sort-646119493
12G zq-sort-646119493
5.2G    zq-sort-646119493

philrz commented 4 years ago

I'm reopening this one. We discussed it as a group and agreed there's still work left to be done here. A quick recap of where we stand relative to the opening remarks, subsequent comments, and linked PR:

When we pick up this topic, we can open separate issues for the ones above that don't already have them, perhaps attached to an Epic.

mccanne commented 2 years ago

This issue should be rewritten in the context of the zed lake. There is a larger project around memory use, so for now this is on the back burner; there is a lot of good content here to be recast into the newer architecture.