TritonDataCenter / mako-gc-feeder

A program to generate listings of instruction objects and feed them to the garbage-collection pipeline of the Manta object storage system.

initial review notes #1

Open davepacheco opened 5 years ago

davepacheco commented 5 years ago

I'm filing this issue as a place to put feedback from my initial review. I reviewed commit d5fb9d6f5f17f95773eebf846f16120f9423b3ea.

First, this is great -- a lot of work for a short time! Most of my major feedback relates to performance and operational considerations.

Here are the major issues I think we definitely want to address:

Here are the operational considerations we'll need to answer:

From the above: I think it would be safest if we can deploy this to a new CN on the Manta network.

I'm not sure how big an issue this will be:

jmwski commented 5 years ago
  • The sqlite update and write to the data files should probably happen once per batch, not per record. I think this will have a pretty major impact on both CPU usage and performance due to not having to fsync the sqlite file for each path written.

I've added a checkpoint prototype function that is called once per 'end' event received from findObjects.
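Roughly, the flow now looks like the sketch below. This is illustrative only: appendPathsToListing, fetchNextBatch, and the stream_position schema are stand-ins rather than the actual code, and the sqlite calls assume the node-sqlite3 API.

    var req = client.findObjects(bucket, filter, opts);

    req.on('record', function (record) {
        // Buffer paths in memory; nothing is written per record.
        batch.push(record.key);
    });

    req.on('end', function () {
        // Flush the whole batch, then advance the stream position once.
        appendPathsToListing(batch);                      // stand-in helper
        db.run('UPDATE stream_position SET marker = ?',   // stand-in schema
            lastMarker, function (err) {
                if (err) { /* handle or retry */ }
                batch = [];
                fetchNextBatch();                         // stand-in helper
            });
    });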

  • I'd start with a batch size of 5K or 10K. The data path is limited to 1K, but I think we can push this a little further.

Agreed, this will remain the default batch size.

  • I'd start with f_delay = 0. It's good that we have this option in there, but we should have just one client per shard, and having one request outstanding all the time seems reasonable.

Done.

  • The shard number needs to appear somewhere in the database, either in the table itself or in the name of the database file. Otherwise, concurrent instances of this program will stomp on each other. I'd probably recommend putting this into the database file name for reasons described later. (Another option would be to take the database name as an argument and put the shard name into the table.)

Database filenames are now formatted like 2.moray.orbit.example.com-stream_position.db. Each process will have one client per shard, and that client will be the exclusive reader and writer of this database.
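For illustration, the filename is derived from the shard's srvDomain, roughly like this (the variable names here are made up, not from the code):

    var mod_path = require('path');

    // e.g. dbDir + '/2.moray.orbit.example.com-stream_position.db'
    var dbFile = mod_path.join(dbDir, shardSrvDomain + '-stream_position.db');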

  • How many shards will we assign to each process?

I've made this configurable by adding a shards array to the config file accepted by the feeder.

  • How will we orchestrate the processes?

I think having an array of srvDomains for shards we want to assign to the feeder is reasonable. This could probably look similar to GC_ASSIGNED_SHARDS.
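Pulling that together with the batch size and f_delay options above, the relevant part of the config might end up looking something like this (illustrative only; key names other than shards and f_delay are not settled):

    {
        "shards": [
            "2.moray.orbit.example.com",
            "3.moray.orbit.example.com"
        ],
        "batch_size": 10000,
        "f_delay": 0
    }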

  • Does this program leak memory?

Do you have a good litmus test for this? My first thought would be to run the process for a while and watch the output of prstat -s rss to check whether the RSS grows without bound. I have tried to ensure that all resources are closed in the state_done method and that the feeder always ends up in that state eventually.

I will make sure that the req object returned from node-moray's findObjects is not leaked. I would expect these to be garbage-collected after they emit the 'end' event.

  • What's the bottleneck?

I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?

  • It may be quite expensive to use appendSync every time instead of just keeping a file open and appending to it. We can try profiling first and seeing.

I have changed the feeder to use a separate writeStream for each instruction object listing (1 per storage node). The feeder will create these streams when it starts up and close them in the done state. Do you think that running in a Manta with many storage nodes (~1,000) might lead to memory issues? I'm not sure how large the objects associated with each stream are, but I could look into that. Do you have a suggestion for how I could find this information?
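As a sketch of what I mean (the helper names here are illustrative): the feeder opens one append-mode stream per storage node at startup, writes to it as records arrive, and closes all of them in the done state.

    var mod_fs = require('fs');

    var streams = {};

    // At startup: one append-mode stream per storage node's listing file.
    storageNodes.forEach(function (node) {
        streams[node] = mod_fs.createWriteStream(
            listingPathFor(node),          // stand-in helper
            { flags: 'a' });
    });

    // Per record: append to the right listing without reopening the file.
    function appendInstruction(node, line) {
        streams[node].write(line + '\n');
    }

    // In the done state: flush and close everything.
    Object.keys(streams).forEach(function (node) {
        streams[node].end();
    });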

davepacheco commented 5 years ago

For both the memory leak check and CPU profiling, it may make sense to sanity-check it in a test environment, but the results could be different in prod (particularly for CPU profiling), so I wouldn't necessarily spend a ton of time on it. I think it would be better to prioritize making it possible to iterate in prod.

  • Does this program leak memory?

Do you have a good litmus test for this?

I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.

  • What's the bottleneck? I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?

I would use prstat to verify whether it's on CPU (and in userland) very much. If so, then I'd profile with:

# dtrace -n 'profile-97/pid == $YOURPID && arg1/{ @[ustack(80, 8192)] = count(); }' -c 'sleep 60' > stacks.out

This will profile at 97Hz for 60 seconds. Then I'd copy this to your laptop and use stackvis:

# stackvis dtrace flamegraph-d3 < stacks.out > stacks.htm

to generate a flame graph. You can open this in your browser.

There are a few notes about this process:

jmwski commented 5 years ago

I kicked off a test run of the latest commit in us-east (409a14c) in our ops zone there.

I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.

After writing out around 50,000 instructions, the process held steady at about 56M RSS (64M SIZE):

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
494351 root       64M   56M sleep    1    0   0:00:09 0.0% node/11

Here's an example of the output of prstat -p <PID> -mLc 1:

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
494351 root     1.4 0.3 0.0 0.0 0.0 0.0  98 0.1  46  35 359   0 node/1
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  18   3  48   0 node/9
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  16   1  48   0 node/7
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  17   1  48   0 node/8
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0  12   0  30   0 node/10
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/11
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.03, 0.08, 0.08

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph.htm. If I'm reading it correctly, this process is bottlenecked on CRC checksum validation when decoding fast protocol message buffers in the node-moray client.

This data is from a process assigned one shard in us-east. Roughly, the process was able to write 300,000 instructions to a listing file in 45 minutes of execution. Listing 1,000,000 instruction objects at this rate would take 150 minutes (around 2.5 hours).

jmwski commented 5 years ago

Here are some datapoints from a lister process running with two assigned shards in us-east:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
544199 root       68M   61M sleep   59    0   0:00:33 0.1% node/11

RSS holds steady at 61M (slightly higher than the process running with 1 shard). This process spends slightly more time in user mode:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
544199 root     3.6 0.7 0.1 0.0 0.0 0.0  95 0.3 130  84 975   0 node/1
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.1  65   4 192   0 node/9
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.1  48  13 142   0 node/10
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.0  59   2 173   0 node/11
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/8
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.13, 0.16, 0.14

Here's the flamegraph for a process with two shards assigned to it: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-2-shards.htm.

After 17 minutes of execution, the process juggling 2 shards wrote out 395,549 instructions (this is a sum across multiple storage nodes).

jmwski commented 5 years ago

4 shards: user time is substantially higher with 4 shards assigned:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
561881 root      13 1.8 0.1 0.0 0.0 0.0  84 0.8 898 281  5K   0 node/1
561881 root     0.1 5.9 0.0 0.0 0.0  94 0.0 0.3 274 115 811   0 node/8
561881 root     0.2 5.5 0.0 0.0 0.0  94 0.0 0.4 259 153 737   0 node/9
561881 root     0.1 4.9 0.0 0.0 0.0  95 0.0 0.2 223  49 659   0 node/10
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/11
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.79, 0.47, 0.28

RSS is not significantly affected:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
561881 root       65M   58M sleep   59    0   0:00:08 0.1% node/11

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-4-shards.htm

8 shards: there's significantly more USR time here:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
565813 root      17 2.4 0.2 0.0 0.0 0.0  78 1.7  1K 534  7K   0 node/1
565813 root     0.2 1.9 0.0 0.0 0.0  98 0.0 0.3 366  62  1K   0 node/10
565813 root     0.2 1.8 0.0 0.0 0.0  98 0.0 0.2 355  36  1K   0 node/9
565813 root     0.2 1.7 0.0 0.0 0.0  98 0.0 0.3 344  66 987   0 node/11
565813 root     0.1 1.7 0.0 0.0 0.0  98 0.0 0.3 323  58 926   0 node/8
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.43, 0.28, 0.20

But RSS seems to stabilize without much growth compared to the 4-shard case:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
565813 root       76M   71M sleep   58    0   0:00:54 0.7% node/11

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-8-shards.htm