davepacheco opened this issue 5 years ago
- The sqlite update and write to the data files should probably happen once per batch, not per record. I think this will have a pretty major impact on both CPU usage and performance due to not having to fsync the sqlite file for each path written.
I've added a checkpoint prototype function that is called once per 'end' event received from findObjects.
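For concreteness, here's the rough shape of that (a minimal sketch, assuming a node-moray client and a node-sqlite3 handle; the bucket name, filter, and stream_position schema are illustrative assumptions, not the actual code):

```js
// Minimal sketch of the once-per-batch checkpoint, assuming a node-moray
// client ("client") and a node-sqlite3 handle ("db"). The bucket name,
// filter, and stream_position schema below are illustrative assumptions.
function processBatch(client, db, shard, marker, batchSize, cb) {
    var lastId = marker;
    var req = client.findObjects('manta_fastdelete_queue',
        '(_id>=' + (marker + 1) + ')',
        { limit: batchSize, sort: { attribute: '_id', order: 'ASC' } });

    req.on('record', function (record) {
        /*
         * Buffer the instruction for this record; the listing files and
         * the sqlite marker are only flushed in the 'end' handler below.
         */
        lastId = record._id;
    });

    req.on('error', cb);

    req.on('end', function () {
        /* One sqlite UPDATE (and one fsync) per batch, not per record. */
        db.run('UPDATE stream_position SET marker = ? WHERE shard = ?',
            [ lastId, shard ], cb);
    });
}
```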
- I'd start with a batch size of 5K or 10K. The data path is limited to 1K, but I think we can push this a little further.
Agreed, this will remain the default batch size.
- I'd start f_delay = 0. It's good that we have this option in there, but we should have just one client per shard, and having one request outstanding all the time seems reasonable.
Done.
- The shard number needs to appear somewhere in the database, either in the table itself or in the name of the database file. Otherwise, concurrent instances of this program will stomp on each other. I'd probably recommend putting this into the database file name for reasons described later. (Another option would be to take the database name as an argument and putting the shard name into the table.)
Database filenames are now formatted as in 2.moray.orbit.example.com-stream_position.db. Each process will have one client per shard. That client will be the exclusive writer/reader of this database.
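For illustration, deriving that path might look something like this (the base directory here is an assumption):

```js
var path = require('path');

// Hypothetical helper: build the per-shard stream-position database path.
// The "<srvDomain>-stream_position.db" suffix follows the format above;
// the base directory is an assumption.
function streamPositionDbPath(dbDir, shardSrvDomain) {
    return (path.join(dbDir, shardSrvDomain + '-stream_position.db'));
}

// streamPositionDbPath('/var/tmp/feeder', '2.moray.orbit.example.com')
//   => '/var/tmp/feeder/2.moray.orbit.example.com-stream_position.db'
```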
- How many shards will we assign to each process?
I've made this configurable by adding a shards array to the config file accepted by the feeder.
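For reference, a sketch of what that config might look like (only the shards array is described above; the specific srvDomains, the other field names, and the values are assumptions):

```json
{
    "shards": [
        "2.moray.orbit.example.com",
        "3.moray.orbit.example.com"
    ],
    "batch_size": 10000,
    "delay": 0
}
```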
- How will we orchestrate the processes?
I think having an array of srvDomains for shards we want to assign to the feeder is reasonable. This could probably look similar to GC_ASSIGNED_SHARDS.
- Does this program leak memory?
Do you have a good litmus test for this? My first thought would be to run the process for a while and look at the output of prstat -s rss to check whether the RSS runs off. I have tried to ensure that all resources are closed in the state_done method and that the feeder always ends up in that state eventually.
I will make sure that the req object returned from node-moray's findobjects is not leaked. My first thought would be that these are gc'd after they emit the 'end' event.
- What's the bottleneck?
I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?
- It may be quite expensive to use appendSync every time instead of just keeping a file open and appending to it. We can try profiling first and seeing.
I have changed the feeder to use a separate writeStream for each instruction object listing (1 per storage node). The feeder will create these streams when it starts up and close them in the done state. Do you think that running in a Manta with many storage nodes (~1,000) might lead to memory issues? I'm not sure how large the objects associated with each stream are, but I could look into that. Do you have a suggestion for how I could find this information?
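Roughly, the per-storage-node streams look like this (a sketch only; the listing directory and file naming are assumptions):

```js
var fs = require('fs');
var path = require('path');

// Open one append-mode stream per storage node at startup.
function openListingStreams(listingDir, storageIds) {
    var streams = {};
    storageIds.forEach(function (storageId) {
        streams[storageId] = fs.createWriteStream(
            path.join(listingDir, storageId + '.instructions'),
            { flags: 'a' });
    });
    return (streams);
}

// Close all listing streams in the "done" state.
function closeListingStreams(streams, cb) {
    var ids = Object.keys(streams);
    var pending = ids.length;
    if (pending === 0) {
        return (cb());
    }
    ids.forEach(function (storageId) {
        streams[storageId].end(function () {
            if (--pending === 0) {
                cb();
            }
        });
    });
}
```

On the per-stream cost: if I remember right, a Node writable file stream buffers up to its highWaterMark (16 KB by default) plus a file descriptor, so ~1,000 open streams should be on the order of tens of megabytes at worst, though that's worth verifying.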
For both the memory leak check and CPU profiling, it may make sense to sanity-check it in a test environment, but the results could be different in prod (particularly for CPU profiling), so I wouldn't necessarily spend a ton of time on it. I think it would be better to prioritize making it possible to iterate in prod.
- Does this program leak memory?
Do you have a good litmus test for this?
I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.
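For example, something like this would log a sample every 60 seconds (the interval and output file are arbitrary):
# prstat -c -p $YOURPID 60 > rss.log &
The -c flag appends new reports instead of redrawing the screen, so the log is easy to scan afterward for RSS growth.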
- What's the bottleneck? I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?
I would use prstat to verify whether it's on CPU (and in userland) very much. If so, then I'd profile with:
# dtrace -n 'profile-97/pid == $YOURPID && arg1/{ @[ustack(80, 8192)] = count(); }' -c 'sleep 60' > stacks.out
This will profile at 97Hz for 60 seconds. Then I'd copy this to your laptop and use stackvis:
# stackvis dtrace flamegraph-d3 < stacks.out > stacks.htm
to generate a flame graph. You can open this in your browser.
There are a few notes about this process:
- You can generate a static SVG flame graph instead of the d3/HTML version with:
stackvis dtrace flamegraph-svg < stacks.out > stacks.svg
- If the stacks look truncated (i.e., they don't bottom out at _start in libc or the like), then you may need to increase the parameters to ustack (which control how many stack frames to record and how much buffer space to use, respectively).

I kicked off a test run of the latest commit in us-east (409a14c) in our ops zone there.
- I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.
After writing out around 50,000 instructions, the process held a steady RSS of about 56M (64M SIZE):
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
494351 root 64M 56M sleep 1 0 0:00:09 0.0% node/11
Here's an example of the output of prstat -p <PID> -mLc 1:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
494351 root 1.4 0.3 0.0 0.0 0.0 0.0 98 0.1 46 35 359 0 node/1
494351 root 0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 18 3 48 0 node/9
494351 root 0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 16 1 48 0 node/7
494351 root 0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 17 1 48 0 node/8
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 12 0 30 0 node/10
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/11
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/6
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/5
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/4
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/3
494351 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/2
Total: 1 processes, 11 lwps, load averages: 0.03, 0.08, 0.08
Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph.htm. If I'm reading it correctly, I think this process is bottlenecked on CRC checksum validation when decoding fast message buffers in the node-moray client.
This data is from a process assigned one shard in us-east. Roughly, the process was able to write 300,000 instructions to a listing file in 45 minutes of execution. At this rate, listing 1,000,000 instruction objects would take 150 minutes (around 2.5 hours).
Here are some datapoints from a lister process running with two assigned shards in us-east:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
544199 root 68M 61M sleep 59 0 0:00:33 0.1% node/11
RSS holds steady at 61M (slightly higher than the process running with 1 shard). This process also spends slightly more time in user mode:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
544199 root 3.6 0.7 0.1 0.0 0.0 0.0 95 0.3 130 84 975 0 node/1
544199 root 0.0 0.2 0.0 0.0 0.0 100 0.0 0.1 65 4 192 0 node/9
544199 root 0.0 0.2 0.0 0.0 0.0 100 0.0 0.1 48 13 142 0 node/10
544199 root 0.0 0.2 0.0 0.0 0.0 100 0.0 0.0 59 2 173 0 node/11
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/8
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/7
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/6
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/5
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/4
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/3
544199 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/2
Total: 1 processes, 11 lwps, load averages: 0.13, 0.16, 0.14
Here's the flamegraph for a process with two shards assigned to it: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-2-shards.htm.
After 17 minutes of execution, the process juggling 2 shards wrote out 395,549 instructions (this is a sum across multiple storage nodes).
4-shards: User time is substantially higher with four shards assigned:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
561881 root 13 1.8 0.1 0.0 0.0 0.0 84 0.8 898 281 5K 0 node/1
561881 root 0.1 5.9 0.0 0.0 0.0 94 0.0 0.3 274 115 811 0 node/8
561881 root 0.2 5.5 0.0 0.0 0.0 94 0.0 0.4 259 153 737 0 node/9
561881 root 0.1 4.9 0.0 0.0 0.0 95 0.0 0.2 223 49 659 0 node/10
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/11
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/7
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/6
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/5
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/4
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/3
561881 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/2
Total: 1 processes, 11 lwps, load averages: 0.79, 0.47, 0.28
RSS is not significantly affected:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
561881 root 65M 58M sleep 59 0 0:00:08 0.1% node/11
Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-4-shards.htm
8-shards: There's significantly more USR time here:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
565813 root 17 2.4 0.2 0.0 0.0 0.0 78 1.7 1K 534 7K 0 node/1
565813 root 0.2 1.9 0.0 0.0 0.0 98 0.0 0.3 366 62 1K 0 node/10
565813 root 0.2 1.8 0.0 0.0 0.0 98 0.0 0.2 355 36 1K 0 node/9
565813 root 0.2 1.7 0.0 0.0 0.0 98 0.0 0.3 344 66 987 0 node/11
565813 root 0.1 1.7 0.0 0.0 0.0 98 0.0 0.3 323 58 926 0 node/8
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/7
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/6
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/5
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/4
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/3
565813 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 node/2
Total: 1 processes, 11 lwps, load averages: 0.43, 0.28, 0.20
But RSS seems to stabilize without much growth compared to the 4-shards case:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
565813 root 76M 71M sleep 58 0 0:00:54 0.7% node/11
Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-8-shards.htm
I'm filing this issue as a place to put feedback from my initial review. I reviewed commit d5fb9d6f5f17f95773eebf846f16120f9423b3ea.
First, this is great -- a lot of work for a short time! Most of my major feedback relates to performance and operational considerations.
Here are the major issues I think we definitely want to address:
Here are the operational considerations we'll need to answer:
From the above: I think it would be safest if we can deploy this to a new CN on the Manta network.
I'm not sure how big an issue this will be: