aerospike / act

Aerospike Certification Tool

Error reporting is too noisy #46

Closed. Alb0t closed this issue 3 years ago.

Alb0t commented 4 years ago

I was running ACT tests at a cloud provider over the weekend. Today I noticed the machines are out of disk space.

#:~/act/config# grep -c 'Too many open files' act_250x.conf.out
58379616
#:~/act/config# grep -c '^ERROR' act_250x.conf.out
58433002

-rw-r--r-- 1 root root 3.8G Sep 30 16:36 act_250x.conf.out
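(For the record: 58,433,002 - 58,379,616 = 53,386, so roughly 53K of those errors are something other than 'Too many open files'.)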

Okay, I know error logging is important - but this is getting out of hand.

As far as I know, this error isn't helpful. It happens almost every time we push a system to its limits, especially on cloud systems, even when running as root with the nofile hard limit set to 1M:

# ulimit -Hn
1048576

This also causes problems with the latency interpreter, making it run very slowly and sometimes give bad results (I've seen it happen, but I don't have steps to reproduce the issue).

Can we request some changes here? I'm thinking:

1) Add a flag to ignore the open-files error, or just ignore it by default. Is it even valuable?
2) Write errors to a separate log file?
3) Aggregate errors instead of printing each individual error line?
4) Limit the number of errors reported, e.g. log 'repeated 23457x bajillion times' instead of each individual occurrence.

I know points #2/#3 would make it harder to troubleshoot ACT, so I think just having a flag to ignore this kind of error would be a good thing.
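
For #4, here's roughly what I'm picturing (a hypothetical sketch, not ACT's actual logging code):

/* Hypothetical sketch: de-duplicate consecutive identical errors and
 * only print one line per REPORT_EVERY repeats, with a count. */
#include <stdio.h>
#include <string.h>

#define REPORT_EVERY 10000

static char last_msg[256];
static unsigned long repeats = 0;

static void flush_repeats(void) {
    if (repeats > 0) {
        fprintf(stderr, "ERROR: %s (repeated %lu more times)\n", last_msg, repeats);
        repeats = 0;
    }
}

void log_error(const char* msg) {
    if (strcmp(msg, last_msg) == 0) {
        // Same error as last time - count it, report only periodically.
        if (++repeats % REPORT_EVERY == 0) {
            flush_repeats();
        }
        return;
    }

    // New error: flush any pending count for the old one, then print it.
    flush_repeats();
    fprintf(stderr, "ERROR: %s\n", msg);
    snprintf(last_msg, sizeof(last_msg), "%s", msg);
}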

volmarl commented 4 years ago

Additional info from @Alb0t:

The test seems to keep running after the errors. Log lines:

ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
ERROR: open device /dev/nvme6n1p3 errno 24 'Too many open files'
after 76068 sec:
requests-queued: 127
reads (37925431131 total)
 (00: 36841027520) (01: 0610748743) (02: 0298053584) (03: 0126809437)
 (04: 0043169632) (05: 0005499635) (06: 0000120535) (07: 0000002045)
device-reads (37925431135 total)
 (00: 36961863750) (01: 0541884396) (02: 0261301071) (03: 0114389853)
 (04: 0040695022) (05: 0005192425) (06: 0000104372) (07: 0000000246)
/dev/nvme0n1p1-reads (1188491167 total)
 (00: 1188319995) (01: 0000135598) (02: 0000029527) (03: 0000006018)
 (04: 0000000024) (05: 0000000003) (06: 0000000002)
/dev/nvme0n1p2-reads (1188468802 total)
 (00: 1188284183) (01: 0000139646) (02: 0000037642) (03: 0000007317)
 (04: 0000000012) (05: 0000000001) (06: 0000000001)
/dev/nvme0n1p3-reads (1188504075 total)
num-devices: 32
service-threads: 2
num-queues: 96
threads-per-queue: 4
test-duration-sec: 86400
report-interval-sec: 1
microsecond-histograms: no
read-reqs-per-sec: 500000
write-reqs-per-sec: 250000
record-bytes: 1536
record-bytes-range-max: 0
large-block-op-kbytes: 128
replication-factor: 1
update-pct: 0
defrag-lwm-pct: 50
disable-odsync: no
commit-to-device: no
commit-min-bytes: 0
tomb-raider: no
tomb-raider-sleep-usec: 0
max-reqs-queued: 1000000
max-lag-sec: 10
scheduler-mode: noop

DERIVED CONFIGURATION
record-stored-bytes: 1536 ... 1536
internal-read-reqs-per-sec: 500000
internal-write-reqs-per-sec: 0
read-req-threads: 2
write-req-threads: 0
large-block-reads-per-sec: 5882.35
large-block-writes-per-sec: 5882.35
kportertx commented 4 years ago

I think we should fail the test immediately when an error occurs in get_fd. The test results are not reliable once this happens. It wouldn't make sense to 'handle' these errors, since we could end up failing the disk because a ulimit was too low.

This is the warning you are seeing: https://github.com/aerospike/act/blob/master/src/storage/act_storage.c#L963

Which could be within this call stack: https://github.com/aerospike/act/blob/master/src/storage/act_storage.c#L963

Which is counted as a valid read, which is wrong. As I mentioned above, if we 'handled' the error, we could fail the test claiming the disk couldn't keep up.
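
Roughly the shape of what I mean (a sketch, not the actual ACT code; the real fix would live in act_storage.c):

/* Hypothetical sketch: treat a failed device open as fatal, since the
 * results are unreliable once this happens. (The real ACT open flags
 * are elided here.) */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int get_fd_or_die(const char* dev_name) {
    int fd = open(dev_name, O_RDONLY);

    if (fd == -1) {
        // e.g. errno 24 'Too many open files'
        fprintf(stderr, "ERROR: open device %s errno %d '%s' - aborting, results are no longer valid\n",
                dev_name, errno, strerror(errno));
        exit(1);
    }

    return fd;
}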

Alb0t commented 4 years ago

@kportertx if this is causing an issue and making the test results inaccurate, I'm all for failing the test. I did notice a drop in the rate when using the -x switch on the interpreter. Is there any way we can come up with a formula for how many file handles would be needed? It would be nice if this could be a param in the ACT config and applied to the pid at launch time. Maybe even a formula/auto-set based on the rate.

kportertx commented 4 years ago

The formula for max fds should be:

fds = num_devices * 2 + num_queues * threads_per_queue * num_devices

if (tombraider) {
  fds += num_devices
}
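
With your config above (num-devices 32, num-queues 96, threads-per-queue 4, tomb-raider off), that works out to 32 * 2 + 96 * 4 * 32 = 12,352 fds.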

From this you should only be using about 13K fds max. Even with the ~53K other errors, each of which could have leaked an fd, we aren't close to 1M.

Are you sure the process is being allowed 1M fds? What is the current soft limit on this machine?

ulimit -Sn
Alb0t commented 4 years ago
# ps aux|grep bin/act
root     10068  888  0.0 8096764 26436 pts/1   Sl+  05:59 8331:11 ../target/bin/act_storage act_250x.conf
root     14942  0.0  0.0  14856  1104 pts/0    S+   21:36   0:00 grep --color=auto bin/act

# cat /proc/10068/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3062796              3062796              processes
Max open files            1024                 1048576              files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       3062796              3062796              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Is the soft limit the important parameter here?

kportertx commented 4 years ago

The hard limit is the max a user can raise the soft limit to without being root. The soft limit is the actual limit for the process.

Your soft limit is 1024 apparently:

Max open files 1024

Alb0t commented 4 years ago

But this is running as root.

kportertx commented 4 years ago

Yes. Root can raise the hard limit, but root's soft limit still applies to the processes it spawns.

I just wrote a small script to confirm this behavior, though I couldn't find a man page that specifies it.
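
Something along these lines (a hypothetical reconstruction, not the exact script):

/* Hypothetical sketch: open fds until EMFILE and compare the count
 * against the soft RLIMIT_NOFILE - it's the soft limit that bites. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);

    int n = 0;
    while (open("/dev/null", O_RDONLY) >= 0) {
        n++;
    }

    // With a soft limit of 1024, this stops around 1021 opens
    // (stdin/stdout/stderr already count), even with a 1M hard limit.
    printf("opened %d fds before errno %d; soft limit %llu, hard limit %llu\n",
           n, errno, (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}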

Alb0t commented 4 years ago

Confirmed, raising the soft limit seems to help. I didn't expect that. Is it possible for ACT to raise its own soft limit on launch? That would probably be good, along with the change you mentioned.
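
Something like this at startup, for example (a rough getrlimit/setrlimit sketch, untested, not actual ACT code):

/* Hypothetical sketch: raise the soft RLIMIT_NOFILE to the hard limit
 * at launch, so the soft limit can't silently cap the test. */
#include <stdio.h>
#include <sys/resource.h>

static void raise_fd_limit(void) {
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return;
    }

    rl.rlim_cur = rl.rlim_max;  // bump soft limit up to the hard limit

    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
    }
}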

kportertx commented 4 years ago

I believe managing system resource limits is a bit outside the scope of ACT :).

Alb0t commented 4 years ago

nonsense :) but okay deal

gooding470 commented 4 years ago

The one system setting we did try to manage was the scheduler, since in the old days leaving it not set to 'noop' could mysteriously lead to crappy numbers and take us a long time to figure out. However, we now get lots of complaints about trying to set it from inside ACT (it warns when it can't), and we will likely take that out.


gooding470 commented 4 years ago

That said, we need to redo ACT again, given Aerospike server 4.7 and the removal of transaction queues and threads. We'll consider all of this again.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/aerospike/act/issues/46?email_source=notifications&email_token=AAHIQCIEWCZ2CMG5G32P7NTQMOEZ7A5CNFSM4I35YPZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEACEWOY#issuecomment-537152315, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHIQCI7XQXZMKGU27IN3Q3QMOEZ7ANCNFSM4I35YPZQ .