Calibrate nprocexit to avoid atop coredumps unexpectedly

ShirleyFei commented 1 year ago

There's a case that the shadow file becomes larger than its limit, e.i. (maxshadowrec * acctrecsz), and then suddenly atopacctd.service restarts suddenly.

In current logic supposing atopacctd.service does not restart, we will continue to read a new shadowpath whose sequence is one greater than the fulled one. But this does not work if atopacctd.service restarts, as the new shadowpath's sequence number will be started with zero, leading to a negative value when comparing with the fulled one. And thus return the unsigned long 'nprocexit' as a very huge positive number.

What's more, after acctphotoproc(), we get a wrong acctsize (including reset acctsize to zero), and thus a dummy huge 'curpexit' struct whose partial attributes like gen/cpu/mem/disk are still zeroed. Then later, the same as 'devtstat' struct. Finally, these zeroed attributes will cause a glibc sort() coredump like "*a = 0" during compcpu(), compmem() and so on.

To fix this case,

firstly, return numrecs as zero once detecting atopacctd.service restarts or other bad cases, regarding as an incomplete sample;
secondly, to avoid other bad cases, calibrate nrexit during acctphotoproc(), re-calculate the real 'nprocexit' to avoid the "*a = 0" error.

ShirleyFei commented 1 year ago

@Atoptool Dear maintainer, this patch just return 'nprocexit' as zero to avoid the coredump once bad cases like 'atopacct.service restarts but atop.service not' occur, but does not really calibrate the exited process numbers.

IMHO, to recollect the real 'nprocexit', we have to reacquire the true acctfd, just like the patch: https://github.com/bytedance/atop/tree/recollect-nprocexit-by-reacquire-acctfd does (not a pr yet).

But I have no idea if you are agree with the reacquiring plan, or you believe users have to restart atop.service after atopacct.service restarts and if not we are not responsible for this.

Please help to give some suggestions on this, thanks! (And if the 'reacquiring' is acceptable, I will raise another pr including these two patches then. :)

Have a nice day, thanks

Atoptool commented 1 year ago

The rotation mechanism of the process accounting files is rather complex so I really want to verify what the consequences are. I will have a careful look at this and react afterwards.

Atoptool commented 1 year ago

Sorry for the delay. I will answer next week.

liutingjieni commented 1 year ago

There's a case that the shadow file becomes larger than its limit, e.i. (maxshadowrec * acctrecsz), and then suddenly atopacctd.service restarts suddenly.

In current logic supposing atopacctd.service does not restart, we will continue to read a new shadowpath whose sequence is one greater than the fulled one. But this does not work if atopacctd.service restarts, as the new shadowpath's sequence number will be started with zero, leading to a negative value when comparing with the fulled one. And thus return the unsigned long 'nprocexit' as a very huge positive number.

What's more, after acctphotoproc(), we get a wrong acctsize (including reset acctsize to zero), and thus a dummy huge 'curpexit' struct whose partial attributes like gen/cpu/mem/disk are still zeroed. Then later, the same as 'devtstat' struct. Finally, these zeroed attributes will cause a glibc sort() coredump like "*a = 0" during compcpu(), compmem() and so on.

To fix this case,

firstly, return numrecs as zero once detecting atopacctd.service restarts or other bad cases, regarding as an incomplete sample;

secondly, to avoid other bad cases, calibrate nrexit during acctphotoproc(), re-calculate the real 'nprocexit' to avoid the "*a = 0" error.

There's a case that the shadow file becomes larger than its limit, e.i. (maxshadowrec * acctrecsz), and then suddenly atopacctd.service restarts suddenly.

In current logic supposing atopacctd.service does not restart, we will continue to read a new shadowpath whose sequence is one greater than the fulled one. But this does not work if atopacctd.service restarts, as the new shadowpath's sequence number will be started with zero, leading to a negative value when comparing with the fulled one. And thus return the unsigned long 'nprocexit' as a very huge positive number.

What's more, after acctphotoproc(), we get a wrong acctsize (including reset acctsize to zero), and thus a dummy huge 'curpexit' struct whose partial attributes like gen/cpu/mem/disk are still zeroed. Then later, the same as 'devtstat' struct. Finally, these zeroed attributes will cause a glibc sort() coredump like "*a = 0" during compcpu(), compmem() and so on.

To fix this case,

firstly, return numrecs as zero once detecting atopacctd.service restarts or other bad cases, regarding as an incomplete sample;

secondly, to avoid other bad cases, calibrate nrexit during acctphotoproc(), re-calculate the real 'nprocexit' to avoid the "*a = 0" error.

When reading logs and attempting to continuously scroll through thread information by pressing 'y' and using CTRL + F to page forward, a core error occurs.

The reason is that "nproexit" became 62415, which caused "nprocall" and "ntaskall" to be greater than 62415. When executing the rawread(), the values of devtstat.ntaskall and devtstat.ntaskactive are incorrect. This leads to an error in the value of ncurlist in the generic_samp() when pressing CTRL+F for the second time: ntotal = devtstat->ntaskactive; (showgeneric.c:963) j = ntotal; (showgeneric.c:989) ncurlist = j; (showgeneric.c:1041) This ultimately results in an incorrect value for curlist and ncurlist, leading to a core error when dereferencing curlist in the priproc function.

Atoptool commented 1 year ago

@Atoptool Dear maintainer, this patch just return 'nprocexit' as zero to avoid the coredump once bad cases like 'atopacct.service restarts but atop.service not' occur, but does not really calibrate the exited process numbers.

IMHO, to recollect the real 'nprocexit', we have to reacquire the true acctfd, just like the patch: https://github.com/bytedance/atop/tree/recollect-nprocexit-by-reacquire-acctfd does (not a pr yet).

But I have no idea if you are agree with the reacquiring plan, or you believe users have to restart atop.service after atopacct.service restarts and if not we are not responsible for this.

Please help to give some suggestions on this, thanks! (And if the 'reacquiring' is acceptable, I will raise another pr including these two patches then. :)

Have a nice day, thanks

I do agree with the reacquiring plan because it is not desirable that the atop.service has to be restarted manually at the moment that atopacct.service has been restarted. Could you please raise another PR with the two patches?

ShirleyFei commented 1 year ago

@Atoptool Dear maintainer, this patch just return 'nprocexit' as zero to avoid the coredump once bad cases like 'atopacct.service restarts but atop.service not' occur, but does not really calibrate the exited process numbers. IMHO, to recollect the real 'nprocexit', we have to reacquire the true acctfd, just like the patch: https://github.com/bytedance/atop/tree/recollect-nprocexit-by-reacquire-acctfd does (not a pr yet). But I have no idea if you are agree with the reacquiring plan, or you believe users have to restart atop.service after atopacct.service restarts and if not we are not responsible for this. Please help to give some suggestions on this, thanks! (And if the 'reacquiring' is acceptable, I will raise another pr including these two patches then. :) Have a nice day, thanks

I do agree with the reacquiring plan because it is not desirable that the atop.service has to be restarted manually at the moment that atopacct.service has been restarted. Could you please raise another PR with the two patches?

Sure, I just raised a new https://github.com/Atoptool/atop/pull/254 pr to accomplish the reacquiring plan. Please help to review, thanks. :)

Atoptool commented 1 year ago

PR #254 has been merged.

Atoptool / atop

Calibrate nprocexit to avoid atop coredumps unexpectedly #244