Closed sebenste closed 3 years ago
Are there any relevant messages in the LDM log file?
Unfortunately, no.
Gilbert
From: Steven Emmerson @.> Sent: Saturday, March 27, 2021 12:27 PM To: Unidata/LDM @.> Cc: Gilbert Sebenste @.>; Author @.> Subject: Re: [Unidata/LDM] Performing an "ldmadmin restart"" in LDM 6.13.13 and previous can result in queue corruption, data loss (#89)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/Unidata/LDM/issues/89#issuecomment-808765924, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFLWO2MTEJW76FGSUWHNWN3TFYIOBANCNFSM4Z4QBRPA.
Is there any evidence of what the problem might be? Does an "ldmadmin restart" indicate anything?
No. That’s why this has been going on for…well, decades. I can’t reproduce it consistently, and it doesn’t happen often. All I know is, to trip the bug, you have to do an ldmadmin restart. Sometimes it takes one time to do, sometimes a bunch of times, and then sometimes, never. I can’t see any pattern to this.
An ldmadmin restart doesn’t indicate anything, but the queue becomes corrupt minutes to hours (or even a day) after it gets restarted. That’s the crazy thing about this…
How have you determined that the queue becomes corrupt?
Products stop writing to the disk. Whenever there is queue corruption, that always happens. And, remaking the queue and restarting the LDM always fixes it, without exception.
Gilbert
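The symptom described above (products no longer being written to disk) can be checked for mechanically. The following is only a rough sketch: `recent_products` is a hypothetical helper name, and it assumes a find(1) that supports `-mmin`, which GNU find on CentOS 7 does. It does not inspect the queue itself, only whether output files are still arriving.

```shell
# Returns 0 if at least one regular file under directory $1 was modified
# within the last $2 minutes, and 1 otherwise.
# Assumption: find(1) supports -mmin (true for GNU and BSD find).
recent_products() {
    dir=$1
    minutes=$2
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}
```

For example, `recent_products /path/to/data 15 || echo "no new products in 15 minutes"`, where the directory is a placeholder for wherever pqact files products on a given server.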
Do you have any processes running outside of the LDM's process group (i.e., not executed by an EXEC entry in the LDM configuration-file) that insert products into the queue?
It happens on servers that do and do not insert products into the queue. It will even do it on our NOAAport ingester at our dish.
Gilbert
Products stop writing to the disk. Whenever there is queue corruption, that always happens.
If products stop being written to disk, then there should be at least one associated log message (for example, indicating that a pqact(1) process terminated). Would you please check again?
Unfortunately, the log got wiped out now so I can’t check. But, I will double check when it happens again. Sorry about that…
Gilbert
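Since the log was lost before it could be checked, one option is to copy it aside before remaking the queue or restarting. This is a minimal sketch, not part of the LDM: `snapshot_log` is a hypothetical helper, and both the log path and the destination directory are supplied by the caller because they vary between installations.

```shell
# Copies log file $1 into directory $2 under a timestamped name and
# prints the path of the copy. Both arguments are caller-supplied;
# nothing here assumes where the LDM keeps its log.
snapshot_log() {
    src=$1
    dest_dir=$2
    stamp=$(date +%Y%m%dT%H%M%S)
    dest="$dest_dir/$(basename "$src").$stamp"
    mkdir -p "$dest_dir" && cp -p "$src" "$dest" && echo "$dest"
}
```

Running this just before the stop, remake, and start sequence would keep the messages leading up to the failure available for inspection afterwards.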
This was fixed in 6.13.14. Closing ticket.
OS: CentOS 7, fully updated
This bug has actually been around for many years, but I was hoping it was vanquished. I guess not...
We were bit by this bug this evening. After an "ldmadmin restart", the LDM starts normally, but within hours the queue becomes corrupt and data stops being written to physical media. You must then stop the LDM, remake the queue, and start it again. This fixes the issue. Whenever you delete the queue after stopping the LDM, everything is fine upon restart. But occasionally, if you only do an "ldmadmin restart", the queue becomes corrupt. This is more likely to happen, in my experience, if:
You are running a high-volume feed with a high file count and large file sizes (think Level2 radar or CONDUIT)
You do multiple restarts of the LDM, spaced hours or more apart
This does NOT happen, ever, if the queue is deleted and remade before restarting the LDM, even if you do this:
ldmadmin stop
ldmadmin clean
ldmadmin delqueue
ldmadmin restart
Or this:
ldmadmin stop
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start
It only happens when doing a straight "ldmadmin restart" command, nothing before or after it. Furthermore, it may not happen until hours after a restart.
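The second workaround sequence above could be wrapped in a small function so that a plain "ldmadmin restart" is never used by accident. This is a sketch under assumptions: `safe_restart` and the `LDMADMIN` override variable are hypothetical names introduced here, and only the ldmadmin subcommands already listed in this report are used. Note that deleting the queue discards whatever products it currently holds.

```shell
#!/bin/sh
# Sketch of the "remake the queue" workaround from this report.
# LDMADMIN is a hypothetical override so the steps can be dry-run;
# by default it is the real ldmadmin command.
LDMADMIN="${LDMADMIN:-ldmadmin}"

# Stop the LDM, delete and recreate the product queue, then start again.
# Stops at the first failing step and returns that step's exit status.
safe_restart() {
    for step in stop delqueue mkqueue start; do
        echo "running: $LDMADMIN $step"
        $LDMADMIN $step || {
            status=$?
            echo "step '$step' failed with status $status" >&2
            return "$status"
        }
    done
}
```

Calling `safe_restart` as the LDM user runs the four steps in order. Setting `LDMADMIN="echo ldmadmin"` first prints the commands instead of running them.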