Closed akrherz closed 9 years ago
Daryl,
The modification times of the ".state" files are a red herring because the files are written when the LDM shuts down.
Each file identifies the last data-product that was successfully processed by the pqact(1) process associated with the similarly-named configuration-file (which is why pqact(1) processes and their configuration-files need to be one-to-one). pqact(1) uses this information to start processing where it left off (assuming the identified data-product is still in the product-queue).
Consequently, you shouldn't have encountered the problem you did unless something else was wrong.
Were "processed oldest product" messages being written to the LDM log file prior to this? This would violate the assumption at the end of the second paragraph.
Regards, Steve Emmerson
On Thu, May 7, 2015 at 6:43 AM, daryl herzmann notifications@github.com wrote:
With the current release, all pqact processes appear to write the same .state file and not a .state file that is accurate to that pqact process. For example, on a machine that has 9 pqact process, the written out .state file for each at shutdown time has the same timestamp!
Here's the problem. Yesterday, my system feeds NEXRAD2 from idd.unidata.ucar.edu and other feeds from idd.ssec.wisc.edu. The NEXRAD2 feed became very latent, with a feed latency at shutdown time of ~30 minutes. I have 7 pqact processes that only look at that feedtype, so their queue time should have been 30 minutes in the past. The IDS feedtype was current, so this resulted in .state files with the current timestamp . So when I started LDM back up, the 7 pqact processes ignored the incoming new NEXRAD2 data because the pqinsert times were within that recent 30 minute window. After some cussing, I realized what had happened, stopped LDM, removed the state files and started LDM again. Of course, data was missed during this period :(
Please modify pqact to write state files that represent where exactly that pqact process is within the LDM queue so that I don't loose data during restarts. Even in the normal case of a machine with multiple pqact processes, this would result in products being missed.
— Reply to this email directly or view it on GitHub https://github.com/Unidata/LDM/issues/36.
Hi Steve,
I am not referencing the file modification times, I am talking about the actual contents of the file.
[ldm@mtarchive-public etc]$ diff pqact-craft.conf.state pqact.conf.state
[ldm@mtarchive-public etc]$ $ cat *.state | grep -v '#'
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
[ldm@mtarchive-public etc]$ ps uax | grep pqact
ldm 21448 1.3 14.5 4035580 3583072 ? S May06 13:38 pqact /home/ldm/etc/pqact.conf
ldm 21449 0.4 14.5 4035052 3573024 ? S May06 4:54 pqact -p BZIP2/K[A-D] -f CRAFT /home/ldm/etc/pqact-craft.conf
ldm 21450 0.4 13.9 4035052 3434408 ? S May06 4:22 pqact -p BZIP2/K[E-H] -f CRAFT /home/ldm/etc/pqact-craft2.conf
ldm 21451 0.3 12.7 4035052 3130216 ? S May06 3:11 pqact -p BZIP2/K[I-K] -f CRAFT /home/ldm/etc/pqact-craft3.conf
ldm 21452 0.4 14.2 4035052 3502000 ? S May06 4:39 pqact -p BZIP2/K[L-O] -f CRAFT /home/ldm/etc/pqact-craft4.conf
ldm 21453 0.3 12.5 4035052 3085248 ? S May06 3:21 pqact -p BZIP2/K[P-R] -f CRAFT /home/ldm/etc/pqact-craft5.conf
ldm 21454 0.4 14.1 4035052 3469616 ? S May06 4:12 pqact -p BZIP2/K[S-Z] -f CRAFT /home/ldm/etc/pqact-craft6.conf
ldm 21455 0.2 9.9 4035052 2440172 ? S May06 2:39 pqact -p BZIP2/[A-J] -f CRAFT /home/ldm/etc/pqact-craft7.conf
ldm 21456 0.3 12.5 4035052 3096664 ? S May06 3:27 pqact -p BZIP2/[L-Z] -f CRAFT /home/ldm/etc/pqact-craft8.conf
Were "processed oldest product" messages being written to the LDM log file prior to this?
no.
That shouldn't be possible -- so an assumption must be invalid.
Would you please post the output of the following commands -- executed by the LDM user in their etc
subdirectory:
ls -l *.state*
ls -l *pqact*.conf
grep -i '^exec.*pqact' ldmd.conf
Hi Steve, every system I check has this same thing, each .state file has the same content timestamp within. Do your systems not show this?
[ldm@mtarchive-public etc]$ ls -l *.state*
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft2.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft3.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft4.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft5.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft6.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft7.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft8.conf.state
-rw-rw-r-- 1 ldm ldm 183 May 6 17:48 pqact-craft.conf.state
[ldm@mtarchive-public etc]$ ls -l *pqact*.conf
lrwxrwxrwx 1 ldm ldm 33 Jan 19 14:18 pqact.conf -> ../ldmconfig/mtarchive/pqact.conf
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft2.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft3.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft4.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft5.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft6.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft7.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft8.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:21 pqact-craft.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
[ldm@mtarchive-public etc]$ grep -i '^exec.*pqact' ldmd.conf
EXEC "pqact /home/ldm/etc/pqact.conf"
exec "pqact -p BZIP2/K[A-D] -f CRAFT /home/ldm/etc/pqact-craft.conf"
exec "pqact -p BZIP2/K[E-H] -f CRAFT /home/ldm/etc/pqact-craft2.conf"
exec "pqact -p BZIP2/K[I-K] -f CRAFT /home/ldm/etc/pqact-craft3.conf"
exec "pqact -p BZIP2/K[L-O] -f CRAFT /home/ldm/etc/pqact-craft4.conf"
exec "pqact -p BZIP2/K[P-R] -f CRAFT /home/ldm/etc/pqact-craft5.conf"
exec "pqact -p BZIP2/K[S-Z] -f CRAFT /home/ldm/etc/pqact-craft6.conf"
exec "pqact -p BZIP2/[A-J] -f CRAFT /home/ldm/etc/pqact-craft7.conf"
exec "pqact -p BZIP2/[L-Z] -f CRAFT /home/ldm/etc/pqact-craft8.conf"
Steve, I wonder about pqact/pqact.c, it appears current Cursor gets updated for each product LDM receives, regardless of if it was processed by pqact or not, since the status is always 0 from processProduct. That may explain why currentCursor would be the same for all pqact processes?
You appear to be right. processProduct()
always returns 0, which will cause the last data-product matching the -f feed
and -p pattern
options of pqact(1) to be used regardless of whether or not it matched any pqact(1)
configuration-file entries.
The workaround, therefore, is to set the -f feed
and -p pattern
options of all pqact(1)
entries in the LDM configuration-file to be consonant with the entries in the corresponding pqact(1)
configuration-file.
I'll see about fixing this in the next release.
Thanks for sending this in.
Thanks Steve, I am unsure of your workaround. Do you have any systems that do not exhibit this behavior currently? I added -p .* -f ANY
to the pqact process above that currently does not have those options set, but there was no change in the .state file result at restart.
Steve, maybe there is another bug here. I don't see how setting -p pattern
on the pqact process causes it to limit the number of products exposed to its conf file pattern actions? It seems like they all get evaluated by pq_sequence()
logic?
EDIT: yes, but prodInClass(clss, info)
will avoid the ifMatch
logic. Interesting, but the result is that every product ends up bumping the currentCursor
regardless of it if matches the command line args
I've modified processProduct()
so that it will only return 0
if the data-product was actually processed. This will prevent currentCursor
from being modified by data-products that merely match the -f
and -p
option arguments.
I wonder if this is the right thing to do, however. Should the "last processed product" attribute be determined by the -f
and -p
options or by configuration-file entries?
Steve, indeed, I am boggling this too. IMO, pqact
's state file should contain the queue insert time of the last processed product that matched a pqact.conf
entry. This is the point at which is should pick up processing again. If new pqact.conf
entries are added that process what would have been skipped products from a previous pqact.conf
entry, so be it. I think. I am still boggling pq_sequence()
and returning the problematic 0
and not processProduct()
I am still boggling
pq_sequence()
and returning the problematic0
and notprocessProduct()
I'm afraid I don't know what that means.
Steve, I believe the bug is located at line 672 of pqact/pqact.c
, here pq_sequence()
returns 0 regardless of if the product
any product will then bump currentCursor
maybe the solution then is to add another cursor to keep track of last-proccessed-product and then persist that value to the .state file?
It looks like pq_sequence()
returns 0
if the data-product doesn't match the data-product class specification.
I think the solution is to move the setting of currentCusor
into processProduct()
so that it's only modified if the data-product is processed.
Hi Steve, sounds to be good to me... I am been looking through the code more and it seems that thepqact
state file has two purposes.
Perhaps an immediate workaround here is to set pqact
flags -o
and -t
on startup. This would effectively decrement the timestamp found in the state file back to allow any new products that arrive from the upstream latent host to be processed. What do you think?
I'd have to analyze the code to determine if that's a viable workaround -- and if I'm going to do that, then I'd rather just fix the code.
It's worth a try, however.
Fixed in the next release.
Hi Steve, I rolled out 6.12.10 to one LDM host today and immediately started seeing lots of logs like this:
May 20 15:58:59 mesonet-director1 journal: pqact[13363] WARN: Processed oldest product in queue: 629.666 s
I reverted back to 6.12.9 and the messages are not happening anymore. I am wondering if this is related to this issue or perhaps the bug mentioned by Gilbert in the CHANGE_LOG.
Yeah. There seems to be a problem with the logic behind this fix. I'm working on it.
LDM 6.12.10 has a bug. Wait for version 6.12.11, which should be out today.
With the current release, all pqact processes appear to write the same .state file and not a .state file that is accurate to that pqact process. For example, on a machine that has 9 pqact process, the written out .state file for each at shutdown time has the same timestamp!
Here's the problem. Yesterday, my system feeds NEXRAD2 from idd.unidata.ucar.edu and other feeds from idd.ssec.wisc.edu. The NEXRAD2 feed became very latent, with a feed latency at shutdown time of ~30 minutes. I have 7 pqact processes that only look at that feedtype, so their queue time should have been 30 minutes in the past. The IDS feedtype was current, so this resulted in .state files with the current timestamp . So when I started LDM back up, the 7 pqact processes ignored the incoming new NEXRAD2 data because the pqinsert times were within that recent 30 minute window. After some cussing, I realized what had happened, stopped LDM, removed the state files and started LDM again. Of course, data was missed during this period :(
Please modify pqact to write state files that represent where exactly that pqact process is within the LDM queue so that I don't loose data during restarts. Even in the normal case of a machine with multiple pqact processes, this would result in products being missed.