daisy / pipeline

Super-project that aggregates all Pipeline-related code, provides a common tracker for Pipeline-related issues and holds the Pipeline website.
http://daisy.github.io/pipeline

Felix error after web service hung and restarted #195

Open ghost opened 10 years ago

ghost commented 10 years ago

From john.bru...@gmail.com on July 19, 2012 17:39:02

What steps will reproduce the problem?

  1. Start the Pipeline in remote mode, in the background (bin/pipeline2 remote &)
  2. Run conversions through dtbook-to-epub3 until the service gets stuck (this is the step that is hard to reproduce reliably)
  3. Stop the web service so that it can shut down gracefully (kill -15 pid)
  4. Start up the web service again
  5. Run another conversion of a book that previously succeeded through dtbook-to-epub3.

What is the expected output? What do you see instead?

I expect a book that was previously converted successfully to convert successfully again. Instead, if the web service had been hung by another book, the conversion fails after the restart. At some point during the conversion an error like this appears (a more complete stack trace is in the attached log file):

2012-07-18 23:26:13,509 [INFO ] o.d.c.xproc.calabash.steps.Message - bundle://56.0:1/xml/upgrade-dtbook/upgrade-dtbook.xpl:90:25:Message:File is already the most recent version: 2005-3
2012-07-18 23:26:13,672 [INFO ] o.d.c.xproc.calabash.steps.Message - bundle://56.0:1/xml/upgrade-dtbook/upgrade-dtbook.xpl:90:25:Message step !1.8.4.2 read file:/opt/pipeline2/current/data/7b0d123f-23da-4926-a35b-e2d8389d9b7a/context/The_Lord_of_the_Rings.xml
2012-07-18 23:26:13,675 [DEBUG] o.d.p.m.resolver.ModuleUriResolver - No module found for uri:bundle://56.0:1/xml/upgrade-dtbook/schema/dtbook-2005-3.rng
2012-07-18 23:26:19,027 [DEBUG] o.d.p.m.resolver.ModuleUriResolver - No module found for uri:bundle://54.0:1/xml/schema/dtbook-2005-3.rng
2012-07-18 23:26:21,612 [DEBUG] o.d.p.m.resolver.ModuleUriResolver - No module found for uri:bundle://54.0:1/xml/rename-to-span.xsl
2012-07-18 23:26:26,945 [ERROR] org.daisy.pipeline.job.Job - job finished with error state
java.lang.NullPointerException: null
        at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1432) ~[felix.jar:na]
        at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:72) ~[felix.jar:na]
        at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1843) ~[felix.jar:na]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266) ~[na:1.6.0_24]
        at sun.misc.Unsafe.defineClass(Native Method) ~[na:1.6.0_24]
        at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63) ~[na:1.6.0_24]

The only reference to this kind of error I found when Googling was an existing Felix issue, https://issues.apache.org/jira/browse/FELIX-3477.

What version of the product are you using? On what operating system?

Pipeline2 v1.3-beta, running on Ubuntu 11

Please provide any additional information below.

The attached log file is the console log covering many restarts. The error first appears after the service is restarted at timestamp 19:06:09, following a period in which the service had been hung for half an hour in the middle of a job.

The hang is easy to reproduce in the sense that it happens fairly often, after just a few books have been sent through, but hard to reproduce in that it does not necessarily happen at the same point or with the same book. One situation that appears to trigger it is sending through a large book that exhausts the heap.

As a workaround, we delete all of the files in data/felix-cache before restarting after a hang. When we do this, we do not see the Felix errors.
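For scripted restarts, the workaround could be automated along these lines. This is a minimal sketch, not part of the Pipeline: FelixCacheCleaner is a hypothetical helper name, the default path /opt/pipeline2/current/data/felix-cache is an assumption inferred from the paths in the log above, and it assumes the service has already been stopped before the cache is cleared.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public final class FelixCacheCleaner {

    /**
     * Deletes everything inside cacheDir but keeps the directory itself.
     * Returns the number of entries deleted; a missing directory is a no-op.
     * Assumes the Pipeline service is NOT running while this executes.
     */
    public static int clearCache(Path cacheDir) throws IOException {
        if (!Files.isDirectory(cacheDir)) {
            return 0;
        }
        Path[] entries;
        try (Stream<Path> walk = Files.walk(cacheDir)) {
            // Reverse lexical order puts children before their parent directories,
            // so each directory is already empty when it is deleted.
            entries = walk.filter(p -> !p.equals(cacheDir))
                          .sorted(Comparator.reverseOrder())
                          .toArray(Path[]::new);
        }
        for (Path p : entries) {
            Files.delete(p);
        }
        return entries.length;
    }

    public static void main(String[] args) throws IOException {
        // Assumed install location (taken from the log); pass another path as the first argument.
        Path cache = Paths.get(args.length > 0 ? args[0] : "/opt/pipeline2/current/data/felix-cache");
        System.out.println("Deleted " + clearCache(cache) + " entries from " + cache);
    }
}
```

Run it between stopping the service (kill -15) and starting it again, e.g. `java FelixCacheCleaner /path/to/data/felix-cache`.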

Attachment: daisy-pipeline-felix-error.log

Original issue: http://code.google.com/p/daisy-pipeline/issues/detail?id=195

ghost commented 10 years ago

From rdeltour@gmail.com on September 03, 2012 14:06:24

Status: Accepted

ghost commented 10 years ago

From rdeltour@gmail.com on September 07, 2012 05:09:52

I did face the same issue recently, although I can't reproduce it easily. I'm still interested in a systematic repro if anyone can come up with one.

ghost commented 10 years ago

From rdeltour@gmail.com on September 07, 2012 05:33:11

Yay, I did find a repro after all:

  1. Start the Pipeline service in developer mode (with the shell bundles)
  2. Find the ID of the "DAISY Pipeline 2 :: Push Notifier" bundle (using the ps Felix shell command)
  3. Update the bundle (e.g. update 45)
  4. Run a script

The issue is probably that the bundle forgets to unregister itself from the EventBus when it stops. When a script is run later, the bus still tries to dispatch the message to the old listener, which means loading a class through the disposed bundle wiring.

We need to double-check every service's activate/deactivate methods to make sure that earlier registrations are properly unregistered.
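The register/unregister pairing described above can be illustrated with a small simulation. This is a hedged sketch under stated assumptions: SimpleEventBus, NotifierComponent and the unregisterOnDeactivate flag are hypothetical names, not the Pipeline's actual EventBus or Push Notifier API; activate/deactivate only mimic OSGi Declarative Services lifecycle hooks, and the IllegalStateException merely stands in for the class-loading NPE in the log.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public final class EventBusRegistrationSketch {

    /** Hypothetical minimal event bus; not the Pipeline's actual EventBus API. */
    static final class SimpleEventBus {
        private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

        void register(Consumer<String> listener) { listeners.add(listener); }
        void unregister(Consumer<String> listener) { listeners.remove(listener); }
        int listenerCount() { return listeners.size(); }

        void post(String event) {
            for (Consumer<String> listener : listeners) {
                listener.accept(event);
            }
        }
    }

    /** Hypothetical component with Declarative Services style activate/deactivate hooks. */
    static final class NotifierComponent {
        private final SimpleEventBus bus;
        private final boolean unregisterOnDeactivate; // false models the suspected bug
        // A single listener instance, so unregister() removes exactly what register() added.
        private final Consumer<String> listener = this::onEvent;
        private boolean active;
        int received;

        NotifierComponent(SimpleEventBus bus, boolean unregisterOnDeactivate) {
            this.bus = bus;
            this.unregisterOnDeactivate = unregisterOnDeactivate;
        }

        void activate() {
            active = true;
            bus.register(listener);
        }

        void deactivate() {
            if (unregisterOnDeactivate) {
                bus.unregister(listener); // undo the registration made in activate()
            }
            active = false;
        }

        private void onEvent(String event) {
            // Simulation only: a stopped bundle can no longer load classes.
            if (!active) {
                throw new IllegalStateException("event delivered to a deactivated component");
            }
            received++;
        }
    }

    public static void main(String[] args) {
        SimpleEventBus bus = new SimpleEventBus();
        NotifierComponent old = new NotifierComponent(bus, true);
        old.activate();
        old.deactivate();                 // "update 45" stops the old instance...
        NotifierComponent fresh = new NotifierComponent(bus, true);
        fresh.activate();                 // ...and activates a new one
        bus.post("job finished");         // running a script posts an event
        System.out.println(bus.listenerCount() + " listener(s), fresh received " + fresh.received);
        // prints "1 listener(s), fresh received 1"
    }
}
```

With unregisterOnDeactivate set to false, the old listener stays on the bus after the update, so the next post reaches the deactivated component first and fails, mirroring the repro steps.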

ghost commented 10 years ago

From rdeltour@gmail.com on September 07, 2012 05:37:02

Owner: rdeltour@gmail.com

bertfrees commented 7 years ago

Benetech have found a workaround, so this issue isn't super important. However, since Romain has found a repro, it shouldn't take much effort to check whether it is still an issue. If possible, it would be good to add a Pax Exam test.