Closed by zstumgoren 9 years ago
This turned out to be a bug in how our VirtualBox environment is configured, possibly involving the NAT settings inside guest machines. We seem to have resolved this for the time being. Sorry for the trouble!
Problem description
A remote node is unable to finish processing a DocumentImport Action as part of a standard upload using the DCloud web interface. But that same file is processed successfully if the crowd node is running on the same machine as the crowd server.
This bug is similar (or possibly identical) to #42.
Environmental context
We've only encountered this bug in a local virtual environment used as a staging platform for our production deployments.
Debug details
After uploading a document manually through the DCloud web interface, the crowd server appears to allocate work units successfully to a lone remote node. Using debug statements (see below), we've determined that the server appears to get back a successful response from its POST request to the node (namely, the process ID of the forked worker on the remote node).
However, the job on the remote node never completes -- no file artifacts are written to disk and the OpCenter reports the job as SPLITTING. This status remains indefinitely until we kill the processes and clean up the db records manually.
As part of the debugging process, we've repeatedly verified that the server can reach the node and the node can reach the server. We confirmed this with telnet and ping, and by the fact that the remote node is able to check in initially and has work units distributed to it.
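For reference, the kind of reachability check described above can be scripted so it is easy to repeat from both machines. This is only a minimal sketch in Python, not part of the project: the function name `can_connect` is hypothetical, and the host and port would be whatever the server and node actually listen on.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: roughly what `telnet host port` tells you.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run once on the server against the node's address, and once on the node against the server's address. A successful handshake only shows that the port is reachable; it does not show that the address one side advertises to the other is the one that works through NAT.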
We've been able to get the crowd server to log extra details by sprinkling some print statements into _NodeRecord.send_workunit. We can also hit (and log from) the crowd node's heartbeat/ endpoint (see below). However, we're unable to successfully log any information from the forked Worker processes (again, see below).
Is it possible that the forked process is somehow dying immediately after being spawned??? It's the only thing we can think of that would explain this issue and, most significantly, the fact that no logging is performed by the forked worker even though its PID is returned to the server.
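One way to test the "dies immediately" theory in isolation is to fork a child that logs straight to a file, then poll it briefly from the parent without blocking. The following is a minimal sketch in Python under stated assumptions: `spawn_and_check`, `child_fn` and `log_path` are hypothetical names for illustration, and this is not how the crowd node itself spawns workers.

```python
import os
import time


def spawn_and_check(child_fn, log_path, grace_seconds=0.5):
    """Fork a child that runs child_fn and report whether it exits early.

    Returns (pid, status). status is None if the child is still running
    after grace_seconds, its exit code if it exited, or the negative
    signal number if it was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: log to a file, since stdout may not be attached to anything.
        exit_code = 0
        try:
            with open(log_path, "a") as log:
                log.write(f"worker {os.getpid()} started\n")
                log.flush()
                child_fn()
                log.write(f"worker {os.getpid()} finished\n")
        except BaseException as exc:
            exit_code = 1
            with open(log_path, "a") as log:
                log.write(f"worker {os.getpid()} failed: {exc!r}\n")
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(exit_code)

    # Parent: poll without blocking until the grace period has elapsed.
    deadline = time.monotonic() + grace_seconds
    while True:
        reaped_pid, raw_status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid != 0:
            if os.WIFSIGNALED(raw_status):
                return pid, -os.WTERMSIG(raw_status)
            return pid, os.WEXITSTATUS(raw_status)
        if time.monotonic() >= deadline:
            return pid, None
        time.sleep(0.05)
```

If the returned status is not None, the child died within the grace period, and the log file should show how far it got. A PID being handed back only proves that the fork succeeded, not that the child went on to do any work.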
Having said all that, it's also quite possible that we're not correctly debugging the forked process (fwiw, we have been restarting the crowd node and server processes when we manually update the source files with debug statements).
At this point, we're at a bit of a loss and could really use some guidance -- if nothing else, a sanity check on our debugging strategy and possible alternative techniques for isolating the issue.
Any advice/guidance is appreciated.
Thanks!
CODE DEBUG
We dropped logging statements into numerous sections of code in an effort to isolate the problem. We c