I am not having any trouble with the code: it works whether I give a relative or absolute path to the zip, and whether I use hdfs_home or not.
I would look in the resource manager or node manager logs to see if there is a java traceback. The only thing I can immediately think of is a permissions error (in the previous version, I'm not sure the file was being uploaded to the right location). Which symlinks are broken, above, and where is the directory.info output coming from?
I would look in the resource manager or node manager logs to see if there is a java traceback.
Nothing stands out in the resource manager logs. I was unable to get the nodemanager logs.
The only thing I can immediately think of is a permissions error (in the previous version, I'm not sure the file was being uploaded to the right location).
In both the failing and successful (old commit) cases the file is uploaded to hdfs in the same location with the same permissions.
Which symlinks are broken, above, and where is the directory.info output coming from?
After reading the log a bit more closely (the bash commands that generate it are part of the log), the Broken Symlinks line is printed first as a header, and any broken symlinks are then printed below it. In both the working and failing cases there are none.
The directory.info log is the first part of every aggregated log - it displays the whole file tree in the working directory of each container. In the working case I see the whole contents of the unzipped file, while in the failing case I only see the knit jar and a few static files.
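A minimal sketch of how those aggregated logs can be pulled once the application finishes (assuming log aggregation is enabled and the standard yarn CLI is on the PATH; the application id here is only a placeholder):

```python
import subprocess

# Placeholder -- use the application id reported by knit/YARN for the failed run.
app_id = "application_<cluster_timestamp>_<id>"

# `yarn logs -applicationId <id>` dumps the aggregated logs for every container,
# including the directory.info listing discussed above.
logs = subprocess.check_output(["yarn", "logs", "-applicationId", app_id], text=True)
print(logs)
```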
After debugging some more, I'm a bit confused. From reading the code, the -1000 exit status is returned if the container is not completed (it's the invalid value, defined here). However, the onContainersCompleted method should only be called on containers that are completed, and the logs confirm that the state is COMPLETED. I tried patching knit to also ignore -1000 exit statuses, in case those were somehow actually valid, but no luck. Perhaps the record inconsistency here is due to a silenced error elsewhere causing a failure to update the status? Anyway, this seems to indicate something is odd, but the -1000 value itself doesn't seem to carry any useful information.
I am also seeing ExitStatus: -1000 with environments that worked with knit 2.2
btw: -1000 seems to be the exit code that gets assigned to a container before attempting to run any command, and in this case, the command was never executed, so the code remained unchanged. Not all that useful!
Using knit at commit 6c2550cd89d55e9f83233db1f53684e33366478f, the following succeeds:
Using the new knit 0.2.3 release, the following (which should be equivalent) starts the application, but the application then fails.
Note that this succeeds if I omit the files kwarg. Also note that the files are uploaded to the proper location on hdfs, and the log lines reference the proper hdfs locations.
The logs seem to be unhelpful for debugging here; I'm not sure if this is just how yarn is, or if there is something knit could do to make debugging easier. There are no Java tracebacks, the application just fails.
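For context, a rough sketch of the kind of call involved (this is illustrative, not the exact snippet from the report; the connection setup is assumed and may not match knit 0.2.3 exactly, but files is the kwarg whose presence triggers the failure):

```python
from knit import Knit

# Connection parameters omitted -- defaults assume the local YARN/HDFS config is discoverable.
k = Knit()

# Launch a container running a trivial command, shipping a zipped file via the
# `files` kwarg; the zip is uploaded to HDFS and should then be localized into
# each container's working directory.
app_id = k.start(
    "python -c 'import sys; print(sys.version)'",
    files=["environment.zip"],
)
```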
The few lines I found that may help with debugging:
- directory.info ends without ever listing the uploaded files. Note that in both the working version and this version the broken symlinks line exists.
- The Container completed ContainerStatus: [...] line indicates exit code -1000. Googling this code doesn't turn up any results, so I'm not sure what it means.
I'm not sure what else I can do to help debug here. For now I'm relying on a custom build of an older commit.