dask / knit

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
http://knit.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Knit applications fail if files are uploaded in new release. #111

Closed by jcrist 6 years ago

jcrist commented 7 years ago

Using knit at commit 6c2550cd89d55e9f83233db1f53684e33366478f, the following succeeds:

knit = Knit(hdfs_home='/tmp/knit')
knit.start('env', env='myenv.zip')

Using the new knit 0.2.3 release, the following call (which should be equivalent) starts the application, but the application then fails.

knit = Knit(hdfs_home='/tmp/knit')
knit.start('env', files=['myenv.zip'])

Note that this succeeds if I omit the files kwarg. Also note that the files are uploaded to the proper location on hdfs, and the log lines reference the proper hdfs locations.

The logs are not much help for debugging here. I'm not sure whether this is just how YARN behaves or whether knit could do something to make debugging easier. There are no Java tracebacks; the application just fails.

The few lines I found that may help with debugging:

ls -l:
total 20
-rw-------. 1 yarn hadoop  166 Oct 30 12:13 container_tokens
-rwx------. 1 yarn hadoop  704 Oct 30 12:13 default_container_executor_session.sh
-rwx------. 1 yarn hadoop  758 Oct 30 12:13 default_container_executor.sh
lrwxrwxrwx. 1 yarn hadoop   66 Oct 30 12:13 knit.jar -> /hdfs/disk03/hadoop/yarn/local/filecache/166/knit-1.0-SNAPSHOT.jar
-rwx------. 1 yarn hadoop 4901 Oct 30 12:13 launch_container.sh
drwx--x---. 2 yarn hadoop   10 Oct 30 12:13 tmp
find -L . -maxdepth 5 -ls:
8597569075    4 drwx--x---   3 yarn     hadoop       4096 Oct 30 12:13 .
14182360386    0 drwx--x---   2 yarn     hadoop         10 Oct 30 12:13 ./tmp
8597569076    4 -rw-------   1 yarn     hadoop        166 Oct 30 12:13 ./container_tokens
8597569077    4 -rw-------   1 yarn     hadoop         12 Oct 30 12:13 ./.container_tokens.crc
8599231880    8 -rwx------   1 yarn     hadoop       4901 Oct 30 12:13 ./launch_container.sh
8599231881    4 -rw-------   1 yarn     hadoop         48 Oct 30 12:13 ./.launch_container.sh.crc
8599231882    4 -rwx------   1 yarn     hadoop        704 Oct 30 12:13 ./default_container_executor_session.sh
8599231883    4 -rw-------   1 yarn     hadoop         16 Oct 30 12:13 ./.default_container_executor_session.sh.crc
8599231888    4 -rwx------   1 yarn     hadoop        758 Oct 30 12:13 ./default_container_executor.sh
8599231889    4 -rw-------   1 yarn     hadoop         16 Oct 30 12:13 ./.default_container_executor.sh.crc
12884939857 25320 -r-xr-xr-x   1 yarn     hadoop   25925992 Oct 30 12:13 ./knit.jar
broken symlinks(find -L . -maxdepth 5 -type l -ls):

End of LogType:directory.info

I'm not sure what else I can do to help debug here. For now I'm relying on a custom build of an older commit.

martindurant commented 7 years ago

I am not having any trouble with this code, whether the path to the zip is relative or absolute, and whether or not I use hdfs_home.

I would look in the resource manager or node manager logs to see if there is a java traceback. The only thing I can immediately think of is a permissions error (in the previous version, I'm not sure the file was being uploaded to the right location). Which symlinks are broken, above, and where is the directory.info output coming from?

jcrist commented 7 years ago

> I would look in the resource manager or node manager logs to see if there is a java traceback.

Nothing stands out in the resource manager logs. I was unable to get the nodemanager logs.

> The only thing I can immediately think of is a permissions error (in the previous version, I'm not sure the file was being uploaded to the right location).

In both the failing case and the successful (old commit) case, the file is uploaded to hdfs in the same location with the same permissions.

> Which symlinks are broken, above, and where is the directory.info output coming from?

After reading the log a bit more closely (the bash commands that generate it are part of the log), the Broken Symlinks line is printed first as a header, then broken symlinks are printed below it. In both the working and failing cases there are none.

The directory.info log is the first part of every aggregated log; it displays the whole file tree in the cwd for each container. In the working case I see the whole contents of the unzipped file, while in the failing case I only see the knit jar and a few static files.
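
The comparison described above (which entries show up in a container's working directory) can be automated. The following is a minimal sketch, not part of knit: the helper names `listed_names` and `missing_resources` are hypothetical, and it assumes the directory.info text has the same `ls -l:` / `find -L` layout as the excerpt earlier in this thread. The name under which YARN localizes an uploaded file is also an assumption; adjust the expected names to whatever appears in a working run.

```python
def listed_names(directory_info):
    """Collect entry names from the `ls -l:` part of a directory.info log.

    Assumes the layout shown in this thread: an `ls -l:` header, a
    `total N` line, one entry per line, then a `find -L ...` header.
    """
    names = set()
    in_ls = False
    for raw in directory_info.splitlines():
        line = raw.strip()
        if line.startswith("ls -l:"):
            in_ls = True
            continue
        if line.startswith("find -L"):
            break  # the find listing repeats the same entries
        if not in_ls or not line or line.startswith("total "):
            continue
        # Fields: mode, links, owner, group, size, month, day, time, name
        fields = line.split(None, 8)
        if len(fields) == 9:
            # Symlinks are printed as "name -> target"; keep only the name.
            names.add(fields[8].split(" -> ")[0])
    return names


def missing_resources(directory_info, expected):
    """Return the expected entry names that the listing does not contain."""
    present = listed_names(directory_info)
    return [name for name in expected if name not in present]
```

Run against the failing container's listing with the names seen in a working run, `missing_resources` returns the entries that were never localized, which makes the difference between the two cases explicit.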


After debugging some more, I'm a bit confused. From reading the code, the -1000 exit status is returned if the container is not completed (it is an invalid value, defined here). However, the onContainersCompleted method should only be called on containers that are completed, and the logs confirm that the state is COMPLETED. I tried patching knit to also ignore -1000 exit statuses, in case those were somehow actually valid, but no luck. Perhaps the record inconsistency here is due to a silenced error elsewhere causing a failure to update? Anyway, this seems to indicate something is odd, but the -1000 value itself doesn't seem to indicate anything useful.

quartox commented 6 years ago

I am also seeing ExitStatus: -1000 with environments that worked with knit 2.2

martindurant commented 6 years ago

btw: -1000 seems to be the exit code that gets assigned to a container before attempting to run any command, and in this case, the command was never executed, so the code remained unchanged. Not all that useful!
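
For reference, a small lookup sketch of the exit codes discussed in this thread. The numeric values are taken, to the best of my knowledge, from Hadoop's `org.apache.hadoop.yarn.api.records.ContainerExitStatus`; verify them against the Hadoop version you run. The helper `describe_exit_status` is hypothetical and not part of knit.

```python
# A subset of YARN's reserved container exit statuses. Values believed to
# match org.apache.hadoop.yarn.api.records.ContainerExitStatus; check them
# against your Hadoop version.
YARN_EXIT_STATUS = {
    0: "SUCCESS",
    -1000: "INVALID",      # initial value, before any command has run
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
}


def describe_exit_status(code):
    """Return a readable label for a container exit status."""
    name = YARN_EXIT_STATUS.get(code)
    if name is not None:
        return "{} ({})".format(name, code)
    # Anything else is treated here as coming from the launched command.
    return "exit code {} from the container command".format(code)


print(describe_exit_status(-1000))
```

Seeing `INVALID` on a container whose state is COMPLETED matches the explanation above: the command never ran, so the placeholder value was never overwritten.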