Closed JessicaLopezEspejel closed 2 years ago
Hi @JessicaLopezEspejel, not the author here. Just stumbled across your issue. So not sure if I am qualified to help. But the error message suggests that Python is having a hard time finding the client module. This could be due to missing environment variables, for example if you are running Picard outside of the provided docker-containers.
Check if the following environment variables are set and if not, adjust them to your setup and set them:
export PATH=/home/toolkit/.local/bin:$PATH
export PYTHONPATH=/app:/app/gen-py3
export LD_LIBRARY_PATH=/app/third_party/tokenizers/target/release:/app/gen-py3/picard
export BOOTSTRAP_HASKELL_NONINTERACTIVE=yes
export BOOTSTRAP_HASKELL_NO_UPGRADE=yes
export GHCUP_USE_XDG_DIRS=yes
export GHCUP_INSTALL_BASE_PREFIX=/app
export CABAL_DIR=/app/.cabal
export PATH=/app/.cabal/bin:/app/.local/bin:$PATH
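For anyone checking this from inside Python, here is a minimal sketch that reports which of the path entries from the exports above are missing and whether `picard.clients` can be located. The helper names `missing_entries` and `module_available` are made up for illustration, and the expected values simply mirror the list above, so adjust them to your setup.

```python
import importlib.util
import os

# Path-like variables and entries copied from the export list above.
EXPECTED = {
    "PYTHONPATH": ["/app", "/app/gen-py3"],
    "LD_LIBRARY_PATH": [
        "/app/third_party/tokenizers/target/release",
        "/app/gen-py3/picard",
    ],
    "PATH": ["/app/.cabal/bin", "/app/.local/bin", "/home/toolkit/.local/bin"],
}


def missing_entries(env=None):
    """Return {variable: [expected entries that are absent]}."""
    env = os.environ if env is None else env
    report = {}
    for name, wanted in EXPECTED.items():
        present = [p for p in env.get(name, "").split(os.pathsep) if p]
        absent = [p for p in wanted if p not in present]
        if absent:
            report[name] = absent
    return report


def module_available(name):
    """True if `name` can be located on the current import path."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (here: `picard`) cannot be imported.
        return False


if __name__ == "__main__":
    for var, absent in missing_entries().items():
        print(f"{var} is missing: {os.pathsep.join(absent)}")
    print("picard.clients importable:", module_available("picard.clients"))
```

Note that this only covers the Python import path and the library search path. It says nothing about whether the non-Python dependencies are actually installed.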
What I am trying to do is related. @tscholak, is there an easy way to run the Picard server/client as a standalone service, for example to use it for multiple parallel evaluations? What has worked for me so far is to start one long-running evaluation with `"launch_picard": true` and all other (shorter) ones with `"launch_picard": false`. This is not a nice solution, but it seemed like a quick proof of concept to test whether parallel processing would work.
When trying to launch the client standalone I got stuck at Thrift because I have never used it before. So I am currently looking into that. But maybe there is a simpler way to launch it?
Hi @JessicaLopezEspejel and @constantin-huetterer, author here.
> This could be due to missing environment variables, for example if you are running Picard outside of the provided docker-containers.
Missing env variables but also missing non-python dependencies. The toolchain for facebook's thrift library is long and complicated. I had to work long and hard to package it in the docker images, and I recommend to anyone to stick to them.
@constantin-huetterer
> is there an easy way to run the Picard Server/Client as a standalone-service?
Do you want to deploy the Picard executable independently from the python code and on different hosts?
Hello @tscholak,
thank you for the quick reply! I am currently testing a new dataset released by researchers of our university as part of my thesis. The university therefore provided me with a virtual machine that is equipped with multiple GPUs. I tried to maximize throughput on our different evaluation splits by starting the split with the most examples first and launching Picard with it. For the other splits I disabled the launch of Picard to see whether they would use the same endpoint (on the same host) while processing the beams, so that the evaluations could run in parallel. This seems to work well. To decouple Picard further (from the longest-running evaluation), I could probably use the serve Python script as a base and modify it a little, so that it does not start the HTTP endpoint but still launches Picard. I would have tried this next. However, one of the evaluations crashed. Now a semaphore seems to have leaked and remains somewhere in the system, blocking Picard/Thrift from starting. Error message: `<TransportErrorType.UNKNOWN: 0>, 'Channel is !good()'`. I have seen this message in other issues but found no solution.
I have tried to debug this for a rather long time yesterday, hoping I could find and manually clean up the leaked Semaphore somewhere in the SysV / Posix parallelization mechanism. I believe that the problem would be solved by a reboot of the container. Unfortunately I don't have permission to do that. This is why I tried to dive deeper into what Picard is doing behind the scenes.
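For readers in the same situation, a minimal sketch of where leaked semaphores can be inspected on Linux without extra tools. It assumes glibc's convention of backing POSIX named semaphores with `sem.<name>` files under `/dev/shm`, and the kernel's SysV table at `/proc/sysvipc/sem`. Whether Picard or Thrift uses either mechanism is not established in this thread, and the function names are made up for illustration.

```python
import os


def posix_named_semaphores(shm_dir="/dev/shm"):
    """List POSIX named semaphores, which glibc stores as files named sem.<name>."""
    try:
        names = os.listdir(shm_dir)
    except OSError:
        return []
    return sorted(n for n in names if n.startswith("sem."))


def sysv_semaphore_sets(table="/proc/sysvipc/sem"):
    """Parse the kernel's SysV semaphore table into one dict per semaphore set."""
    try:
        with open(table) as fh:
            lines = fh.read().splitlines()
    except OSError:
        return []
    if not lines:
        return []
    header = lines[0].split()  # key, semid, perms, nsems, uid, ...
    return [dict(zip(header, row.split())) for row in lines[1:] if row.strip()]


if __name__ == "__main__":
    print("POSIX named semaphores:", posix_named_semaphores())
    for sem_set in sysv_semaphore_sets():
        print("SysV set:", sem_set)
```

Unnamed semaphores that live only in a process's memory do not appear in either place, so empty output here does not rule out a leak.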
To summarize: Picard crashed and a semaphore got leaked. Because I couldn't restart the Docker container, my main intention was to get a better understanding of what is going on behind the scenes. When the error is thrown, it looks like Picard didn't start: there is no endpoint on port 9090 to talk to. So I was looking for a way to start Picard without the Python launcher and see if I could narrow down the source of the error.
PS: If you would like and my supervisors give the green light, I can open a pull request that adds the new dataset to this repo.
> <TransportErrorType.UNKNOWN: 0>, 'Channel is !good()'
That means that thrift client and server cannot communicate. This can be caused by a number of things. Usually it's because the server isn't running.
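A quick way to test the "server isn't running" case is to try a plain TCP connection to the port. This is a minimal sketch using only the standard library; the function name is made up, and port 9090 is the hardcoded Picard port discussed in this thread.

```python
import socket


def server_is_listening(host="localhost", port=9090, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("something is listening on 9090:", server_is_listening())
```

A successful connection only shows that some process accepts TCP connections on that port, not that it is a healthy Picard server.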
--
Based on your use case, I can see a number of solutions:
- Run multiple eval jobs concurrently and one picard server per job. It is possible to run more than one thrift server per host if every server uses a different port. However, unfortunately, for picard, I have hardcoded the port to `9090`, https://github.com/ServiceNow/picard/blob/6a252386bed6d4233f0f13f4562d8ae8608e7445/picard/exe/Main.hs#L649, and that means that the port will be in use once the first picard server starts. Someone needs to implement a change that makes this port configurable. I don't know when/if I will have time for that. Contributions welcome.
- Run one eval job only and use all GPUs for it. You can tell hf/pytorch to use all GPUs for inference. That will increase your effective eval batch size from `bs` to `n * bs` for `n` GPUs, and that will speed up evaluation by a factor of `n`. The picard server will work just fine with this.
- Run multiple eval jobs and only one picard server that is shared by all of them. You can start the `picard` executable from a terminal, say, in `tmux`, and then launch a bunch of eval jobs that connect to it. I haven't tried that myself, though, and I don't know if there are any gotchas or instabilities with this setup. If the picard server dies or becomes unresponsive, all clients will time out and eventually crash. There is no continuation from a crashed eval job, unfortunately.
I don't think you should start the first job with `launch_picard = True` and the others with `launch_picard = False`, because the picard server will belong to the process of the first job, and when that one terminates, the picard server will terminate, too. The picard launch wrapper was meant as a convenience tool for people who are blissfully unaware that picard is a dedicated server component. Your use case doesn't fit the pattern for which this launcher is useful.
About that messed up port issue. I don't know how to fix it with limited permissions. As root, I would use `lsof` or `fuser` (google it) to figure out which (zombie) process uses that port and kill it. Not sure how much can be done as non-root. If the picard code was updated to make the port configurable, you could redeploy.
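Where `lsof` and `fuser` are not installed, the same lookup can be approximated on Linux by reading `/proc` directly. The following is a minimal sketch under that assumption; the function names are made up for illustration. As non-root, only the file descriptors of your own processes are readable, so a process owned by another user will not show up.

```python
import os
import re


def listening_inodes(port, tables=("/proc/net/tcp", "/proc/net/tcp6")):
    """Return socket inodes in LISTEN state (st == 0A) bound to `port`."""
    inodes = set()
    for table in tables:
        try:
            with open(table) as fh:
                rows = fh.read().splitlines()[1:]  # skip the header line
        except OSError:
            continue
        for row in rows:
            fields = row.split()
            if len(fields) < 10:
                continue
            # local_address is HEXIP:HEXPORT; the inode is the tenth column.
            local_port = int(fields[1].rsplit(":", 1)[1], 16)
            if local_port == port and fields[3] == "0A":
                inodes.add(fields[9])
    return inodes


def pids_holding(inodes, proc="/proc"):
    """Return PIDs that have a file descriptor pointing at one of the inodes."""
    pids = set()
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # not our process, or it exited in the meantime
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            match = re.fullmatch(r"socket:\[(\d+)\]", target)
            if match and match.group(1) in inodes:
                pids.add(int(entry))
    return pids


if __name__ == "__main__":
    sockets = listening_inodes(9090)
    print("listening socket inodes:", sockets)
    print("owning PIDs visible to this user:", pids_holding(sockets))
```

If `listening_inodes(9090)` is empty, nothing is listening on that port in the current network namespace, and the problem lies elsewhere.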
Btw, have you heard of docker-in-docker?
Re PS: Yes, I'd love to accept a new dataset contribution!
Thanks again for the quick and detailed reply! This is very helpful. :slightly_smiling_face:
> I don't think you should start the first job with `launch_picard = True` and the others with `launch_picard = False`, because the picard server will belong to the process of the first job, and when that one terminates, the picard server will terminate, too. The picard launch wrapper was meant as a convenience tool for people who are blissfully unaware that picard is a dedicated server component. Your use case doesn't fit the pattern for which this launcher is useful.
I absolutely agree. This was just a quick and dirty workaround to test if I could run multiple evaluations using Picard without having to write/rewrite a lot of code in the Container.
> About that messed up port issue. I don't know how to fix it with limited permissions. As root, I would use `lsof` or `fuser` (google it) to figure out which (zombie) process uses that port and kill it. Not sure how much can be done as non-root. If the picard code was updated to make the port configurable, you could redeploy.
This was my first instinct as well. Unfortunately, neither `lsof` nor `fuser` is available in this environment, and I can't install them either. I also tried to terminate all of the user processes that I was able to see, effectively pulling the plug on the docker run command and hard-rebooting the container. But even this didn't work. I believe the leaked semaphores are stored somewhere higher up in the SysV / POSIX synchronization mechanism, beyond the scope of my user. I also tried to manually purge `/dev/shm`, to no avail. The syslog contains the message `There appear to be 6 leaked semaphores to clean up at shutdown.` So I hope a proper reboot will help. I'll know more tomorrow. Otherwise I'll try your next proposal:
> Btw, have you heard of docker-in-docker?
I haven't, yet. Thank you for mentioning it - I will definitely take a look because it could make a lot of things easier in this environment. Yesterday evening the thought of running docker inside of the docker container crossed my mind briefly. But since I had already spent a decent amount of time looking for the source of the error, I decided to write it off for now and wait for Monday, when the admin can restart the container.
> Based on your use case, I can see a number of solutions:
> - Run multiple eval jobs concurrently and one picard server per job. It is possible to run more than one thrift server per host if every server uses a different port. However, unfortunately, for picard, I have hardcoded the port to `9090`, https://github.com/ServiceNow/picard/blob/6a252386bed6d4233f0f13f4562d8ae8608e7445/picard/exe/Main.hs#L649, and that means that the port will be in use once the first picard server starts. Someone needs to implement a change that makes this port configurable. I don't know when/if I will have time for that. Contributions welcome.
I actually tried this first, when I noticed the `picard_port` variable in the PicardParameters object. As you wrote, this isn't picked up by the backend yet. I would love to contribute here, but I haven't used Haskell yet. It is one of the languages I have always wanted to look into but haven't gotten to. Since the deadline is currently breathing down my neck, I wouldn't feel confident contributing hasty changes to this part of the codebase.
> - Run one eval job only and use all GPUs for it. You can tell hf/pytorch to use all GPUs for inference. That will increase your effective eval batch size from `bs` to `n * bs` for `n` GPUs, and that will speed up evaluation by a factor of `n`. The picard server will work just fine with this.
> - Run multiple eval jobs and only one picard server that is shared by all of them. You can start the `picard` executable from a terminal, say, in `tmux`, and then launch a bunch of eval jobs that connect to it. I haven't tried that myself, though, and I don't know if there are any gotchas or instabilities with this setup. If the picard server dies or becomes unresponsive, all clients will time out and eventually crash. There is no continuation from a crashed eval job, unfortunately.
Those two proposals are what I am aiming for, once the container is up and running again. :rocket:
Yesterday I was trying the latter. But I approached this with a lack of understanding of what Thrift is and how it works. Now that I have looked into it, I understand that the Thrift server is started by the Haskell code itself. So simply running `runhaskell /app/picard/exe/Main.hs` should be enough to start the Picard server as a standalone, right? It will listen on port 9090 by default, where multiple RPC clients should (in theory) be able to connect to it.(?)
> Re PS: Yes, I'd love to accept a new dataset contribution!
Great, then I'll check back with my supervisor at which stage of the publishing process I can open one. :slightly_smiling_face:
> So simply running `runhaskell /app/picard/exe/Main.hs` should be enough to start the Picard server as a standalone, right? It will listen on port 9090 by default, where multiple RPC clients should (in theory) be able to connect to it.(?)
Yes, though `runhaskell` will compile the Haskell program and then run it, which requires that you have the whole Haskell toolchain plus all Haskell dependencies in your environment. There is a compiled standalone executable, `picard`, in the docker image. It's dynamically linked, though, so you still need to assemble all thrift and Facebook C-toolchain libraries for it to work.
Perfect, thank you very much! That is a very good point and exactly what I was looking for initially. That was a very helpful discussion.
Just to clarify and write this down explicitly, in case someone stumbles over this ticket: it's this executable: `./.cabal/bin/picard`.
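For the shared-server setup discussed above, a minimal sketch of starting that executable detached from the eval jobs and waiting until port 9090 accepts connections. The absolute path assumes `CABAL_DIR=/app/.cabal` as in the exports earlier in this thread, running the binary without arguments is also an assumption, and the helper names are made up for illustration. As noted above, this setup is untested.

```python
import socket
import subprocess
import time

# Assumed absolute path, derived from CABAL_DIR=/app/.cabal; adjust as needed.
PICARD_BIN = "/app/.cabal/bin/picard"


def start_detached(cmd):
    """Start `cmd` in its own session so it is not tied to the launching job."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


def wait_for_port(port=9090, host="localhost", timeout=30.0, interval=0.5):
    """Poll until host:port accepts TCP connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage (not executed here):
#   server = start_detached([PICARD_BIN])
#   if not wait_for_port():
#       raise SystemExit(f"picard (pid {server.pid}) did not open port 9090")
#   # then start the eval jobs with "launch_picard": false
```

Because the server runs in its own session, it is not stopped when the first eval job finishes, which is the drawback of the launcher-based workaround described earlier.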
Very good, I'm going to close this issue.
Hello,
I would like to ask how I can activate Picard. I mean, from the following code I got `Picard is not available.` When I printed the error, I got the message `No module named 'picard.clients'`. I would really appreciate any help. Thank you.