Closed: georgeamccarthy closed this issue 3 years ago.
Closing this, as we're gonna use GCP for now. Might occur again there, we'll see.
@georgeamccarthy does GCP give you the same problem, or has that been fixed?
@Rubix982 Yes it does, I believe some network activity is required when the ProtBert model is loaded for which I should open the firewall.
Are you experiencing this issue?
@georgeamccarthy let's keep this open until #30 is merged/closed.
When you set up a jina backend on any instance, it needs to be bootstrapped by some entry point, most likely a `main` that contains the jina `Flow`. Since this project relies on `Rostlab/prot_bert`, what most likely happens when the following lines are called,
```python
flow = (
    Flow(port_expose=8020, protocol='http')
    .add(uses=ProtBertExecutor)
    .add(uses=MyIndexer)
)
```
is that the `Rostlab/prot_bert` model needs to be downloaded from Hugging Face and set up by `my_executors.py` via the line `BertModel.from_pretrained("Rostlab/prot_bert")`.
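For concreteness, here is a hedged sketch of what the relevant part of `my_executors.py` most likely looks like (the tokenizer line and exact structure are assumptions; only the `from_pretrained` call and the class name appear in the logs below). The point is that the blocking download happens inside the executor's `__init__`, i.e. during jina's pod-startup window:

```python
try:
    from jina import Executor
except ImportError:
    # Minimal stand-in so this sketch can be read/run without jina installed.
    class Executor:
        def __init__(self, **kwargs):
            pass


class ProtBertExecutor(Executor):
    """Sketch: the model download blocks __init__, so jina times out
    while waiting for this executor to come up."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Imported lazily for the sketch; on first run this call blocks
        # until a large (GB-scale) download from Hugging Face completes.
        from transformers import BertModel, BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert")
        self.model = BertModel.from_pretrained("Rostlab/prot_bert")
```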
This conflicts with the jina setup. On one hand, jina is trying to load the executor to start the Flow; on the other, the executor is 'blocked' until the Hugging Face model is completely downloaded and set up. Thus there is a runtime timeout: jina decides it has 'failed' to load the `ProtBertExecutor` (which is actually just blocked), because the `Executor` is taking longer than expected to load.
jina throws an exception that it is failing to load the executor, with logs similar to:
```
protein-search-backend | pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend | pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend | add "--quiet-error" to suppress the exception details
```
This exception propagates to `my_executors.py`, which also starts throwing exceptions, because the calling object for `ProtBertExecutor` (see the Python lines above) has thrown. At this point `my_executors.py` has no option but to 'gracefully' shut down and close the connection it was using to download the model. In networking terms, the TCP connection that the client underneath `my_executors.py` opened to Hugging Face for the pre-trained model is torn down with an RST packet, resetting the connection instead of finishing it cleanly. That is why you may see logs similar to:
```
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend |     yield
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend |     data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend |     n = self.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend |     n = self.fp.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend |     return self._sock.recv_into(b)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend |     return self.read(nbytes, buffer)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend |     return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
```
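As a side note, this exact `[Errno 104] Connection reset by peer` can be reproduced in isolation with nothing but the standard library (a sketch, unrelated to the project code): closing a socket with `SO_LINGER` set to a zero timeout makes the kernel send a TCP RST instead of the normal FIN handshake, and the peer's blocked `recv` fails with `ECONNRESET`.

```python
import errno
import socket
import struct
import threading


def abrupt_server(ready: threading.Event, port_box: list) -> None:
    """Accept one connection, then close it with SO_LINGER=0 so the
    kernel sends a TCP RST instead of a graceful FIN."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_box.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # linger on, timeout 0 -> close() emits RST rather than FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()


def provoke_reset() -> str:
    """Connect, wait for data, and report the errno name of the reset."""
    ready, port_box = threading.Event(), []
    t = threading.Thread(target=abrupt_server, args=(ready, port_box))
    t.start()
    ready.wait()
    client = socket.create_connection(("127.0.0.1", port_box[0]))
    try:
        client.recv(1024)  # blocks until the RST lands
        return "no reset"
    except ConnectionResetError as exc:
        return errno.errorcode[exc.errno]
    finally:
        client.close()
        t.join()
```

On Linux, `provoke_reset()` returns `"ECONNRESET"` (errno 104), the same failure that `urllib3` surfaces in the traceback above.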
Going a bit deeper into the problem, we see the logs:
```
protein-search-backend |   File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend |     model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend |     raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend |
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
```
This indicates that `BertModel.from_pretrained("Rostlab/prot_bert")` failed to fully download the pre-trained model, causing jina to exit with status `1`, which in UNIX exit-code convention means `EXIT_FAILURE`.
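That exit-status convention is easy to confirm with a stdlib-only sketch (illustrative, not project code): a Python process that dies on an uncaught exception, like the `OSError` above, exits with status 1, while a clean run exits with 0.

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run `snippet` in a fresh interpreter and return its exit status."""
    return subprocess.run(
        [sys.executable, "-c", snippet], capture_output=True
    ).returncode


print(exit_code_of("pass"))                       # 0 (EXIT_SUCCESS)
print(exit_code_of("raise OSError('no model')"))  # 1 (EXIT_FAILURE)
```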
Super interesting writeup! I had seen that error before but never really known why. We could try loading `BertModel` outside jina, just within a plain Python Docker environment, and see if that works, then slowly add complexity until we get something working. Or, to test whether the above hypothesis is correct directly, load `BertModel` outside of a Flow. I think the latter case has been implemented for indexing in #50, but not for searching.
At the moment, I'm trying to set up a Docker flow where the pretrained model gets fetched before the Flow execution. Further discussion in #30.
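One hedged sketch of such a pre-fetch step (function name and placement are assumptions, not project code): call `from_pretrained` once at image build time, e.g. from a Dockerfile `RUN` line, so the weights land in the local Hugging Face cache and the executor never touches the network when the Flow starts.

```python
def prefetch_prot_bert(model_id: str = "Rostlab/prot_bert") -> None:
    """Warm the local Hugging Face cache before the Flow ever starts."""
    # Imported lazily so this module can be inspected without transformers.
    from transformers import BertModel, BertTokenizer
    BertTokenizer.from_pretrained(model_id)
    BertModel.from_pretrained(model_id)


if __name__ == "__main__":
    # e.g. in the Dockerfile:  RUN python prefetch.py
    prefetch_prot_bert()
```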
This should be closed by #30
@georgeamccarthy I believe this can be closed now.
Model loading has changed with #30, so this is not relevant anymore.
**Describe the bug**
When running the backend on the most basic DigitalOcean droplet (1 GB Memory / 25 GB Disk / LON1 - Ubuntu 20.04 (LTS) x64), Python cannot execute
```python
model = BertModel.from_pretrained("Rostlab/prot_bert")
```
and the app hangs on this line until timeout.

**To Reproduce**
Steps to reproduce the behavior:
```
python3 backend/app.py
```

**Expected behavior**
Should load the model before timeout.