Closed: georgeamccarthy closed this issue 3 years ago.
Closing this, as we're gonna use GCP for now. Might occur again there, we'll see.
@georgeamccarthy does GCP give you the same problem, or has that been fixed?
@Rubix982 Yes it does, I believe some network activity is required when the ProtBert model is loaded for which I should open the firewall.
Are you experiencing this issue?
@georgeamccarthy let's keep this open until #30 is merged/closed.
When you set up a jina backend on any instance, it needs to be bootstrapped by some entry point, most likely a `main` that contains the jina `Flow`. Since this project relies on `Rostlab/prot_bert`, what most likely happens when the following lines are called,
```python
flow = (
    Flow(port_expose=8020, protocol='http')
    .add(uses=ProtBertExecutor)
    .add(uses=MyIndexer)
)
```
is that the `Rostlab/prot_bert` model needs to be downloaded from Hugging Face and set up by `my_executors.py` via the line `BertModel.from_pretrained("Rostlab/prot_bert")`.
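For concreteness, here is a hedged sketch of what the relevant part of `my_executors.py` most likely looks like (the tokenizer line and exact structure are assumptions; only the `from_pretrained` call and the class name appear in the logs below). The point is that the blocking download happens inside the executor's `__init__`, i.e. during jina's pod-startup window:

```python
try:
    from jina import Executor
except ImportError:
    # Minimal stand-in so this sketch can be read/run without jina installed.
    class Executor:
        def __init__(self, **kwargs):
            pass


class ProtBertExecutor(Executor):
    """Sketch: the model download blocks __init__, so jina times out
    while waiting for this executor to come up."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Imported lazily for the sketch; on first run this call blocks
        # until a large (GB-scale) download from Hugging Face completes.
        from transformers import BertModel, BertTokenizer
        self.tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert")
        self.model = BertModel.from_pretrained("Rostlab/prot_bert")
```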
This conflicts with the jina setup. On one hand, jina is trying to load the executor to start the Flow; on the other, the executor is 'blocked' until the Hugging Face model is completely downloaded and set up. Thus there is a runtime timeout: jina decides it has 'failed' to load the `ProtBertExecutor` (which is actually just blocked), because the `Executor` is taking longer than expected to load.
jina throws an exception that it is failing to load the executor, with logs similar to:
```
protein-search-backend | pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend | pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend | add "--quiet-error" to suppress the exception details
```
This exception propagates to `my_executors.py`, which also starts throwing exceptions, because the calling object for `ProtBertExecutor` (see the Python lines above) has thrown. At this point `my_executors.py` has no option but to 'gracefully' shut down and close the connection it was using to download the model. In networking terms, the TCP connection that the client underneath `my_executors.py` opened to Hugging Face for the pre-trained model is torn down with an RST packet, resetting the connection instead of finishing it cleanly. That is why you may see logs similar to:
```
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend |     yield
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend |     data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend |     n = self.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend |     n = self.fp.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend |     return self._sock.recv_into(b)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend |     return self.read(nbytes, buffer)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend |     return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
```
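As a side note, this exact `[Errno 104] Connection reset by peer` can be reproduced in isolation with nothing but the standard library (a sketch, unrelated to the project code): closing a socket with `SO_LINGER` set to a zero timeout makes the kernel send a TCP RST instead of the normal FIN handshake, and the peer's blocked `recv` fails with `ECONNRESET`.

```python
import errno
import socket
import struct
import threading


def abrupt_server(ready: threading.Event, port_box: list) -> None:
    """Accept one connection, then close it with SO_LINGER=0 so the
    kernel sends a TCP RST instead of a graceful FIN."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_box.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # linger on, timeout 0 -> close() emits RST rather than FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()


def provoke_reset() -> str:
    """Connect, wait for data, and report the errno name of the reset."""
    ready, port_box = threading.Event(), []
    t = threading.Thread(target=abrupt_server, args=(ready, port_box))
    t.start()
    ready.wait()
    client = socket.create_connection(("127.0.0.1", port_box[0]))
    try:
        client.recv(1024)  # blocks until the RST lands
        return "no reset"
    except ConnectionResetError as exc:
        return errno.errorcode[exc.errno]
    finally:
        client.close()
        t.join()
```

On Linux, `provoke_reset()` returns `"ECONNRESET"` (errno 104), the same failure that `urllib3` surfaces in the traceback above.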
Going a bit deeper into the problem, we see the logs:
```
protein-search-backend |   File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend |     model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend |     raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend |
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
```
This indicates that `BertModel.from_pretrained("Rostlab/prot_bert")` failed to fully download the pre-trained model, causing jina to exit with status `1`, which in UNIX exit-code convention means `EXIT_FAILURE`.
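That exit-status convention is easy to confirm with a stdlib-only sketch (illustrative, not project code): a Python process that dies on an uncaught exception, like the `OSError` above, exits with status 1, while a clean run exits with 0.

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run `snippet` in a fresh interpreter and return its exit status."""
    return subprocess.run(
        [sys.executable, "-c", snippet], capture_output=True
    ).returncode


print(exit_code_of("pass"))                       # 0 (EXIT_SUCCESS)
print(exit_code_of("raise OSError('no model')"))  # 1 (EXIT_FAILURE)
```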
Super interesting writeup! I had seen that error before but never really known why. We could try loading `BertModel` outside jina, just within a plain Python Docker environment, and see if that works, then slowly add complexity until we get something working. Or, to test whether the above hypothesis is correct directly, load `BertModel` outside of a Flow. I think the latter case has been implemented for indexing in #50, but not for searching.
At the moment, I'm trying to set up a Docker flow where the pretrained model gets fetched before the Flow execution. Further discussion in #30.
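One hedged sketch of such a pre-fetch step (function name and placement are assumptions, not project code): call `from_pretrained` once at image build time, e.g. from a Dockerfile `RUN` line, so the weights land in the local Hugging Face cache and the executor never touches the network when the Flow starts.

```python
def prefetch_prot_bert(model_id: str = "Rostlab/prot_bert") -> None:
    """Warm the local Hugging Face cache before the Flow ever starts."""
    # Imported lazily so this module can be inspected without transformers.
    from transformers import BertModel, BertTokenizer
    BertTokenizer.from_pretrained(model_id)
    BertModel.from_pretrained(model_id)


if __name__ == "__main__":
    # e.g. in the Dockerfile:  RUN python prefetch.py
    prefetch_prot_bert()
```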
This should be closed by #30
@georgeamccarthy I believe this can be closed now.
Model loading has changed with #30, so this is not relevant anymore.
**Describe the bug**
When running the backend on the most basic DigitalOcean droplet (1 GB Memory / 25 GB Disk / LON1 - Ubuntu 20.04 (LTS) x64), Python cannot execute
```python
model = BertModel.from_pretrained("Rostlab/prot_bert")
```
and the app hangs on this line until timeout.

**To Reproduce**
Steps to reproduce the behavior:
```
python3 backend/app.py
```

**Expected behavior**
Should load the model before timeout.