georgeamccarthy / protein_search

The neural search engine for proteins.
GNU Affero General Public License v3.0

[Structure] Dockerizing the project #30

Closed Rubix982 closed 3 years ago

Rubix982 commented 3 years ago

Pull Request Type

Purpose

Why?

Changes Introduced

Bugs (WIP)

Notes

Feedback required over

Mentions

Rubix982 commented 3 years ago

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

Rubix982 commented 3 years ago

These are the logs the backend generates after I've installed aiohttp. I need help with this: I'm not sure what a peapod is, or what resource it is trying to reach that throws a TimeoutError instead, referring to the message below,

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted

Other than that, the frontend still cannot establish a connection.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore?

[screenshot: backend download log]

Notice in the 5th download line, it is pulling something 1.68G in size (scary :fearful: ).

Rubix982 commented 3 years ago

Checked. I deleted the containers entirely, removed embeddings and data/pdb_data_seq.csv, then built the containers from scratch. The backend still pulls them, as seen in the screenshot below. We can either work out why this happens, or simply add these to .gitignore as well. Also, jina always downloads 4 things but doesn't mention what they are, as seen in the screenshot below,

[screenshot: jina download log]

After this log output, the container spends exactly 10 minutes (600000 ms) trying to reach a peapod, as shown in the screenshot in the previous comment.

georgeamccarthy commented 3 years ago

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?
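One way to keep the 1.68G weights from being re-downloaded every time a container is rebuilt is to persist the Hugging Face cache in a named volume; a hedged docker-compose sketch (the service name and build path are assumptions; transformers caches under ~/.cache/huggingface by default, and the traceback paths suggest the container user's home is /home/jina):

```yaml
services:
  backend:
    build: ./backend
    volumes:
      # Named volume survives container rebuilds, so the model
      # weights are downloaded only once
      - hf-cache:/home/jina/.cache/huggingface

volumes:
  hf-cache:
```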

These are the logs the backend generates after I've installed aiohttp. I need help with this: I'm not sure what a peapod is, or what resource it is trying to reach that throws a TimeoutError instead, referring to the message below,

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted

Not really sure what a Peapod is, but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue in #31, which I think might be related. Not sure what to suggest for now; I'm looking into it. It's possible that it's unrelated to Docker.

Other than that, the frontend still cannot establish a connection.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore? https://user-images.githubusercontent.com/41635766/126763589-504ae2ce-6120-492c-8e4c-8b55418cb069.png Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).

😱 Yes please! Currently we're having the host computer compute the embeddings on first run.
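A minimal sketch of the resulting `.gitignore` entries (paths taken from the discussion above; exact locations depend on where the repo root sits relative to backend/):

```gitignore
# Generated on first run; large, reproducible artifacts
backend/embeddings/
backend/data/pdb_data_seq.csv
```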

Rubix982 commented 3 years ago

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

Yep!

Not really sure what a Peapod is, but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue in #31, which I think might be related. Not sure what to suggest for now; I'm looking into it. It's possible that it's unrelated to Docker.

Interesting.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Noted.

Yes please! Currently we're having the host computer compute the embeddings on first run.

Noted.

fissoreg commented 3 years ago

Some general remarks:

Cool stuff @Rubix982 , welcome to the project! ;)

Rubix982 commented 3 years ago

adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the docs

The problem with this is that I have to figure out (and I'm not sure it's possible) how to cache that. I did try pip install "jina[client,http]", but for me it was not caching, which means that on every container build it downloads those dependencies from scratch. That seemed like wasted bandwidth to me.

Python and Docker have this weird interaction: if I install a dependency with the command RUN pip install pkg, it does not get cached. But if I add a requirements.txt with the package name AND version specified, it caches (I'm not sure what magic this is).

I'll try it again in some time to see if this indeed handles the caching the way I want. Docker best practices discourage reinstalling dependencies over and over again; that's a big no-no in distributed computing.
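The caching behaviour described here matches how Docker's layer cache works: each instruction produces a layer, and a `RUN` layer is reused only if nothing in the preceding layers changed. Copying `requirements.txt` in its own step means the `pip install` layer stays cached until that file itself changes. A hedged sketch (file names, base image, and paths are assumptions, not the project's actual Dockerfile):

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Copy only the requirements first: this layer, and the pip install
# layer below, are reused until requirements.txt itself changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source edits invalidate only the layers from here down,
# so dependencies are not reinstalled on every code change.
COPY src/ ./src/
CMD ["python", "src/app.py"]
```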

as a base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user.

I tried this ... and the Docker image only offered some jina-related commands, and I'm not familiar with the CLI that jina offers. I took a quick 10-15 minute look at it, but it did not seem to do what I had in mind for this PR.

By default, Docker creates the container and enters as root. More documentation reading required here.

Dockerfiles for backend and frontend are pretty similar; they could be joined.

For this, I need to know where you guys want me to remove the data folder entirely.

As homework for me, I have to look into,

  1. How does the Jina Docker image work and what does it offer?
  2. Is it possible to provide the same build context and share it among two separate container builds?

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. :+1:

georgeamccarthy commented 3 years ago

In addition to some reading you might find the Jina slack helpful, they really welcome discussion on anything from simple to advanced. I've found it super helpful! :) http://slack.jina.ai

fissoreg commented 3 years ago

The problem with this is that I have to figure out (and I'm not sure it's possible) how to cache that. I did try pip install "jina[client,http]", but for me it was not caching, which means that on every container build it downloads those dependencies from scratch. That seemed like wasted bandwidth to me.

I would have said that wasted bandwidth is better than managing dependencies manually...

I'll try it again in some time to see if this indeed handles the caching the way I want. Docker best practices discourage reinstalling dependencies over and over again; that's a big no-no in distributed computing.

...but this is a convincing remark!

Anyways, the following is strange:

Python and Docker have this weird interaction: if I install a dependency with the command RUN pip install pkg, it does not get cached. But if I add a requirements.txt with the package name AND version specified, it caches (I'm not sure what magic this is).

So I think that the best thing would be to try to understand what is happening there and fix it. Let me add, as @gmelodie would say: "Research time is not wasted time".

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. +1

I hope this will be helpful! :)

georgeamccarthy commented 3 years ago

What's the status on this @Rubix982 @fissoreg? :)

Rubix982 commented 3 years ago

@georgeamccarthy I was not able to work from Mon-Wed because of technical difficulties on my end. There are some pending tasks that I have to get to by this Sunday: some issues to wrap up on @mapillary, and some other tasks here and there.

Is it alright if I come back to this on Sunday?

georgeamccarthy commented 3 years ago

Great! No rush, let us know if you need help. You are welcome to join our standup 30 minutes before the MLH standup on Monday.

Rubix982 commented 3 years ago

Great, thanks! Catch you guys in your standup, then. :D

Rubix982 commented 3 years ago

@georgeamccarthy @fissoreg the torch dependency is 831.4 MB, and is used only once in the entire jina backend, in the following code in my_executors.py,

    def encode_batch(self, docs: DocumentArray, **kwargs) -> DocumentArray:

        log('Preprocessing.')
        sequences = self.preprocessing(docs.get_attributes("text"))

        log('Tokenizing')
        encoded_inputs = self.tokenizer(
            sequences,
            padding=True,
            # max_length expects an int; max(sequences, key=len) alone would
            # pass the longest sequence itself rather than its length
            max_length=len(max(sequences, key=len)),
            return_tensors="pt",
        )

        # inference only: no_grad skips building the autograd graph
        with torch.no_grad():
            log('Computing embeddings.')
            outputs = self.model(**encoded_inputs)
            log('Getting last hidden state.')
            # take the [CLS] token embedding for each sequence
            embeds = outputs.last_hidden_state[:, 0, :].detach().numpy()
            for doc, embed in zip(docs, embeds):
                log(f'Getting embedding {doc.id}')
                doc.embedding = embed

        return docs
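A quick aside on the tokenizer call above: `max(sequences, key=len)` evaluates to the longest sequence string itself, not its length, while a tokenizer's `max_length` parameter expects an integer, so the call wants `len(max(sequences, key=len))`. A minimal illustration in plain Python (the sequences are hypothetical):

```python
# max() with key=len returns the longest *item*, not its length
sequences = ["MKT", "MKTAYIAK", "MK"]

longest = max(sequences, key=len)
assert longest == "MKTAYIAK"

# what max_length actually wants: the length of the longest sequence
max_len = len(max(sequences, key=len))
assert max_len == 8
```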

Can this be replaced with something more lightweight? It makes the container larger than 900 MB including other dependencies, so it's not lightweight in terms of storage.

georgeamccarthy commented 3 years ago

Gian knows more about this so cc @fissoreg

fissoreg commented 3 years ago

Unfortunately the torch dependency is used internally by the transformers library, so we need it. All of the neural processing is done with torch.
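If image size is the main concern, one possible mitigation is the CPU-only torch wheel, which is much smaller than the default wheel because it does not bundle CUDA libraries. A hedged `requirements.txt` sketch (the pinned version and find-links URL are assumptions and would need checking against the torch version the project actually uses):

```text
# CPU-only torch build; substantially smaller than the CUDA wheel
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cpu
transformers
```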

Rubix982 commented 3 years ago

So I've spent time trying to figure out the jinaai/jina:latest docker image. I've found it is only the CLI, and I don't think it even has Python's base libraries; I could not run python with it. Below is a screenshot of what the docker image does when I run it.

[screenshot: output of running the jinaai/jina image]

I then decided to read the jina CLI documentation, but I did not find a way to bootstrap src/app.py to start the application.

Maybe I just could not find the exact thing I was looking for. There is a lot of content, and I'm not sure I extracted the right information.

One thing I was thinking of was creating an issue on jina-ai/jina about having an example repository as a playground for the CLI: many terms and methods are mentioned, and I was not sure why they are there, what purpose they serve, or how to use them against something. What do you think, @georgeamccarthy?

@fissoreg Since execution of the backend starts with src/app.py, I want to run python src/app.py to start the backend. Do you know the equivalent of this with the jina CLI?

I'm heading back to using the Python docker image and trying to fix some issues. The logs for these problems are as follows,

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81.0/81.0 [00:00<00:00, 40.4kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 112/112 [00:00<00:00, 47.3kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 86.0/86.0 [00:00<00:00, 33.5kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 361/361 [00:00<00:00, 141kB/s]
Downloading:  26%|β–ˆβ–ˆβ–Œ       | 431M/1.68G [03:56<11:19, 1.84MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |            pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend |            pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend |  add "--quiet-error" to suppress the exception details
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend |     yield
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend |     data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend |     n = self.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend |     n = self.fp.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend |     return self._sock.recv_into(b)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend |     return self.read(nbytes, buffer)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend |     return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 758, in generate
protein-search-backend |     for chunk in self.raw.stream(chunk_size, decode_content=True):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
protein-search-backend |     data = self.read(amt=amt, decode_content=decode_content)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
protein-search-backend |     raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
protein-search-backend |   File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
protein-search-backend |     self.gen.throw(type, value, traceback)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
protein-search-backend |     raise ProtocolError("Connection broken: %r" % e, e)
protein-search-backend | urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1249, in from_pretrained
protein-search-backend |     resolved_archive_file = cached_path(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1363, in cached_path
protein-search-backend |     output_path = get_from_cache(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
protein-search-backend |     http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1485, in http_get
protein-search-backend |     for chunk in r.iter_content(chunk_size=1024):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 761, in generate
protein-search-backend |     raise ChunkedEncodingError(e)
protein-search-backend | requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 87, in _load_executor
protein-search-backend |     self._executor = BaseExecutor.load_config(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 553, in load_config
protein-search-backend |     return JAML.load(tag_yml, substitute=False)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 89, in load
protein-search-backend |     r = yaml.load(stream, Loader=JinaLoader)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/__init__.py", line 114, in load
protein-search-backend |     return loader.get_single_data()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 51, in get_single_data
protein-search-backend |     return self.construct_document(node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 55, in construct_document
protein-search-backend |     data = self.construct_object(node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 100, in construct_object
protein-search-backend |     data = constructor(self, node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 426, in _from_yaml
protein-search-backend |     return get_parser(cls, version=data.get('version', None)).parse(cls, data)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/parsers/executor/legacy.py", line 69, in parse
protein-search-backend |     obj = cls(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/executors/decorators.py", line 65, in arg_wrapper
protein-search-backend |     f = func(self, *args, **kwargs)
protein-search-backend |   File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend |     model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend |     raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend | 
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend | 
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
protein-search-backend | 
protein-search-backend | 
protein-search-backend | 
protein-search-backend | The above exception was the direct cause of the following exception:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 78, in run
protein-search-backend |     runtime = runtime_cls(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 59, in __init__
protein-search-backend |     self._load_executor()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 104, in _load_executor
protein-search-backend |     raise ExecutorFailToLoad from ex
protein-search-backend | jina.excepts.ExecutorFailToLoad
protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f49b6e16340> can not be started due to RuntimeFailToStart(), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "src/app.py", line 40, in <module>
protein-search-backend |     main()
protein-search-backend |   File "src/app.py", line 33, in main
protein-search-backend |     with flow:
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend |     return self.start()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend |     v.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend |     p.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 276, in wait_start_success
protein-search-backend |     raise RuntimeFailToStart
protein-search-backend | jina.excepts.RuntimeFailToStart
Downloading:   2%|▏         | 34.9M/1.68G [00:19<18:03, 1.52MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |            pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend |            pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend |  add "--quiet-error" to suppress the exception details
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend |     yield
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend |     data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend |     n = self.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend |     n = self.fp.readinto(b)
protein-search-backend |   File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend |     return self._sock.recv_into(b)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend |     return self.read(nbytes, buffer)
protein-search-backend |   File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend |     return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 758, in generate
protein-search-backend |     for chunk in self.raw.stream(chunk_size, decode_content=True):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
protein-search-backend |     data = self.read(amt=amt, decode_content=decode_content)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
protein-search-backend |     raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
protein-search-backend |   File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
protein-search-backend |     self.gen.throw(type, value, traceback)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
protein-search-backend |     raise ProtocolError("Connection broken: %r" % e, e)
protein-search-backend | urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1249, in from_pretrained
protein-search-backend |     resolved_archive_file = cached_path(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1363, in cached_path
protein-search-backend |     output_path = get_from_cache(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
protein-search-backend |     http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1485, in http_get
protein-search-backend |     for chunk in r.iter_content(chunk_size=1024):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 761, in generate
protein-search-backend |     raise ChunkedEncodingError(e)
protein-search-backend | requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | 
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 87, in _load_executor
protein-search-backend |     self._executor = BaseExecutor.load_config(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 553, in load_config
protein-search-backend |     return JAML.load(tag_yml, substitute=False)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 89, in load
protein-search-backend |     r = yaml.load(stream, Loader=JinaLoader)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/__init__.py", line 114, in load
protein-search-backend |     return loader.get_single_data()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 51, in get_single_data
protein-search-backend |     return self.construct_document(node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 55, in construct_document
protein-search-backend |     data = self.construct_object(node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 100, in construct_object
protein-search-backend |     data = constructor(self, node)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 426, in _from_yaml
protein-search-backend |     return get_parser(cls, version=data.get('version', None)).parse(cls, data)
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/parsers/executor/legacy.py", line 69, in parse
protein-search-backend |     obj = cls(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/executors/decorators.py", line 65, in arg_wrapper
protein-search-backend |     f = func(self, *args, **kwargs)
protein-search-backend |   File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend |     model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend |     raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend | 
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend | 
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
protein-search-backend | 
protein-search-backend | 
protein-search-backend | 
protein-search-backend | The above exception was the direct cause of the following exception:
protein-search-backend | 
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 78, in run
protein-search-backend |     runtime = runtime_cls(
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 59, in __init__
protein-search-backend |     self._load_executor()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 104, in _load_executor
protein-search-backend |     raise ExecutorFailToLoad from ex
protein-search-backend | jina.excepts.ExecutorFailToLoad
protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f5213e3ab80> can not be started due to RuntimeFailToStart(), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "src/app.py", line 40, in <module>
protein-search-backend |     main()
protein-search-backend |   File "src/app.py", line 33, in main
protein-search-backend |     with flow:
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend |     return self.start()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend |     v.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend |     p.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 276, in wait_start_success
protein-search-backend |     raise RuntimeFailToStart
protein-search-backend | jina.excepts.RuntimeFailToStart
protein-search-backend exited with code 1

Notice the keywords ProtBertExecutor and pod0, and these log lines,

protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend | 
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

...

protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f49b6e16340> can not be started due to RuntimeFailToStart(), Flow is aborted

...

Downloading:   2%|▏         | 34.9M/1.68G [00:19<18:03, 1.52MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |            pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend |            pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend |  add "--quiet-error" to suppress the exception details
Rubix982 commented 3 years ago

Please see this comment for an explanation of why these logs might be appearing.

Rubix982 commented 3 years ago

@fissoreg @georgeamccarthy just a completely crazy idea here, but imagine a tool that made dependencies easier to consume. Say the torch dependency is used internally by transformers - that's all fine until torch, with everything it bundles, grows to this gigantic size. Not everyone has the internet bandwidth to download it for a single dependency, nor is it fast to set up on their machine, especially when I'm only using the subset of torch that transformers actually needs. What if there were a way, or an architectural/structural idea, that let us retrieve only the part of the codebase that is useful and leave everything else behind?

In some ideal world, that would be a much lighter installation, far faster to set up and install, and it would also encourage keeping everything organized in teams so two people don't cross over into each other's work.

Rubix982 commented 3 years ago

Finally arriving at newer logs without any connection loss,

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81.0/81.0 [00:00<00:00, 35.7kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 112/112 [00:00<00:00, 44.3kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 86.0/86.0 [00:00<00:00, 32.5kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 361/361 [00:00<00:00, 124kB/s]
Downloading:  64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1.07G/1.68G [09:45<05:46, 1.77MB/s]           pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend |            pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading:  66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1.10G/1.68G [29:52<05:10, 1.87MB/s]           pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend |            pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
protein-search-backend |            pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f15096275b0> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "src/app.py", line 43, in <module>
protein-search-backend |     main()
protein-search-backend |   File "src/app.py", line 36, in main
protein-search-backend |     with flow:
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend |     return self.start()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend |     v.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend |     p.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend |     raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
Downloading:  64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1.08G/1.68G [09:49<05:03, 2.00MB/s]           pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend |            pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading:  67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1.13G/1.68G [10:18<04:56, 1.86MB/s]           pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend |            pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
protein-search-backend |            pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f5840f3ff70> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "src/app.py", line 43, in <module>
protein-search-backend |     main()
protein-search-backend |   File "src/app.py", line 36, in main
protein-search-backend |     with flow:
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend |     return self.start()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend |     v.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend |     p.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend |     raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
protein-search-backend exited with code 1
Downloading:  63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1.06G/1.68G [09:49<06:46, 1.54MB/s]           pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend |            pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.68G/1.68G [15:41<00:00, 1.79MB/s]
protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
protein-search-backend |            Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f14434f9f70> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend |   File "src/app.py", line 43, in <module>
protein-search-backend |     main()
protein-search-backend |   File "src/app.py", line 36, in main
protein-search-backend |     with flow:
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend |     return self.start()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend |     v.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend |     p.wait_start_success()
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend |     raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
protein-search-backend exited with code 1
protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
protein-search-backend |            pod0@ 1[L]:ready and listening
protein-search-backend |            pod1@ 1[L]:ready and listening
protein-search-backend |         gateway@ 1[L]:ready and listening
protein-search-backend |            Flow@ 1[I]:πŸŽ‰ Flow is ready to use!
protein-search-backend |        πŸ”— Protocol:            HTTP
protein-search-backend |        🏠 Local access:        0.0.0.0:8020
protein-search-backend |        πŸ”’ Private network:     192.168.16.2:8020
protein-search-backend |        πŸ’¬ Swagger UI:          http://localhost:8020/docs
protein-search-backend |        πŸ“š Redoc:               http://localhost:8020/redoc

So my earlier assumption was correct: ProtBertExecutor is trying to download the Rostlab/prot_bert pre-trained model before it does anything else, hence the timeout. Notice the line,

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.68G/1.68G [15:41<00:00, 1.79MB/s]

After this, the errors start to go away on their own.
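As the warning in the log suggests, one stop-gap while debugging (an assumption on my part, untested) is to raise the readiness timeout on the pod that does the download, something like,

```python
from jina import Flow

# Sketch only: timeout_ready should correspond to the --timeout-ready flag
# named in the warning above (milliseconds; -1 to wait indefinitely).
# ProtBertExecutor and MyIndexer are the executors from executors.py.
flow = (
    Flow(port_expose=8020, protocol='http')
    .add(uses=ProtBertExecutor, timeout_ready=-1)
    .add(uses=MyIndexer)
)
```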

To solve this, I can make a quick static function that downloads the model before we execute the below lines,

    flow = (
        Flow(port_expose=8020, protocol='http')
        .add(uses=ProtBertExecutor)
        .add(uses=MyIndexer)
    )

And it should then work as expected.
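A minimal sketch of that pre-download step, assuming the executor loads the model through transformers (the function name is my own),

```python
from transformers import BertModel, BertTokenizer

def prefetch_prot_bert():
    """Warm the local Hugging Face cache before the Flow starts, so the
    pod's readiness timeout is not spent on the 1.68G download."""
    BertTokenizer.from_pretrained("Rostlab/prot_bert")
    BertModel.from_pretrained("Rostlab/prot_bert")
```

Both calls should be near-instant on subsequent runs, because transformers caches the files it downloads.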

fissoreg commented 3 years ago

So I've spent some time trying to figure out the jinaai/jina:latest Docker image. I found that its entrypoint exposes only the CLI, and at first I thought it might not even have Python's base libraries, since I could not run python through it.

I think this is the Dockerfile for jina:latest. It's a stripped-down Arch installation that provides Python 3.7 and Jina. You can run it like this:

docker run -it --entrypoint /bin/bash  jinaai/jina:latest

and you will have access to the shell. I tried it and you can run python on it.

Anyways, if you use it as a base image, it should behave the same as the python slim-buster image, with the advantage that you have a well-configured Jina installation already available.

@fissoreg Since the execution of the backend starts with src/app.py, I wish to run python src/app.py to start the backend. Do you know the equivalent of this with the jina cli?

You might have to reset the ENTRYPOINT and play with the CMD in the Dockerfile. Then you can start the container and the backend will automatically be run. No need for the jina cli, I think. Something similar to (untested):

ENTRYPOINT []
CMD [ "python", "src/app.py" ] 
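Putting the two suggestions together, the backend Dockerfile could look roughly like this (untested sketch; the requirements file name and paths are my assumptions),

```dockerfile
FROM jinaai/jina:latest

WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Clear the jina CLI entrypoint so CMD runs the backend directly.
ENTRYPOINT []
CMD ["python", "src/app.py"]
```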
fissoreg commented 3 years ago

Great investigation!

To solve this, I can make a quick static function that downloads the model before we execute the below lines,

    flow = (
        Flow(port_expose=8020, protocol='http')
        .add(uses=ProtBertExecutor)
        .add(uses=MyIndexer)
    )

There's no need to build a Flow to run the executor, we can just instantiate ProtBertExecutor directly.

Anyway, we should think about how to go about it. Do we re-download the pretrained model every time the backend runs, or should we cache it in some way into the Docker image?

Rubix982 commented 3 years ago

I'm testing it out right now to see what the flow of that would be.

In principle, Rostlab/prot_bert only needs to be downloaded once into the container, and it can be reused until we forcefully delete the container's volume-mapped space.

However, this is where the problem is. If we make the downloading of the pre-trained model part of the backend build and execution, it will still have to be done every time it is built from scratch.

An easy solution would be to write a small script that,

  1. Downloads the model from https://huggingface.co/ into some local directory on the user's machine
  2. Moves it to a directory within the backend

We can add a line in .gitignore to ignore this because pushing such a large model onto GitHub is impractical.

With this, we can simply give the from_pretrained function the path of the model. Remember the logs above that mentioned these lines,

protein-search-backend |     model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend |   File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend |     raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend | 
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend | 
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

Specifically, the "or 'Rostlab/prot_bert' is the correct path to a directory ..." which seems to indicate that from_pretrained is alright with being given a local directory path.

fissoreg commented 3 years ago

I'm testing it out right now to see what the flow of that would be.

By definition, Rostlab/prot_bert only needs to be downloaded once into the container and it can reused until we forcefully delete the container's volume mapped space.

However, this is where the problem is. If we make the downloading of the pre-trained model part of the backend build and execution, it will still have to be done every time it is built from scratch.

What do you mean by ...every time it is built from scratch.? If you mean the build step, then I don't think that's a problem (rather, a good solution!). I would avoid writing scripts to manage the download when the from_pretrained function already does everything (and probably more reliably than whatever we would end up with, unless we put a lot of effort into it).

Rubix982 commented 3 years ago

Yes, I meant the build steps.

HuggingFace provides ways to get the model from their website. As mentioned here under the "Use in Transformers" button, there are instructions for working with the model repository,

git lfs install
git clone https://huggingface.co/Rostlab/prot_bert

# if you want to clone without large files - just their pointers -
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Rostlab/prot_bert
georgeamccarthy commented 3 years ago

HuggingFace provides ways to get the model from their website. As mentioned here under the button of "Use In Transformers", there are instructions for dealing with the model repository,

I like this solution! I think it would be nice to use a local copy of the model. Suggestion: in the Dockerfile, clone the model and modify the code to use the local copy. Also include a check for the local model, so that if it is not found (e.g. because protein_search has been cloned from GitHub instead of pulled from Docker Hub), ProtBert is fetched over the net.
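A sketch of that fallback check (the ./models/prot_bert path is an assumption; from_pretrained accepts either a local directory or the hub identifier),

```python
import os
from transformers import BertModel

def load_prot_bert(local_dir="./models/prot_bert"):
    # Prefer the copy baked into the Docker image; otherwise fall back
    # to downloading from huggingface.co, which transformers then caches.
    if os.path.isdir(local_dir):
        return BertModel.from_pretrained(local_dir)
    return BertModel.from_pretrained("Rostlab/prot_bert")
```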

fissoreg commented 3 years ago

HuggingFace provides ways to get the model from their website. As mentioned here under the button of "Use In Transformers", there are instructions for dealing with the model repository,

Sure, but then why don't we stick to using the transformers library as we are currently doing? It's maintained by Huggingface and it does exactly what we are discussing in this issue (i.e. download the model and cache it for future use). Cloning and managing the repo is more laborious and more prone to all kinds of problems (versioning, download, path management, caching...). Writing code to manage all that means reinventing the wheel. Unless we have a very good reason to deal with the repository directly, I would avoid it.

georgeamccarthy commented 3 years ago

Had a chat to Gian and the summary of our thoughts was: please try and implement a local caching of the model using the transformers library. https://huggingface.co/transformers/

We can load the model and then save it into a local dir using transformers.PreTrainedModel.save_pretrained() and add that to the Docker.

from transformers import BertModel
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.save_pretrained('./models/prot_bert')
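Presumably the tokenizer needs the same treatment (an assumption on my side; the executor uses both the model and the tokenizer),

```python
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert")
tokenizer.save_pretrained('./models/prot_bert')
```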
Rubix982 commented 3 years ago

Noted, @georgeamccarthy, @fissoreg. Thanks for your time today, and for your patience with this PR! It's turning out to be a bit more challenging than I initially realized, but I'm having fun and learning a completely new technology (Jina AI) as well.

I'll be taking the approach as mentioned above and will get back to you guys if I run into any problems.

Thanks again, as always!

georgeamccarthy commented 3 years ago

No worries it's been really interesting and we're grateful for your continued efforts! Soon to be a contributor no doubt 😎

Rubix982 commented 3 years ago

Another log I came across was this,

protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']

protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

I believe we are in the "This IS expected" group, right?

fissoreg commented 3 years ago

I'm not sure it is expected, but it is definitely something we noticed before and it doesn't pose problems in running our model. So we just ignored it for now...

Rubix982 commented 3 years ago

I see. We may need benchmarks in the future if we ever want to look into getting the most out of those unused weights.

Rubix982 commented 3 years ago

Gentlemen, the caching works and is implemented. Sending a request from my local machine to the container gets me back a gigantic response,

image

I ran the below code in my local environment,

import requests as r
res = r.post('http://backend:8020/search', headers={'Content-Type': 'application/json'}, data='{"data": [{"text": "AETCZAO" }]}' )

Three bugs I can think of right now,

fissoreg commented 3 years ago

I don't see the new code, I guess you didn't push it?

3. I would like to initialize `self.model` and `self.tokenizer` before running `.add(uses=ProtBertExecutor)`. This would solve the problem where, when the user builds the images for the first time, `jina` hits a timeout because the `Executor` is still downloading the model and the tokenizer (though `jina` is great at restoring itself once the files finish downloading). An easy solution might be a static function that initializes the two variables, so we would essentially be moving the logic of `__init__` into a static function

I thought this was the point of @georgeamccarthy's comment above, where he proposes to use:

from transformers import BertModel
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.save_pretrained('./models/prot_bert')

When to call those lines is still a problem. Either during the Docker image build, or at some point during deployment. For now we can be ok with either (at a certain point, we will want to integrate all of this in some kind of initialisation that also computes the embeddings and does the indexing).

Rubix982 commented 3 years ago

@fissoreg, just pushed.

I thought this was the point of @georgeamccarthy's comment above, where he proposes to use:

This is done and implemented with caching logic introduced in ProtBertExecutor.__init__

When to call those lines is still a problem. Either during the Docker image build, or at some point during deployment. For now we can be ok with either (at a certain point, we will want to integrate all of this in some kind of initialisation that also computes the embeddings and does the indexing).

We can do it before the Flow starts. This way, it will make sure that the model+tokenizer exist - if they do not, fetch and cache them - and then we can go on with the flow as usual. The Docker image takes care of providing a way that lets the model+tokenizer exist locally as well as in the backend container, so the question of how is solved, and what remains is the question of when to check if those exist or not.

The suggested change requires a static function that can be called just before the Flow is built, giving us something like

def main():
    url = dataset_url
    pdb_data_path = protein_path

    with load_or_download(url, pdb_data_path) as data_file:
        docs_generator = from_csv(
            fp=data_file, field_resolver={"sequence": "text", "structureId": "id"}
        )
        proteins = DocumentArray(docs_generator)[0:42]

    ProtBertExecutor.initialize_executor()

    flow = (
        Flow(port_expose=8020, protocol="http")
        .add(uses=ProtBertExecutor)
        .add(uses=MyIndexer)
    )

    with flow:
        if not os.path.exists(embeddings_path):
            flow.index(proteins)
        flow.block()

The static function, initialize_executor, is then responsible for setting self.model and self.tokenizer.

A part of this is the idea of removing all of the logic from __init__ and moving it to the static method, initialize_executor, which means the __init__ will essentially be empty.

Implemented!

So I removed all the cache, the results, data, model, tokenizer, embeddings, and started make docker with no cache.

On the first round of running make docker, the implementation gives the below logs. No more errors from jina, hurrah. :partying_face: :tada:

image

On the second round of running make up, the implementation generates the logs,

image

Rubix982 commented 3 years ago

And the frontend works as expected now. :rocket:

image

Rubix982 commented 3 years ago

Finalized PR. Help needed in resolving merge conflicts.

Rubix982 commented 3 years ago

@georgeamccarthy @fissoreg do you guys think this is good to go? We can maybe have a 1-to-1 over how to deal with all the conflicting files together. :+1:

fissoreg commented 3 years ago

@Rubix982 this all looks great, hope to merge as soon as possible!

I merged on my side and resolved conflicts, but I'm having some troubles with the backend. I'm gonna investigate that and commit the merge here, probably on Monday. Thanks for the effort, and enjoy the weekend! ;)

georgeamccarthy commented 3 years ago

Great stuff guys love it! @Rubix982 do you agree to the terms of the contributing agreement? Have a quick read

Rubix982 commented 3 years ago

I agree to the contributing agreement, yes. Do I have to make a signature somewhere?

Also, if you guys plan on enforcing the contributing agreement, look into the CLA Assistant, which blocks PRs until the author signs. It's similar to how Facebook blocks your PR on their repositories until you have signed the Facebook CLA agreement.

georgeamccarthy commented 3 years ago

That's cracking and thanks for the link! No signature needed 😎

Rubix982 commented 3 years ago

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.
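To illustrate the idea with a dependency-free sketch (a stdlib-only stand-in for python-dotenv's load_dotenv; the function name is mine),

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv(): export KEY=VALUE
    pairs from a .env file so executors.py and the Dockerfile can read the
    same values from one place."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines, comments, and malformed entries
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

With python-dotenv installed, this whole function collapses to from dotenv import load_dotenv; load_dotenv().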

fissoreg commented 3 years ago

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

The merge is here: https://github.com/Rubix982/protein_search/pull/1 There are some problems, but you can merge that PR so we have the merge commit here and we can keep the discussion flowing. Have a look before merging, to see if all the Dockerfile stuff looks good to you.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.

Ok thanks @Rubix982! Maybe .env files are a good option. If you have other ideas, let's discuss. The important point is that the various paths should be parametrized only once, in one place.

Rubix982 commented 3 years ago

This is weird. I did not get a notification on my own repository. Thanks.

I'll try to implement the .env method on my repository and get back to you.

fissoreg commented 3 years ago

I'll try to implement the .env method on my repository and get back to you.

This is great, thanks! But let's make that into a different PR, maybe? So we can merge this one ASAP.

Rubix982 commented 3 years ago

After making the changes suggested by Cristian on Slack, I am able to finally get results from the endpoint /search.

image

The changes have been made and pushed to my fork.

After the fixes, the Streamlit application throws these errors,

image

These errors are thrown from line 95 of frontend/app.py,

# Execute the query on the transport
result = client.execute(query, variable_values={"ids": ids})

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

georgeamccarthy commented 3 years ago

Agreed, let's merge and move forward.

fissoreg commented 3 years ago

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

I didn't get this error but yes, let's merge and move on. We will also need to fix the automated tests.

Great job @Rubix982 !