Closed — Rubix982 closed this issue 3 years ago
@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file.
These are the logs that the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach but throws a TimeoutError for, referring to the message below:

```
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
```
Other than that, the frontend still cannot establish a connection.
Also, a folder called embeddings is created under backend/ that has a protein.json in it. Should I add embeddings to the .gitignore?
Notice in the 5th download line, it is pulling something 1.68G in size (scary :fearful: ).
Checked. So I deleted the containers entirely, removed embeddings and data/pdb_data_seq.csv, then built the containers from scratch. The backend still pulls them, as seen in the screenshot below. We can either work out why this happens, or simply add this to .gitignore as well. Also, jina always downloads 4 things but doesn't mention what they are, as seen in the screenshot below.

After this log output, the container spends exactly 10 minutes (600000 ms) trying to reach a peapod, as shown in the screenshot in the previous comment.
> @georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file.
The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?
> These are the logs that the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach but throws a TimeoutError for, referring to the message below:
>
> protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted

Not really sure what a Peapod is, but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue, #31, which I think might be related. Not sure what to suggest for now; I'm looking into it. It's possible that it's unrelated to Docker.
> Other than that, the frontend still cannot establish a connection.
The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.
> Also, a folder called embeddings is created under backend/ that has a protein.json in it. Should I add embeddings to the .gitignore?
>
> https://user-images.githubusercontent.com/41635766/126763589-504ae2ce-6120-492c-8e4c-8b55418cb069.png
>
> Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨).
😱 Yes please! Currently we're having the host computer compute the embeddings on first run.
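For reference, the entries could look like this; the paths are assumed from this thread and taken relative to the repo root:

```gitignore
backend/embeddings/
data/pdb_data_seq.csv
```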
> The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

Yep!

> Not really sure what a Peapod is, but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue, #31, which I think might be related. Not sure what to suggest for now; I'm looking into it. It's possible that it's unrelated to Docker.

Interesting.

> The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Noted.

> Yes please! Currently we're having the host computer compute the embeddings on first run.

Noted.
Some general remarks:

- adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the doc.
- as base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user.
- Dockerfiles for backend and frontend are pretty similar, they could be joined.

Cool stuff @Rubix982, welcome to the project! ;)
> adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the doc.
The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try pip install "jina[client,http]", but for me it was not caching, which means on every container build it downloads those dependencies from scratch; that seemed like wasted bandwidth to me.
Python and Docker have this weird interaction: if I install a dependency with the command RUN pip install pkg, it does not get cached. But if I add a requirements.txt with the package name AND version specified, it caches (I'm not sure what magic this is).

I'll try it again in some time to see if this indeed takes care of the caching the way I want it to. Docker best practices frown on reinstalling dependencies over and over again; this is a big no-no in distributed computing.
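For reference, the layer-caching pattern in question looks roughly like this; this is a sketch only, not our actual Dockerfile, and the paths are illustrative:

```dockerfile
# Copy only the dependency manifest first. Docker keys this layer's
# cache on the checksum of requirements.txt, so the pip install
# below is re-run only when requirements.txt itself changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source afterwards, so code edits do not invalidate the
# dependency layer above.
COPY src/ ./src/
```

A plain `RUN pip install pkg` is cached too, but it is invalidated whenever any earlier layer changes (for example a `COPY . .` placed before it), which might be the "magic" being observed here.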
> as base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user.
I tried this ... and the Docker image only offers some jina-related commands, and I'm not familiar with the CLI that jina offers. I took a quick 10-15 minute look at it, but it did not seem to do what I had in mind for this PR.

By default, Docker creates the container and enters as root. More documentation reading required here.
> Dockerfiles for backend and frontend are pretty similar, they could be joined.
For this, I need to know where you guys want me to remove the data folder entirely.
As homework for me, I have to look into:
@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. :+1:
In addition to some reading you might find the Jina slack helpful, they really welcome discussion on anything from simple to advanced. I've found it super helpful! :) http://slack.jina.ai
> The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing pip install jina[client,http], but for me, it was not caching. Which means on every container build, it downloads those dependencies from scratch, which seemed like wasted bandwidth for me.
I would have said that wasted bandwidth is better than managing dependencies manually...
> I'll try it again in some time if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again. This is a big no-no in distributed computing.
...but this is a convincing remark!
Anyways, the following is strange:
> Python and Docker have this weird thing that if I install any dependency with the command RUN pip install pkg - this does not get cached. But if I add a requirements.txt with the package name AND version specified, it caches (I'm not sure what magic this is).
So I think that the best thing would be to try to understand what is happening there and fix it. Let me add, as @gmelodie would say: "Research time is not wasted time".
> @fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. :+1:
I hope this will be helpful! :)
What's the status on this @Rubix982 @fissoreg? :)
@georgeamccarthy I was not able to work from Mon-Wed because of technical difficulties on my end. There are some pending tasks that I have to get to before this Sunday: some issues to wrap up at @mapillary, and some other tasks here and there.
Is it alright if I come back to this on Sunday?
Great! No rush, let us know if you need help. You are welcome to join our standup 30 minutes before the MLH standup on Monday.
Great, thanks! Catch you guys in your standup, then. :D
@georgeamccarthy @fissoreg the torch dependency is 831.4 MB, and is only used once in the entire jina backend, in the following code in the file my_executors.py:
```python
def encode_batch(self, docs: DocumentArray, **kwargs) -> DocumentArray:
    log('Preprocessing.')
    sequences = self.preprocessing(docs.get_attributes("text"))
    log('Tokenizing')
    encoded_inputs = self.tokenizer(
        sequences,
        padding=True,
        max_length=max(sequences, key=len),
        return_tensors="pt",
    )
    with torch.no_grad():
        log('Computing embeddings.')
        outputs = self.model(**encoded_inputs)
        log('Getting last hidden state.')
        embeds = outputs.last_hidden_state[:, 0, :].detach().numpy()
    for doc, embed in zip(docs, embeds):
        log(f'Getting embedding {doc.id}')
        doc.embedding = embed
    return docs
```
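Incidentally, one thing worth double-checking in the snippet above: the tokenizer's `max_length` is being passed `max(sequences, key=len)`, which evaluates to the longest sequence itself rather than its length. A quick stdlib illustration (the example sequences are made up):

```python
sequences = ["MKT", "MKTAYIAK", "MK"]

# max(..., key=len) returns the longest element, not its length.
longest = max(sequences, key=len)
assert longest == "MKTAYIAK"

# A length argument such as max_length would instead expect:
assert len(longest) == 8
```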
Can this be replaced with something more lightweight? It makes the container larger than 900 MB including the other dependencies, so it's not lightweight in terms of storage.
Gian knows more about this so cc @fissoreg
Unfortunately the torch dependency is used internally by the transformers library, so we need it. All of the neural processing is done with torch.
So I've spent time trying to figure out the jinaai/jina:latest docker image. I've found it is only the cli, and I don't think it even has Python's base libraries - I could not run python with it. Below is a screenshot of what the docker image does when I run it.
I then read the jina cli documentation, but I did not really find a way to bootstrap src/app.py to start the application.
Maybe I could not find the exact thing that I was looking for. There is a lot of content, and I'm not really sure if I could extract the right information.
One thing I was thinking of was creating an issue on jina-ai/jina about having an example repository as a playground for the CLI - there were many terms and methods mentioned, and I was not sure why they are there, what purpose they serve, or how to use them. What do you think, @georgeamccarthy?
@fissoreg Since the execution of the backend starts with src/app.py, I wish to run python src/app.py to start the backend. Do you know the equivalent of this with the jina cli?
I'm heading back to using the Python docker image and trying to fix some issues. The logs for these problems are as follows:
Downloading: 100%|██████████| 81.0/81.0 [00:00<00:00, 40.4kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 47.3kB/s]
Downloading: 100%|██████████| 86.0/86.0 [00:00<00:00, 33.5kB/s]
Downloading: 100%|██████████| 361/361 [00:00<00:00, 141kB/s]
Downloading:  26%|██▌       | 431M/1.68G [03:56<11:19, 1.84MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend | pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend | add "--quiet-error" to suppress the exception details
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend | yield
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend | data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend | File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend | n = self.readinto(b)
protein-search-backend | File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend | n = self.fp.readinto(b)
protein-search-backend | File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend | return self._sock.recv_into(b)
protein-search-backend | File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend | return self.read(nbytes, buffer)
protein-search-backend | File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend | return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 758, in generate
protein-search-backend | for chunk in self.raw.stream(chunk_size, decode_content=True):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
protein-search-backend | data = self.read(amt=amt, decode_content=decode_content)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
protein-search-backend | raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
protein-search-backend | File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
protein-search-backend | self.gen.throw(type, value, traceback)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
protein-search-backend | raise ProtocolError("Connection broken: %r" % e, e)
protein-search-backend | urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1249, in from_pretrained
protein-search-backend | resolved_archive_file = cached_path(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1363, in cached_path
protein-search-backend | output_path = get_from_cache(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
protein-search-backend | http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1485, in http_get
protein-search-backend | for chunk in r.iter_content(chunk_size=1024):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 761, in generate
protein-search-backend | raise ChunkedEncodingError(e)
protein-search-backend | requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 87, in _load_executor
protein-search-backend | self._executor = BaseExecutor.load_config(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 553, in load_config
protein-search-backend | return JAML.load(tag_yml, substitute=False)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 89, in load
protein-search-backend | r = yaml.load(stream, Loader=JinaLoader)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/__init__.py", line 114, in load
protein-search-backend | return loader.get_single_data()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 51, in get_single_data
protein-search-backend | return self.construct_document(node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 55, in construct_document
protein-search-backend | data = self.construct_object(node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 100, in construct_object
protein-search-backend | data = constructor(self, node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 426, in _from_yaml
protein-search-backend | return get_parser(cls, version=data.get('version', None)).parse(cls, data)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/parsers/executor/legacy.py", line 69, in parse
protein-search-backend | obj = cls(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/executors/decorators.py", line 65, in arg_wrapper
protein-search-backend | f = func(self, *args, **kwargs)
protein-search-backend | File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend | model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend | raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend |
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
protein-search-backend |
protein-search-backend |
protein-search-backend |
protein-search-backend | The above exception was the direct cause of the following exception:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 78, in run
protein-search-backend | runtime = runtime_cls(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 59, in __init__
protein-search-backend | self._load_executor()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 104, in _load_executor
protein-search-backend | raise ExecutorFailToLoad from ex
protein-search-backend | jina.excepts.ExecutorFailToLoad
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f49b6e16340> can not be started due to RuntimeFailToStart(), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "src/app.py", line 40, in <module>
protein-search-backend | main()
protein-search-backend | File "src/app.py", line 33, in main
protein-search-backend | with flow:
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend | return self.start()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend | v.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend | p.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 276, in wait_start_success
protein-search-backend | raise RuntimeFailToStart
protein-search-backend | jina.excepts.RuntimeFailToStart
Downloading:   2%|▏         | 34.9M/1.68G [00:19<18:03, 1.52MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend | pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend | add "--quiet-error" to suppress the exception details
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
protein-search-backend | yield
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
protein-search-backend | data = self._fp.read(amt) if not fp_closed else b""
protein-search-backend | File "/usr/local/lib/python3.8/http/client.py", line 459, in read
protein-search-backend | n = self.readinto(b)
protein-search-backend | File "/usr/local/lib/python3.8/http/client.py", line 503, in readinto
protein-search-backend | n = self.fp.readinto(b)
protein-search-backend | File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
protein-search-backend | return self._sock.recv_into(b)
protein-search-backend | File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
protein-search-backend | return self.read(nbytes, buffer)
protein-search-backend | File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
protein-search-backend | return self._sslobj.read(len, buffer)
protein-search-backend | ConnectionResetError: [Errno 104] Connection reset by peer
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 758, in generate
protein-search-backend | for chunk in self.raw.stream(chunk_size, decode_content=True):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
protein-search-backend | data = self.read(amt=amt, decode_content=decode_content)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
protein-search-backend | raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
protein-search-backend | File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
protein-search-backend | self.gen.throw(type, value, traceback)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
protein-search-backend | raise ProtocolError("Connection broken: %r" % e, e)
protein-search-backend | urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1249, in from_pretrained
protein-search-backend | resolved_archive_file = cached_path(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1363, in cached_path
protein-search-backend | output_path = get_from_cache(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
protein-search-backend | http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1485, in http_get
protein-search-backend | for chunk in r.iter_content(chunk_size=1024):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/requests/models.py", line 761, in generate
protein-search-backend | raise ChunkedEncodingError(e)
protein-search-backend | requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend |
protein-search-backend | During handling of the above exception, another exception occurred:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 87, in _load_executor
protein-search-backend | self._executor = BaseExecutor.load_config(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 553, in load_config
protein-search-backend | return JAML.load(tag_yml, substitute=False)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 89, in load
protein-search-backend | r = yaml.load(stream, Loader=JinaLoader)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/__init__.py", line 114, in load
protein-search-backend | return loader.get_single_data()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 51, in get_single_data
protein-search-backend | return self.construct_document(node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 55, in construct_document
protein-search-backend | data = self.construct_object(node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/yaml/constructor.py", line 100, in construct_object
protein-search-backend | data = constructor(self, node)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/__init__.py", line 426, in _from_yaml
protein-search-backend | return get_parser(cls, version=data.get('version', None)).parse(cls, data)
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/jaml/parsers/executor/legacy.py", line 69, in parse
protein-search-backend | obj = cls(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/executors/decorators.py", line 65, in arg_wrapper
protein-search-backend | f = func(self, *args, **kwargs)
protein-search-backend | File "/app/src/my_executors.py", line 24, in __init__
protein-search-backend | model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend | raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend |
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
protein-search-backend |
protein-search-backend |
protein-search-backend |
protein-search-backend | The above exception was the direct cause of the following exception:
protein-search-backend |
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 78, in run
protein-search-backend | runtime = runtime_cls(
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 59, in __init__
protein-search-backend | self._load_executor()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/runtimes/zmq/zed.py", line 104, in _load_executor
protein-search-backend | raise ExecutorFailToLoad from ex
protein-search-backend | jina.excepts.ExecutorFailToLoad
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f5213e3ab80> can not be started due to RuntimeFailToStart(), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "src/app.py", line 40, in <module>
protein-search-backend | main()
protein-search-backend | File "src/app.py", line 33, in main
protein-search-backend | with flow:
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend | return self.start()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend | v.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend | p.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 276, in wait_start_success
protein-search-backend | raise RuntimeFailToStart
protein-search-backend | jina.excepts.RuntimeFailToStart
protein-search-backend exited with code 1
Notice the keywords ProtBertExecutor and pod0, and the following logs:
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
...
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f49b6e16340> can not be started due to RuntimeFailToStart(), Flow is aborted
...
Downloading:   2%|▏         | 34.9M/1.68G [00:19<18:03, 1.52MB/s]("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
protein-search-backend | pod0@16[C]:can not load the executor from ProtBertExecutor
protein-search-backend | pod0@16[E]:ExecutorFailToLoad() during <class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> initialization
protein-search-backend | add "--quiet-error" to suppress the exception details
Please see this comment for an explanation of why these logs might be happening.
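As far as I know, neither jina nor transformers exposes a retry knob for this download, but a stopgap could be to wrap the flaky step in a generic retry-with-backoff helper. A minimal stdlib sketch; `fetch`, `download_with_retries`, and all names here are hypothetical, not part of jina or transformers:

```python
import time

def download_with_retries(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() until it succeeds, backing off exponentially.

    fetch is any zero-argument callable that raises on a broken
    connection (e.g. a wrapper around the weights download).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionResetError:
            if attempt == max_attempts:
                raise
            # Sleep base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With a stable-enough connection this would ride out the occasional `ConnectionResetError(104, 'Connection reset by peer')` seen above instead of aborting the Flow.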
@fissoreg @georgeamccarthy just a completely crazy idea here, but imagine if there was a tool that made it easy to use dependencies selectively. Say the torch dependency is used internally in transformers - that's all fine until torch, full of all these other things, grows to this gigantic size. Not everyone has the internet bandwidth to download it just for one dependency, nor is it fast to set up. Since transformers only uses a subset of all the functionality torch provides, what if there was a way, or an architectural/structural idea, that could help us retrieve only the part of the codebase that is actually used and leave everything else behind?

In some ideal world, that would be a much lighter installation, way faster to set up and install, and it would enforce ideas about keeping everything organized in teams so two people do not cross over into each other's work.
Finally arriving at newer logs, this time without any connection loss:
Downloading: 100%|██████████| 81.0/81.0 [00:00<00:00, 35.7kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 44.3kB/s]
Downloading: 100%|██████████| 86.0/86.0 [00:00<00:00, 32.5kB/s]
Downloading: 100%|██████████| 361/361 [00:00<00:00, 124kB/s]
Downloading: 64%|██████▍ | 1.07G/1.68G [09:45<05:46, 1.77MB/s] pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend | pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading: 66%|██████▌ | 1.10G/1.68G [29:52<05:10, 1.87MB/s] pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend | pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
protein-search-backend | pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f15096275b0> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "src/app.py", line 43, in <module>
protein-search-backend | main()
protein-search-backend | File "src/app.py", line 36, in main
protein-search-backend | with flow:
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend | return self.start()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend | v.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend | p.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend | raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
Downloading: 64%|██████▍ | 1.08G/1.68G [09:49<05:03, 2.00MB/s] pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend | pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading: 67%|██████▋ | 1.13G/1.68G [10:18<04:56, 1.86MB/s] pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend | pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
protein-search-backend | pod0@ 1[W]:Terminating process after waiting for readiness signal for graceful shutdown
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f5840f3ff70> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "src/app.py", line 43, in <module>
protein-search-backend | main()
protein-search-backend | File "src/app.py", line 36, in main
protein-search-backend | with flow:
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend | return self.start()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend | v.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend | p.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend | raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
protein-search-backend exited with code 1
Downloading: 63%|██████▎ | 1.06G/1.68G [09:49<06:46, 1.54MB/s] pod0@ 1[W]:<class 'jina.peapods.runtimes.zmq.zed.ZEDRuntime'> timeout after waiting for 600000ms, if your executor takes time to load, you may increase --timeout-ready
protein-search-backend | pod0@ 1[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Downloading: 100%|██████████| 1.68G/1.68G [15:41<00:00, 1.79MB/s]
protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f14434f9f70> can not be started due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'), Flow is aborted
protein-search-backend | Traceback (most recent call last):
protein-search-backend | File "src/app.py", line 43, in <module>
protein-search-backend | main()
protein-search-backend | File "src/app.py", line 36, in main
protein-search-backend | with flow:
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 919, in __enter__
protein-search-backend | return self.start()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/flow/base.py", line 964, in start
protein-search-backend | v.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 510, in wait_start_success
protein-search-backend | p.wait_start_success()
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 288, in wait_start_success
protein-search-backend | raise TimeoutError(
protein-search-backend | TimeoutError: jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms
protein-search-backend exited with code 1
protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
protein-search-backend | pod0@ 1[L]:ready and listening
protein-search-backend | pod1@ 1[L]:ready and listening
protein-search-backend | gateway@ 1[L]:ready and listening
protein-search-backend | Flow@ 1[I]:🎉 Flow is ready to use!
protein-search-backend | 🔗 Protocol: HTTP
protein-search-backend | 🏠 Local access: 0.0.0.0:8020
protein-search-backend | 🔒 Private network: 192.168.16.2:8020
protein-search-backend | 💬 Swagger UI: http://localhost:8020/docs
protein-search-backend | 📚 Redoc: http://localhost:8020/redoc
So my assumption was correct previously: `ProtBertExecutor` is trying to download the `Rostlab/prot_bert` pre-trained model before it does anything else, hence the timeout. Notice the line,
Downloading: 100%|██████████| 1.68G/1.68G [15:41<00:00, 1.79MB/s]
After this, the errors start to go away on their own.
To solve this, I can make a quick static function that downloads the model before we execute the below lines,
flow = (
Flow(port_expose=8020, protocol='http')
.add(uses=ProtBertExecutor)
.add(uses=MyIndexer)
)
And it should then work as expected.
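As a sketch of that idea (the helper name and injected callables here are illustrative, not the project's actual API): the caching logic only hits the network when the local cache is missing, which is exactly what `from_pretrained` plus `save_pretrained` give us.

```python
import os

def load_or_fetch(model_dir, fetch_remote, load_local):
    """Load a pretrained model from model_dir if cached there,
    otherwise download it once and persist it for future runs.

    fetch_remote / load_local are injected so the caching logic is
    testable; in practice they would wrap
    BertModel.from_pretrained("Rostlab/prot_bert") and
    BertModel.from_pretrained(model_dir) respectively.
    """
    if os.path.isdir(model_dir):
        return load_local(model_dir)   # cache hit: no network needed
    model = fetch_remote()             # cache miss: the one-time 1.68G download
    os.makedirs(model_dir, exist_ok=True)
    model.save_pretrained(model_dir)   # reuse on every later start
    return model
```

Called once before the Flow is built, this keeps the executor's startup fast enough to stay under `--timeout-ready`.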
So I've spent time trying to figure out the `jinaai/jina:latest` docker image. I've found it is only the `cli`, and I don't think it even has Python's base libraries; I could not run `python` with it.
I think this is the Dockerfile for `jina:latest`. It's a stripped-down Arch installation that provides Python 3.7 and Jina.
You can run it like this:
docker run -it --entrypoint /bin/bash jinaai/jina:latest
and you will have access to the shell. I tried it and you can run `python` on it.
Anyways, if you use it as a base image, it should behave the same as the python `slim-buster` image, with the advantage that a well-configured Jina installation is already available.
@fissoreg Since the execution of the backend starts with `src/app.py`, I wish to run `python src/app.py` to start the backend. Do you know the equivalent of this with the `jina` cli?
You might have to reset the `ENTRYPOINT` and play with the `CMD` in the `Dockerfile`. Then you can start the container and the backend will automatically be run. No need for the `jina` cli, I think. Something similar to (untested):
ENTRYPOINT ""
CMD [ "python", "src/app.py" ]
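For context, a fuller (equally untested) sketch of how those two directives could sit in the backend's Dockerfile; the `WORKDIR` and `COPY` paths are assumptions based on the repo layout discussed in this thread, not the actual file:

```dockerfile
FROM jinaai/jina:latest

WORKDIR /workspace
COPY requirements.txt .
RUN pip install --user -r requirements.txt
COPY src/ src/

# Reset the jina CLI entrypoint so CMD runs the backend directly
ENTRYPOINT []
CMD ["python", "src/app.py"]
```

With an empty `ENTRYPOINT`, `docker run` executes the `CMD` as-is instead of passing it as arguments to the `jina` CLI.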
Great investigation!
To solve this, I can make a quick static function that downloads the model before we execute the below lines,
flow = (
    Flow(port_expose=8020, protocol='http')
    .add(uses=ProtBertExecutor)
    .add(uses=MyIndexer)
)
There's no need to build a `Flow` to run the executor; we can just instantiate `ProtBertExecutor` directly.
Anyways, we should think about how to go about it. Do we re-download the pretrained model every time the backend is run, or should we cache it in some way in the Docker image?
I'm testing it out right now to see what the flow of that would be.
By definition, `Rostlab/prot_bert` only needs to be downloaded once into the container, and it can be reused until we forcefully delete the container's volume-mapped space.
However, this is where the problem is. If we make the downloading of the pre-trained model part of the backend build and execution, it will still have to be done every time the image is built from scratch.
An easy solution would be to write a small script that:
1. downloads the model from https://huggingface.co/ into some local directory on the user's machine
2. makes it available to the `backend`
We can add a line in `.gitignore` to ignore this, because pushing such a large model onto GitHub is impractical.
With this, we can simply give the `from_pretrained` function the path of the model. Remember the logs above that mentioned these lines,
protein-search-backend | model = BertModel.from_pretrained("Rostlab/prot_bert")
protein-search-backend | File "/home/jina/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
protein-search-backend | raise EnvironmentError(msg)
protein-search-backend | OSError: Can't load weights for 'Rostlab/prot_bert'. Make sure that:
protein-search-backend |
protein-search-backend | - 'Rostlab/prot_bert' is a correct model identifier listed on 'https://huggingface.co/models'
protein-search-backend |
protein-search-backend | - or 'Rostlab/prot_bert' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
Specifically, the "or 'Rostlab/prot_bert' is the correct path to a directory ..." part seems to indicate that `from_pretrained` is fine with being given a local directory path.
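That error message also tells us exactly which files `from_pretrained` expects in a local directory, so a cheap pre-flight check is possible (the file names are taken from the OSError above; the helper name is made up):

```python
import os

# Weight-file names from_pretrained accepts, per the OSError above
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5", "model.ckpt")

def looks_like_model_dir(path):
    """Cheap check that `path` is a directory from_pretrained could load."""
    return os.path.isdir(path) and any(
        os.path.isfile(os.path.join(path, name)) for name in WEIGHT_FILES
    )
```

A check like this lets the backend decide between loading locally and downloading from the Hub before any heavy work starts.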
I'm testing it out right now to see what the flow of that would be.
By definition, `Rostlab/prot_bert` only needs to be downloaded once into the container and it can be reused until we forcefully delete the container's volume-mapped space. However, this is where the problem is. If we make the downloading of the pre-trained model part of the backend build and execution, it will still have to be done every time it is built from scratch.
What do you mean by "...every time it is built from scratch"? If you mean the build step, then I don't think that's a problem (rather a good solution!). I would avoid writing scripts to manage the download when the `from_pretrained` function does everything already (and probably more reliably than what we could end up with, if we don't put a lot of effort into it).
Yes, I meant the build steps.
HuggingFace provides ways to get the model from their website. As mentioned here under the button of "Use In Transformers", there are instructions for dealing with the model repository,
git lfs install
git clone https://huggingface.co/Rostlab/prot_bert
# if you want to clone without large files β just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
HuggingFace provides ways to get the model from their website. As mentioned here under the button of "Use In Transformers", there are instructions for dealing with the model repository,
I like this solution! I think it would be nice to use a local copy of the model. Suggestion: in the Dockerfile, clone the model and modify the code to use the local model. Also include a check for the local model, so that if it is not found (e.g. because protein_search has been cloned from GitHub instead of pulled from Docker Hub), ProtBert is fetched over the net.
HuggingFace provides ways to get the model from their website. As mentioned here under the button of "Use In Transformers", there are instructions for dealing with the model repository,
Sure, but then why don't we stick to using the `transformers` library as we are currently doing? It's maintained by Huggingface and it does exactly what we are discussing in this issue (i.e. download the model and cache it for future use). Cloning and managing the repo is more laborious and more prone to all kinds of problems (versioning, download, path management, caching...). Writing code to manage all that means reinventing the wheel. If we don't have a very good reason to deal with the repository directly, I would avoid it.
Had a chat with Gian and the summary of our thoughts was: please try to implement local caching of the model using the transformers library. https://huggingface.co/transformers/
We can load the model, save it into a local dir using transformers.PreTrainedModel.save_pretrained(), and add that to the Docker image.
from transformers import BertModel
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.save_pretrained('./models/prot_bert')
Noted, @georgeamccarthy, @fissoreg. Thanks for your time today! And for the patience on your end with this PR; this is turning out to be a bit more challenging than I initially realized, but I'm having fun and learning about a completely new technology (Jina AI) as well.
I'll be taking the approach as mentioned above and will get back to you guys if I run into any problems.
Thanks again, as always!
No worries it's been really interesting and we're grateful for your continued efforts! Soon to be a contributor no doubt π
Another log I came across was this,
protein-search-backend | Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
protein-search-backend | - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
protein-search-backend | - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
I believe we are in the "This IS expected" group, right?
I'm not sure it is expected, but it is definitely something we noticed before and it doesn't pose problems in running our model. So we just ignored it for now...
I see. Benchmarks will be needed in the future if there is ever a reason to look into how to make the best use of the weights.
Gentlemen, the caching works and is implemented. Sending a request from my local machine to the container gets back a gigantic response,
I ran the below code in my local environment,
import requests as r

res = r.post(
    'http://backend:8020/search',
    headers={'Content-Type': 'application/json'},
    data='{"data": [{"text": "AETCZAO"}]}',
)
Three bugs I can think of right now:
1. The `results` folder needs to be manually created right now. I'll just make a folder for that - FIXED
2. The `frontend` container cannot establish a connection with the `backend` container, but I can from my local host machine - FIXED
3. I would like to initialize `self.model` and `self.tokenizer` before running `.add(uses=ProtBertExecutor)`. This will solve the problem that, when the user is building the images for the first time, `jina` presents a timeout because the `Executor` is downloading the model and the tokenizer (though `jina` is great at restoring itself when the files finally finish downloading). An easy solution might be to write a static function that initializes the two variables, essentially moving the logic of `__init__` to a static function - FIXED

I don't see the new code, I guess you didn't push it?
3. I would like to initiate `self.model` and `self.tokenizer` before running `.add(uses=ProtBertExecutor)`. This will be solution to the problem of when the user is building the images for the first time for the cache, the `jina` presents a timeout because the `Executor` is downloading the model and the tokenizer, but `jina` is great at restoring itself when the files are finally finished downloading. An easy solution might be to write a static function that initializes the two variables, so we would be essentially moving the logic of `__init__` to a static function
I thought this was the point of @georgeamccarthy's comment above, where he proposes to use:
from transformers import BertModel
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.save_pretrained('./models/prot_bert')
When to call those lines is still a problem. Either during the Docker image build, or at some point during deployment. For now we can be ok with either (at a certain point, we will want to integrate all of this in some kind of initialisation that also computes the embeddings and does the indexing).
@fissoreg, just pushed.
I thought this was the point of @georgeamccarthy's comment above, where he proposes to use:
This is done and implemented with caching logic introduced in ProtBertExecutor.__init__
When to call those lines is still a problem. Either during the Docker image build, or at some point during deployment. For now we can be ok with either (at a certain point, we will want to integrate all of this in some kind of initialisation that also computes the embeddings and does the indexing).
We can do it before the Flow starts. This way, it will make sure the model + tokenizer exist, fetch and cache them if they do not, and then we can go on with the flow as usual. The Docker image takes care of providing a way for the model + tokenizer to exist locally as well as in the `backend` container, so the question of how is solved; what remains is the question of when to check whether they exist.
A suggested change will require a static function that can be called just before the `Flow`, so we get something like
def main():
    url = dataset_url
    pdb_data_path = protein_path

    with load_or_download(url, pdb_data_path) as data_file:
        docs_generator = from_csv(
            fp=data_file, field_resolver={"sequence": "text", "structureId": "id"}
        )
        proteins = DocumentArray(docs_generator)[0:42]

    ProtBertExecutor.initialize_executor()

    flow = (
        Flow(port_expose=8020, protocol="http")
        .add(uses=ProtBertExecutor)
        .add(uses=MyIndexer)
    )

    with flow:
        if not os.path.exists(embeddings_path):
            flow.index(proteins)
        flow.block()
This is responsible for setting `self.model` and `self.tokenizer`. Part of this is the idea of removing all of the logic from `__init__` and moving it to the static method `initialize_executor`, which means `__init__` will essentially be empty.
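As an illustrative sketch of that split (this is not the real `ProtBertExecutor`; the loader is injected so the shape is testable without the 1.68G download):

```python
class ExecutorSketch:
    """Heavy state lives at class level and is filled by a class-level
    initializer called once before the Flow is built, so __init__ is cheap."""
    model = None
    tokenizer = None

    @classmethod
    def initialize_executor(cls, load):
        # Idempotent: calling it again must not trigger another download.
        # In practice `load` would fetch/cache the ProtBert model+tokenizer.
        if cls.model is None:
            cls.model, cls.tokenizer = load()

    def __init__(self):
        # Nothing heavy here; jina can construct the executor instantly
        # and report readiness well within --timeout-ready.
        pass
```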
So I removed all the cache (the `results`, `data`, `model`, `tokenizer`, and `embeddings` folders) and started `make docker` with no cache.
On the first round of running `make docker`, the implementation gives the below logs. No more errors from `jina`, hurrah. :partying_face: :tada:
On the second round of running `make up`, the implementation generates the logs,
And the frontend works as expected now. :rocket:
Finalized PR. Help needed in resolving merge conflicts.
@georgeamccarthy @fissoreg do you guys think this is good to go? We can maybe have a 1-to-1 over how to deal with all the conflicting files together. :+1:
@Rubix982 this all looks great, hope to merge as soon as possible!
I merged on my side and resolved conflicts, but I'm having some troubles with the backend. I'm gonna investigate that and commit the merge here, probably on Monday. Thanks for the effort, and enjoy the weekend! ;)
Great stuff guys love it! @Rubix982 do you agree to the terms of the contributing agreement? Have a quick read
I agree to the contributing agreement, yes. Do I have to make a signature somewhere?
Also, if you guys plan on enforcing the contributing agreement, look into the CLA Assistant, which blocks any PRs until the contributors sign. It's similar to how FB will block your PR on their repository until you have signed the Facebook CLA agreement.
That's cracking and thanks for the link! No signature needed π
@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.
@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way at the moment is to use `.env` files. As for `executors.py`, we can use the `python-dotenv` package to import the variables from the `.env` into a Python script.
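To make the mechanism concrete, here is a minimal stand-in for what `python-dotenv`'s `load_dotenv` would do for us (the real package also handles quoting and interpolation; `MODEL_DIR` below is a made-up variable name, not one the project defines):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into os.environ.
    Existing environment variables win, matching python-dotenv's default
    (override=False)."""
    if not os.path.isfile(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Since docker-compose also reads the same `.env` file, one file could parametrize both the Dockerfiles and the Python code.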
@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.
The merge is here: https://github.com/Rubix982/protein_search/pull/1 There are some problems, but you can merge that PR so we have the merge commit here and we can keep the discussion flowing. Have a look before merging, to see if all the Dockerfile stuff looks good to you.
@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way at the moment is to use `.env` files. As for `executors.py`, we can use the `python-dotenv` package to import the variables from the `.env` into a Python script.
Ok thanks @Rubix982! Maybe `.env` files are a good option. If you have other ideas, let's discuss. The important point is that the various paths should be parametrized only once and in one place.
This is weird. I did not get a notification on my own repository. Thanks.
I'll try to implement the `.env` method on my repository and get back to you.
I'll try to implement the `.env` method on my repository and get back to you.
This is great, thanks! But let's make that into a different PR, maybe? So we can merge this one ASAP.
After making the changes suggested by Cristian on Slack, I am finally able to get results from the `/search` endpoint.
The changes have been made and pushed to my fork.
After the fixes, the Streamlit application throws these errors,
These errors are thrown from line 95 of `frontend/app.py`,
# Execute the query on the transport
result = client.execute(query, variable_values={"ids": ids})
I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?
Agreed, let's merge and move forward.
I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?
I didn't get this error but yes, let's merge and move on. We will also need to fix the automated tests.
Great job @Rubix982 !
Pull Request Type
Purpose
Why?
Changes Introduced
- Removes `requirements.txt` from the root and splits dependencies amongst the `backend` and the `frontend` by creating individual `requirements.txt` files
- Moves `data/` from the root into `backend/`
- Moves the `*.py` files in `backend` to `backend/src/`
- Pushes images to `Docker Hub`. This is the repository for the frontend, and the backend
Bugs (WIP)
- `Errno 111 - Connection Refused`
- `aiohttp` - to be added in `requirements.txt`
Notes
- The `backend` container is gigantic, close to 1 GB due to the `torch` dependency (831 MB). I was able to cache the containers, which means you will only need to install the requirements once for both containers; they should practically load given that the dependencies have not changed
- Requires `docker` and `docker-compose` on the machine. The containers can be built and started by running `make docker` in the root. They can be temporarily closed with `Ctrl^C`, started again with `make up`, and removed with `make remove`
- A `jina` user in the Dockerfile was created because `pip` does not like installing as root
- The PR includes `pdb_data_seq.csv`, which is 10K lines long (hence so much green in this PR); I'm not sure why that happened
Feedback required over
Mentions