georgeamccarthy / protein_search

The neural search engine for proteins.
GNU Affero General Public License v3.0
15 stars, 6 forks

ProtBertExecutor cannot handle proteins of different lengths. #10

Closed (georgeamccarthy closed this issue 3 years ago)

georgeamccarthy commented 3 years ago

To reproduce: Change the input data file from samelength.csv, which contains only protein sequences of the same length, to a data file with protein sequences of different lengths (e.g. Train_HHblits_1column_short.csv).

Error:

pod0@8742[E]:ValueError("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.")
 add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 699, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 332 at dim 1 (got 368)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/jina/peapods/runtimes/zmq/zed.py", line 289, in _msg_callback
    self._zmqlet.send_message(self._callback(msg))
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/jina/peapods/runtimes/zmq/zed.py", line 275, in _callback
    self._pre_hook(msg)._handle(msg)._post_hook(msg)
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/jina/peapods/runtimes/zmq/zed.py", line 221, in _handle
    r_docs = self._executor(
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/jina/executors/__init__.py", line 187, in __call__
    return self.requests[__default_endpoint__](
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/jina/executors/decorators.py", line 103, in arg_wrapper
    return fn(*args, **kwargs)
  File "/Users/georgeamccarthy/Documents/workspace/python/protein_search/protein_search/backend/my_executors.py", line 29, in encode
    encoded_inputs = self.tokenizer(sequences, return_tensors="pt")
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2310, in __call__
    return self.batch_encode_plus(
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2495, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 549, in _batch_encode_plus
    batch_outputs = self._batch_prepare_for_model(
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 629, in _batch_prepare_for_model
    batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 204, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/Users/georgeamccarthy/opt/anaconda3/envs/jina/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 715, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

fissoreg commented 3 years ago

This can be solved by properly addressing #5.
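
For anyone hitting the same error: the message in the traceback points at padding. Below is a minimal, dependency-free sketch of what padding does to a ragged batch. It is not necessarily the fix intended in #5. The helper `pad_batch`, the pad id of 0, and the sample token ids are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the length of the longest one.

    Returns (padded_ids, attention_mask). The mask holds 1 for real
    tokens and 0 for padding positions.
    """
    max_len = max((len(seq) for seq in batch), default=0)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask


# Hypothetical token-id lists of different lengths, standing in for
# two tokenized protein sequences.
ragged = [[2, 5, 9, 3], [2, 7, 3]]
padded_ids, attention_mask = pad_batch(ragged)
# Every row now has the same length, so the batch is rectangular and
# can be turned into a single tensor.

# With the Hugging Face tokenizer call shown in the traceback, the
# corresponding change is to pass the flags the error message names:
#   encoded_inputs = self.tokenizer(
#       sequences, return_tensors="pt", padding=True, truncation=True
#   )
```

If the embeddings are later pooled over the sequence dimension, the attention mask should probably be used to exclude the padded positions, otherwise shorter proteins would have pad tokens mixed into their vectors.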