Describe the bug
I uploaded a CSV file that contains multiple ID attributes (id, user_id, user_id_str). The IDs are numbers with 19 or more digits (e.g. 1614977053542981638) and are being cast to integers.
I am trying to run an embedding on the full_text attribute, but it hangs indefinitely with the Initializing status.
The UI does not print an error, but the refinery-embedder container logs it:
2023-11-23 09:22:42 --- Running on CPU. If you're facing performance issues, you should consider switching to a CUDA device
2023-11-23 09:23:43 INFO: 172.21.0.26:51126 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:45:20 INFO: 172.21.0.26:49914 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:45:41 INFO: 172.21.0.26:56300 - "POST /embed HTTP/1.1" 200 OK
2023-11-23 09:46:04 INFO: 172.21.0.26:39562 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:46:20 INFO: 172.21.0.26:49828 - "DELETE /delete/752587b6-6130-4424-a416-7b1ce6e62c44/932176de-8174-4fcd-8249-d02fb78eb863 HTTP/1.1" 200 OK
2023-11-23 09:22:42 INFO: Started server process [1]
2023-11-23 09:22:42 INFO: Waiting for application startup.
2023-11-23 09:22:42 INFO: Application startup complete.
2023-11-23 09:22:42 INFO: Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
2023-11-23 09:45:43 INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: bert-base-uncased
Downloading .gitattributes: 100%|██████████| 491/491 [00:00<00:00, 87.8kB/s]
Downloading LICENSE: 100%|██████████| 11.4k/11.4k [00:00<00:00, 5.59MB/s]
Downloading README.md: 100%|██████████| 10.5k/10.5k [00:00<00:00, 25.2MB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 664kB/s]
Downloading (…)CoreML/model.mlmodel: 100%|██████████| 165k/165k [00:00<00:00, 863kB/s]
Downloading weight.bin: 100%|██████████| 532M/532M [01:26<00:00, 6.18MB/s]
Downloading (…)ackage/Manifest.json: 100%|██████████| 617/617 [00:00<00:00, 368kB/s]
2023-11-23 09:47:21 INFO: 172.21.0.26:40500 - "POST /embed HTTP/1.1" 200 OK
Downloading model.onnx: 100%|██████████| 532M/532M [03:03<00:00, 2.90MB/s]
Downloading model.safetensors: 100%|██████████| 440M/440M [02:24<00:00, 3.05MB/s]
Downloading pytorch_model.bin: 100%|██████████| 440M/440M [02:48<00:00, 2.62MB/s]
Downloading tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.27MB/s]
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 42.8kB/s]
Downloading vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.07MB/s]
2023-11-23 09:55:29 WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/bert-base-uncased. Creating a new one with MEAN pooling.
2023-11-23 09:55:32 Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
2023-11-23 09:55:32 - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2023-11-23 09:55:32 - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-11-23 09:55:32 INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
2023-11-23 09:55:32 Traceback (most recent call last):
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
2023-11-23 09:55:32 self.dialect.do_execute(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
2023-11-23 09:55:32 cursor.execute(statement, parameters)
2023-11-23 09:55:32 psycopg2.errors.NumericValueOutOfRange: value "1614977053542981638" is out of range for type integer
2023-11-23 09:55:32
2023-11-23 09:55:32
2023-11-23 09:55:32 The above exception was the direct cause of the following exception:
2023-11-23 09:55:32
2023-11-23 09:55:32 Traceback (most recent call last):
2023-11-23 09:55:32 File "/program/controller.py", line 252, in run_encoding
2023-11-23 09:55:32 record_ids, attribute_values_raw = record.get_attribute_data(
2023-11-23 09:55:32 File "/program/submodules/model/business_objects/record.py", line 403, in get_attribute_data
2023-11-23 09:55:32 result = general.execute_all(query)
2023-11-23 09:55:32 File "/program/submodules/model/business_objects/general.py", line 61, in execute_all
2023-11-23 09:55:32 return session.execute(sql).all()
2023-11-23 09:55:32 File "<string>", line 2, in execute
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1714, in execute
2023-11-23 09:55:32 result = conn._execute_20(statement, params or {}, execution_options)
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
2023-11-23 09:55:32 return meth(self, args_10style, kwargs_10style, execution_options)
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
2023-11-23 09:55:32 return connection._execute_clauseelement(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
2023-11-23 09:55:32 ret = self._execute_context(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
2023-11-23 09:55:32 self._handle_dbapi_exception(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
2023-11-23 09:55:32 util.raise_(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
2023-11-23 09:55:32 raise exception
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
2023-11-23 09:55:32 self.dialect.do_execute(
2023-11-23 09:55:32 File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
2023-11-23 09:55:32 cursor.execute(statement, parameters)
2023-11-23 09:55:32 sqlalchemy.exc.DataError: (psycopg2.errors.NumericValueOutOfRange) value "1614977053542981638" is out of range for type integer
2023-11-23 09:55:32
2023-11-23 09:55:32 [SQL:
2023-11-23 09:55:32 SELECT id::TEXT, data::JSON->'full_text' AS "full_text"
2023-11-23 09:55:32 FROM record
2023-11-23 09:55:32 WHERE project_id = '752587b6-6130-4424-a416-7b1ce6e62c44'
2023-11-23 09:55:32 ORDER BY (data->>'id')::INTEGER, (data->>'user_id')::INTEGER, (data->>'user_id_str')::INTEGER
2023-11-23 09:55:32 ]
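A minimal sketch of why the `::INTEGER` cast in the `ORDER BY` clause above fails for these IDs. It only compares the value against PostgreSQL's documented ranges for `INTEGER` (32-bit) and `BIGINT` (64-bit); the helper name `smallest_pg_cast` is hypothetical and not part of refinery.

```python
# Documented upper bounds of PostgreSQL's signed integer types.
PG_INTEGER_MAX = 2**31 - 1  # 2147483647
PG_BIGINT_MAX = 2**63 - 1   # 9223372036854775807


def smallest_pg_cast(value: str) -> str:
    """Return the narrowest PostgreSQL numeric type that can hold `value`.

    Hypothetical helper for illustration only; not taken from refinery.
    """
    n = int(value)
    if -PG_INTEGER_MAX - 1 <= n <= PG_INTEGER_MAX:
        return "INTEGER"
    if -PG_BIGINT_MAX - 1 <= n <= PG_BIGINT_MAX:
        return "BIGINT"
    # Anything wider needs arbitrary precision.
    return "NUMERIC"


# The ID from the error message does not fit INTEGER but fits BIGINT.
print(smallest_pg_cast("1614977053542981638"))
```

The 19-digit ID from the error message fits `BIGINT`, but a 20-digit ID of similar magnitude would not, so a `NUMERIC` cast (or ordering by the text value) may be the safer choice for these columns.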
To Reproduce
Steps to reproduce the behavior:
full_text
None
HuggingFace/Python/Any
Attribute/Token/Any
distilbert-base-uncased/Any
Expected behavior
The embedding model is downloaded and the embeddings are generated successfully. If an error like this one occurs, I am notified about it in the UI.
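A rough sketch of the kind of error handling I would expect, assuming the embedding job runs as a callable and its status is stored somewhere the UI can read. All names here (`EmbeddingState`, `run_with_status`, the `status` dict) are hypothetical and not taken from the refinery code base.

```python
from enum import Enum
from typing import Callable, Dict


class EmbeddingState(Enum):
    # Hypothetical states; the real status values may differ.
    INITIALIZING = "INITIALIZING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


def run_with_status(job: Callable[[], None], status: Dict[str, str]) -> None:
    """Run `job` and record the outcome in `status` instead of hanging."""
    status["state"] = EmbeddingState.INITIALIZING.value
    try:
        job()
    except Exception as exc:
        # Store a short message so the UI can show it to the user.
        status["state"] = EmbeddingState.FAILED.value
        status["error"] = f"{type(exc).__name__}: {exc}"
    else:
        status["state"] = EmbeddingState.FINISHED.value
```

With something like this, a database error during encoding would move the embedding out of the Initializing status and leave a message that can be displayed.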
Desktop (please complete the following information):
tweets-300.csv.gz