code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0

[BUG] - Faulty attribute type and follow up errors #275

Closed andhreljaKern closed 8 months ago

andhreljaKern commented 9 months ago

Describe the bug

I uploaded a CSV file that contains multiple ID attributes (id, user_id, user_id_str). The IDs are numbers with 19 or more digits (e.g. 1614977053542981638), and they are being cast as integers.
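To illustrate the type problem described above, here is a minimal sketch of a range-aware type check for an uploaded column. It is not refinery's actual import logic; the function name `infer_attribute_type` and the returned labels are hypothetical. It assumes the target is a 32-bit PostgreSQL INTEGER and falls back to text for anything that does not fit.

```python
# Hypothetical helper, not part of refinery: decide whether a column of raw
# CSV strings can safely be stored as a 32-bit integer attribute.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def infer_attribute_type(values):
    """Return "INTEGER" only if every value is an int within 32-bit range."""
    values = list(values)
    if not values:
        return "TEXT"  # nothing to infer from, so stay with the safest type
    for raw in values:
        try:
            number = int(raw)
        except (TypeError, ValueError):
            return "TEXT"  # not a whole number at all
        if not INT32_MIN <= number <= INT32_MAX:
            return "TEXT"  # numeric, but too large for a 32-bit integer
    return "INTEGER"


print(infer_attribute_type(["1", "2", "3"]))
print(infer_attribute_type(["1614977053542981638"]))
```

With a check like this, the ID columns from the report would be treated as text rather than as integers.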

When I try to run an embedding on the full_text attribute, it hangs indefinitely in the Initializing status.

The UI does not print an error, but the refinery-embedder container logs it:

2023-11-23 09:22:42 --- Running on CPU. If you're facing performance issues, you should consider switching to a CUDA device
2023-11-23 09:23:43 INFO:     172.21.0.26:51126 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:45:20 INFO:     172.21.0.26:49914 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:45:41 INFO:     172.21.0.26:56300 - "POST /embed HTTP/1.1" 200 OK
2023-11-23 09:46:04 INFO:     172.21.0.26:39562 - "GET /classification/recommend/TEXT HTTP/1.1" 200 OK
2023-11-23 09:46:20 INFO:     172.21.0.26:49828 - "DELETE /delete/752587b6-6130-4424-a416-7b1ce6e62c44/932176de-8174-4fcd-8249-d02fb78eb863 HTTP/1.1" 200 OK
2023-11-23 09:22:42 INFO:     Started server process [1]
2023-11-23 09:22:42 INFO:     Waiting for application startup.
2023-11-23 09:22:42 INFO:     Application startup complete.
2023-11-23 09:22:42 INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
2023-11-23 09:45:43 INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: bert-base-uncased
Downloading .gitattributes: 100%|██████████| 491/491 [00:00<00:00, 87.8kB/s]
Downloading LICENSE: 100%|██████████| 11.4k/11.4k [00:00<00:00, 5.59MB/s]
Downloading README.md: 100%|██████████| 10.5k/10.5k [00:00<00:00, 25.2MB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 664kB/s]
Downloading (…)CoreML/model.mlmodel: 100%|██████████| 165k/165k [00:00<00:00, 863kB/s]
Downloading weight.bin: 100%|██████████| 532M/532M [01:26<00:00, 6.18MB/s] 
Downloading (…)ackage/Manifest.json: 100%|██████████| 617/617 [00:00<00:00, 368kB/s]
2023-11-23 09:47:21 INFO:     172.21.0.26:40500 - "POST /embed HTTP/1.1" 200 OK
Downloading model.onnx: 100%|██████████| 532M/532M [03:03<00:00, 2.90MB/s] 
Downloading model.safetensors: 100%|██████████| 440M/440M [02:24<00:00, 3.05MB/s] 
Downloading pytorch_model.bin: 100%|██████████| 440M/440M [02:48<00:00, 2.62MB/s] 
Downloading tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.27MB/s]
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 42.8kB/s]
Downloading vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.07MB/s]
2023-11-23 09:55:29 WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/bert-base-uncased. Creating a new one with MEAN pooling.
2023-11-23 09:55:32 Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
2023-11-23 09:55:32 - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2023-11-23 09:55:32 - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-11-23 09:55:32 INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
2023-11-23 09:55:32 Traceback (most recent call last):
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
2023-11-23 09:55:32     self.dialect.do_execute(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
2023-11-23 09:55:32     cursor.execute(statement, parameters)
2023-11-23 09:55:32 psycopg2.errors.NumericValueOutOfRange: value "1614977053542981638" is out of range for type integer
2023-11-23 09:55:32 
2023-11-23 09:55:32 
2023-11-23 09:55:32 The above exception was the direct cause of the following exception:
2023-11-23 09:55:32 
2023-11-23 09:55:32 Traceback (most recent call last):
2023-11-23 09:55:32   File "/program/controller.py", line 252, in run_encoding
2023-11-23 09:55:32     record_ids, attribute_values_raw = record.get_attribute_data(
2023-11-23 09:55:32   File "/program/submodules/model/business_objects/record.py", line 403, in get_attribute_data
2023-11-23 09:55:32     result = general.execute_all(query)
2023-11-23 09:55:32   File "/program/submodules/model/business_objects/general.py", line 61, in execute_all
2023-11-23 09:55:32     return session.execute(sql).all()
2023-11-23 09:55:32   File "<string>", line 2, in execute
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1714, in execute
2023-11-23 09:55:32     result = conn._execute_20(statement, params or {}, execution_options)
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
2023-11-23 09:55:32     return meth(self, args_10style, kwargs_10style, execution_options)
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
2023-11-23 09:55:32     return connection._execute_clauseelement(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
2023-11-23 09:55:32     ret = self._execute_context(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
2023-11-23 09:55:32     self._handle_dbapi_exception(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
2023-11-23 09:55:32     util.raise_(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
2023-11-23 09:55:32     raise exception
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
2023-11-23 09:55:32     self.dialect.do_execute(
2023-11-23 09:55:32   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
2023-11-23 09:55:32     cursor.execute(statement, parameters)
2023-11-23 09:55:32 sqlalchemy.exc.DataError: (psycopg2.errors.NumericValueOutOfRange) value "1614977053542981638" is out of range for type integer
2023-11-23 09:55:32 
2023-11-23 09:55:32 [SQL: 
2023-11-23 09:55:32         SELECT id::TEXT, data::JSON->'full_text' AS "full_text"
2023-11-23 09:55:32         FROM record
2023-11-23 09:55:32         WHERE project_id = '752587b6-6130-4424-a416-7b1ce6e62c44'
2023-11-23 09:55:32         ORDER BY (data->>'id')::INTEGER, (data->>'user_id')::INTEGER, (data->>'user_id_str')::INTEGER
2023-11-23 09:55:32         ]
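The failing part of the logged query is the `::INTEGER` cast in the ORDER BY clause: PostgreSQL's INTEGER is a 32-bit type, so 1614977053542981638 cannot be represented. The sketch below shows the arithmetic and one possible way to build the same clause with a wider cast. The helper `order_by_clause` is hypothetical and is not how refinery builds its query; NUMERIC is chosen here as an assumption because IDs with more than 19 digits could also exceed BIGINT.

```python
import re

# Upper bounds of PostgreSQL's signed integer types.
PG_INTEGER_MAX = 2**31 - 1   # 2147483647
PG_BIGINT_MAX = 2**63 - 1    # 9223372036854775807

# Only plain identifiers are accepted, since the name is placed inside SQL text.
_SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def order_by_clause(attribute_names, cast="NUMERIC"):
    """Build an ORDER BY clause over JSON attributes using the given cast."""
    parts = []
    for name in attribute_names:
        if not _SAFE_NAME.match(name):
            raise ValueError(f"unsafe attribute name: {name!r}")
        parts.append(f"(data->>'{name}')::{cast}")
    return "ORDER BY " + ", ".join(parts)


value = 1614977053542981638
print(value > PG_INTEGER_MAX)    # the ID from the log does not fit INTEGER
print(value <= PG_BIGINT_MAX)    # but this particular ID would fit BIGINT
print(order_by_clause(["id", "user_id", "user_id_str"]))
```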

To Reproduce

Steps to reproduce the behavior:

  1. Run refinery locally
  2. Navigate to http://localhost:4455/refinery/projects
  3. Create a new project
  4. Upload attached file
  5. Generate Embedding
    • Target Attribute: full_text
    • Filter Attributes: None
    • Platform: HuggingFace/Python/Any
    • Granularity: Attribute/Token/Any
    • Model: distilbert-base-uncased/Any

Expected behavior

The embedding model is downloaded and embeddings are generated successfully. If an error like this one occurs, I am notified about it.
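One way the notification part of this expectation could look is sketched below: the task runs inside a wrapper that records a failed state instead of leaving the status at Initializing. Everything here is an assumption for illustration; `run_with_status`, the `set_state` callback and the state names are hypothetical and do not reflect the refinery-embedder code.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_status(task, set_state):
    """Run task() and report its outcome through set_state(state, message)."""
    set_state("INITIALIZING", None)
    try:
        result = task()
    except Exception as exc:
        # Keep the full traceback in the container log, and also hand a
        # short message to the caller so a UI could display it.
        logger.exception("Embedding task failed")
        set_state("FAILED", str(exc))
        return None
    set_state("FINISHED", None)
    return result


def failing_task():
    # Stand-in for the encoding step that raised the database error.
    raise ValueError('value "1614977053542981638" is out of range for type integer')


states = []
run_with_status(failing_task, lambda state, message: states.append((state, message)))
print(states[-1][0])
```

The point of the design is that the error path always ends in an explicit state, so a client polling the status has something to show other than an endless Initializing.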

tweets-300.csv.gz