druid-io / pydruid

A Python connector for Druid

"\\" in row breaks PyDruid - JSONDecodeError('Unterminated string...') #242

Open dklei opened 3 years ago

dklei commented 3 years ago

Hi,

I'm using pydruid.db.connector to run a query that returns a row whose content ends in "...\\". This appears to break pydruid: it either silently drops rows from the data or fails with a JSONDecodeError.

e.g. "SELECT x FROM y" -> [{"x": "some row"},{"x": "...\\"},{"x": "another row"},{"x": "more rows"}]

2020-11-27 10:44:23: [CRITICAL] JSONDecodeError('Unterminated string starting at: line 1 column 85919 (char 85918)')
2020-11-27 10:44:23: [CRITICAL] Traceback (most recent call last):
  File "xxxxx", line 291, in main
    data_paths = pull_data(tracker.last_data_dt, tracker.next_data_dt)
  File "xxxxx", line 162, in pull_data
    data_path = collector.execute_and_save()
  File "xxxxx", line 226, in execute_and_save
    for i, row in enumerate(cursor):
  File "xxxxx", line 181, in _get_cursor
    raise err
  File "xxxxx", line 164, in _get_cursor
    raise err
  File "xxxxx", line 161, in _get_cursor
    r = next(cursor)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 62, in g
    return f(self, *args, **kwargs)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 320, in next
    return next(self._results)
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 370, in _stream_query
    for row in rows_from_chunks(chunks):
  File "/xxxx/venv/lib64/python3.8/site-packages/pydruid/db/api.py", line 420, in rows_from_chunks
    for row in json.loads(
  File "/usr/lib64/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib64/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 85919 (char 85918)

Any rows following the {"x": "...\\"} row are either not returned or trigger a JSONDecodeError. I'm guessing this is because pydruid.db.api.rows_from_chunks tries to find the row boundaries in the JSON itself, and misreads a "\\" at the end of a string?
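To make the guess above concrete, here is a minimal sketch of how such a misread could happen. It assumes (this is not confirmed anywhere in this thread) that a scanner decides whether a quote closes a string by looking only at the single character before it. Both function names are hypothetical and are not part of pydruid.

```python
def quote_closes_string_naive(text: str, i: int) -> bool:
    """Hypothetical check: the quote at index i closes the string
    unless the character right before it is a backslash."""
    return text[i - 1] != "\\"


def quote_closes_string(text: str, i: int) -> bool:
    """Count the run of backslashes before the quote at index i.
    An even count means the backslashes escape each other and the
    quote itself is unescaped, so it closes the string."""
    backslashes = 0
    j = i - 1
    while j >= 0 and text[j] == "\\":
        backslashes += 1
        j -= 1
    return backslashes % 2 == 0


# JSON text for a value that ends in one literal backslash: "C:\\"
raw = r'{"x": "C:\\"}'
closing_quote = raw.rindex('"')

print(quote_closes_string_naive(raw, closing_quote))  # False
print(quote_closes_string(raw, closing_quote))        # True
```

Under that assumption, the naive check never sees the string end, so every brace in the following rows is treated as string content. That would be consistent with later rows being dropped or with an "Unterminated string" error.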

I have attached a script and a dummy JSON file (scratch.zip) that show the rows being dropped by the function. This does not trigger the JSONDecodeError, though; that appears to happen only when I read this row and the surrounding rows from the database.

Many thanks in advance

tvamsisai commented 1 year ago

Hey, I'm hitting this issue. Is it possible to fix this soon? This is quite a severe issue as it fails silently.

@gianm @mistercrunch

ahiijny commented 6 months ago

This is still an issue in v0.6.6:

>>> from importlib.metadata import version
>>> version('pydruid')
'0.6.6'

To replicate:

from pydruid.db.api import rows_from_chunks

# Row 2's value is the JSON string "C:\\", i.e. C: followed by a single backslash.
bad_json = """[
    {
        "id": 1,
        "value": "hi"
    },
    {
        "id": 2,
        "value": "C:\\\\"
    },
    {
        "id": 3,
        "value": "this row is missing..."
    }
]"""

for row in rows_from_chunks([bad_json]):
    print(f"row from bad json: {row}")

print("that's all!")

This prints:

row from bad json: OrderedDict([('id', 1), ('value', 'hi')])
that's all!

There are rows missing!

The suggested change in #262 seems to fix this problem. If I paste in the updated function definition from that PR and then rerun the above script, it prints the expected result:

row from bad json: OrderedDict([('id', 1), ('value', 'hi')])
row from bad json: OrderedDict([('id', 2), ('value', 'C:\\')])
row from bad json: OrderedDict([('id', 3), ('value', 'this row is missing...')])
that's all!
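For anyone who needs a stopgap, the sketch below shows one way to split a streamed JSON array of objects into rows while tracking escape state explicitly. It is not the code from #262 and not pydruid's implementation; `iter_rows` is a hypothetical name, and the sketch assumes the response is a JSON array of objects delivered as text chunks.

```python
import json
from collections import OrderedDict
from typing import Iterable, Iterator


def iter_rows(chunks: Iterable[str]) -> Iterator[OrderedDict]:
    """Yield each top-level object from a JSON array of objects
    that arrives as a sequence of text chunks (hypothetical helper)."""
    buffer = ""
    pos = 0            # next character in buffer to examine
    start = None       # index where the current top-level object begins
    depth = 0          # nesting depth of {...}
    in_string = False
    escaped = False    # True if the previous character was an unescaped backslash

    for chunk in chunks:
        buffer += chunk
        while pos < len(buffer):
            char = buffer[pos]
            if in_string:
                if escaped:
                    escaped = False       # this character is escaped, skip it
                elif char == "\\":
                    escaped = True        # next character is escaped
                elif char == '"':
                    in_string = False     # unescaped quote closes the string
            elif char == '"':
                in_string = True
            elif char == "{":
                if depth == 0:
                    start = pos
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    yield json.loads(
                        buffer[start:pos + 1], object_pairs_hook=OrderedDict
                    )
                    # Discard the consumed text and restart scanning at index 0.
                    buffer = buffer[pos + 1:]
                    pos = -1
                    start = None
            pos += 1


text = r'[{"id": 1, "value": "hi"}, {"id": 2, "value": "C:\\"}, {"id": 3, "value": "later"}]'
for row in iter_rows([text[:40], text[40:]]):
    print(dict(row))
```

Because the escape flag is carried across characters and across chunk boundaries, a value ending in a backslash no longer hides the closing quote, even if a chunk happens to end between the two backslashes.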