dataiku / dss-plugin-splunk

A connector to Splunk
https://www.dataiku.com/product/plugins/splunk/
Apache License 2.0

Character encoding #2

Open Schiggy-3000 opened 1 year ago

Schiggy-3000 commented 1 year ago

Hi everyone,

First off, thank you for providing this app =)

I regularly can't import data from Splunk due to encoding errors: for example, the data in Splunk is UTF-8 encoded, but Dataiku only accepts ASCII. The issue is that the app does not let me specify the encoding of the Splunk data (UTF-8, ASCII, ...). Since Splunk data is usually rather voluminous, fixing this problem on the Splunk side would mean reindexing a lot of data, which comes with significant costs.

Here is an example error message:

```
2023-03-21 15:44:56,091 INFO SplunkIndexConnector:Connected to Splunk
2023-03-21 15:44:56,093 INFO Processing task: read_rows
2023-03-21 15:44:57,899 ERROR Connector send fail, storing exception
Traceback (most recent call last):
  File "/.../dataiku-dss-11.0.2/python/dataiku/connector/server.py", line 110, in serve
    read_rows(connector, schema, partitioning, partition_id, limit, output)
  File "/.../dataiku-dss-11.0.2/python/dataiku/connector/server.py", line 32, in read_rows
    for row in connector.generate_rows(schema, partitioning, partition_id, limit):
  File "/tmp/tmp_folder_nSFiBdol/dku_code.py", line 91, in generate_rows
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 173000: ordinal not in range(128)
2023-03-21 15:44:57,901 INFO Processing task: finish_read_session
```
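For context, the error is easy to reproduce outside the plugin (this is a minimal illustration, not the plugin's actual data): the byte `0xc3` mentioned in the traceback is the lead byte of many two-byte UTF-8 sequences, such as `ü`, and the ASCII codec rejects any byte above 0x7f.

```python
# Minimal reproduction of the UnicodeDecodeError above.
# "ü" encodes to the two bytes b"\xc3\xbc" in UTF-8.
payload = "Zürich".encode("utf-8")  # b'Z\xc3\xbcrich'

try:
    payload.decode("ascii")
except UnicodeDecodeError as err:
    print(err)  # 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

# Decoding with the correct codec succeeds:
print(payload.decode("utf-8"))  # Zürich
```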

M-Shash commented 2 months ago

Any update here? Or @Schiggy-3000, did you manage to work around it?

Schiggy-3000 commented 2 months ago

Hi @M-Shash

Indeed, there is a workaround. You can find the solution here: https://community.dataiku.com/discussion/33251/importing-splunk-data#latest

Or in brief, do the following:

Convert the plugin into a dev plugin, then edit line 91 of the file python-connectors/splunk_import-index/connector.py, changing this:

```python
for sample in content.decode().split("\n"):
```

into this:

```python
for sample in content.decode("utf-8").split("\n"):
```
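If your indexed data may also contain the occasional invalid byte, a slightly more defensive variant (my own suggestion, not part of the plugin) avoids aborting the whole read session: `errors="replace"` substitutes U+FFFD for any undecodable byte instead of raising `UnicodeDecodeError`.

```python
# Sketch of a more defensive version of the patched line.
# The sample bytes are illustrative; 0xc3 followed by 0x28 is an
# invalid UTF-8 sequence, so a plain .decode("utf-8") would raise.
content = b"ok\n\xc3\x28broken\n done"

for sample in content.decode("utf-8", errors="replace").split("\n"):
    print(sample)
```

The trade-off is silent data alteration (bad bytes become the replacement character), so prefer the plain `decode("utf-8")` fix if your Splunk data is known to be valid UTF-8.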