artefactory / artefactory-connectors-kit

ACK is an E(T)L tool specialized in API data ingestion, accessible through a command-line interface. The application lets you easily extract, stream, and load data (with minimal transformations) from an API source to the destination of your choice.
GNU Lesser General Public License v3.0

Fix needed: the default .csv field_size_limit is exceeded while making requests with the DV360 reader #81

Closed tom-grivaud closed 3 years ago

tom-grivaud commented 3 years ago

ERROR AND WHY: While collecting data from the DV360 platform, I encountered the following issue:

Traceback (most recent call last):
  File "nck/entrypoint.py", line 86, in <module>
    app()
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1164, in invoke
    return _process_result(rv)
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 1101, in _process_result
    value = ctx.invoke(self.result_callback, value,
  File "/.../nautilus-connectors-kit/nautilus-env/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "nck/entrypoint.py", line 68, in run
    writer.write(stream)
  File "/.../nautilus-connectors-kit/nck/writers/console_writer.py", line 44, in write
    buffer = file.read(1024)
  File "/.../nautilus-connectors-kit/nck/streams/stream.py", line 114, in readinto
    chunk = self.leftover or encode(next(iterable))
  File "/.../nautilus-connectors-kit/nck/utils/file_reader.py", line 36, in sdf_to_njson_generator
    for line in dict_reader:
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

The error message led me to raise csv.field_size_limit above its default of 131072.

HOW TO FIX IT AND FURTHER INVESTIGATION: After I added one line of code to file_reader.py, the error vanished and my result was printed to the console.

The following line was added to the nck/utils/file_reader.py file (replace 10000000 with another limit; the exact value is up for discussion):

csv.field_size_limit(10000000)
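
For reference, a common defensive pattern (a sketch, not part of the ACK codebase) is to ask for the largest limit the platform accepts instead of hard-coding a value, since csv.field_size_limit(sys.maxsize) raises OverflowError on some builds:

import csv
import sys

# Raise the csv field size limit as high as the platform allows.
# Some interpreters reject sys.maxsize with OverflowError, so back off
# by a factor of 10 until a value is accepted.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int = int(max_int / 10)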

Even though this worked, I noticed that one field contained an outrageous number of IDs. Before settling on a new csv.field_size_limit, it could be worth checking that there is no mistake in the process that would cause a field to contain far more IDs than it really should.
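
One way to check this (a hypothetical diagnostic, not existing ACK code) is to report the longest field per column of the downloaded SDF .csv, so an abnormally large field stands out. This assumes the field size limit has already been raised as above, otherwise the reader fails on the oversized field:

import csv

# Hypothetical diagnostic: print the longest field per column of an SDF
# .csv so a field packing far too many IDs stands out immediately.
# Assumes csv.field_size_limit has already been raised (see above).
def max_field_lengths(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        longest = [0] * len(header)
        for row in reader:
            for i, field in enumerate(row):
                if i < len(longest):
                    longest[i] = max(longest[i], len(field))
    return dict(zip(header, longest))

# "sdf_report.csv" is a placeholder path, not a file shipped with ACK.
for column, size in max_field_lengths("sdf_report.csv").items():
    print(f"{column}: {size} characters")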

benoitgoujon commented 3 years ago

Thank you @tom-grivaud for bringing this issue to our attention.

It seems like you noticed this behaviour while writing directly to the console. Can you reproduce it by writing to a local file (write_local command)?

If yes, do you have the same problem writing the output file to a bucket?

I would like to be sure it is a general problem and not one specific to the console writer.

gabrielleberanger commented 3 years ago

Thank you @tom-grivaud !

When you say "a field contained an outrageous number of IDs", do you mean that a single line of your .csv (supposed to contain a single record) actually featured multiple records?

I have to admit that I am not very knowledgeable about the DV360 reader. @bibimorlet, are you using it for Samsung (if I remember correctly, we were using the DBM reader instead)? If yes, did you notice any issues with the output data?

tom-grivaud commented 3 years ago

@benoitgoujon we tried both write_console and write_s3, but neither worked; the same error occurred. @gabrielleberanger Not exactly: the file contains several JSON objects made of key-value pairs, and one value in one of those objects holds this huge number of IDs as a single string.
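
To illustrate the shape I am describing (field names and values below are made up, not actual DV360 output):

import json

# Made-up record mimicking the stream: one JSON object per line, where one
# value aggregates a huge list of IDs into a single string.
record = {
    "Line Item Id": "12345",
    "Impressions": "1000",
    "Deal Ids": "; ".join(str(448916000 + i) for i in range(50000)),
}
# The aggregated field alone far exceeds the 131072-character default limit.
print(len(record["Deal Ids"]))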

Let me know if anything is not clear to you guys.