capitalone / giraffez

User-friendly Teradata client for Python
https://capitalone.github.io/giraffez
Apache License 2.0
108 stars 35 forks source link

Teradata JSON column load error #61

Closed xevix closed 4 years ago

xevix commented 6 years ago

Using giraffez CLI load command to put data in a table with a JSON column I get this error.

# giraffez -vvv load mycsv.txt "mydb.mytable"
/usr/local/lib/python3.6/dist-packages/giraffez/utils.py:129: UserWarning: USING MLOAD TO INSERT LESS THAN 100000 ROWS IS NOT RECOMMENDED - USE LOAD INSTEAD!
Connection: Connecting to data source 'other' ...
Connection: Connection to 'other' established successfully.
BulkLoad: Executing ...
  source      => "mycsv.txt"
  output      => "mydb.mytable"
BulkLoad: Initiating Teradata PT request (awaiting server)  ...
Connection: Closing Teradata connection ...
Connection: Connection to 'other' closed.
Traceback (most recent call last):
  File "/usr/local/bin/giraffez", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/giraffez/__main__.py", line 25, in main
    MainCommand().run()
  File "/usr/local/lib/python3.6/dist-packages/giraffez/core.py", line 80, in run
    args.run(args)
  File "/usr/local/lib/python3.6/dist-packages/giraffez/commandline.py", line 431, in run
    exit_code = load.from_file(args.input_file, delimiter=args.delimiter, null=args.null, quotechar=args.quote_char, parse_dates=args.parse_dates)
  File "/usr/local/lib/python3.6/dist-packages/giraffez/load.py", line 254, in from_file
    self._initiate()
  File "/usr/local/lib/python3.6/dist-packages/giraffez/load.py", line 430, in _initiate
    self.mload.initiate(self.table, self.columns)
giraffez.TeradataError: 5674: LOBs are not allowed to be selected in Record or Indicator modes.

Looking online it seems like if I were using ODBC I could make this go away by adding this:

EnableLobSupport=Yes

Unfortunately I can't see how to pass connection string options to Giraffez. Is there a way to do this? Or is the JSON column not supported?

Thanks!

ChrisRx commented 6 years ago

giraffez load uses the multiload protocol (via Teradata's C++ TPT API). This protocol is most useful when dealing with a very large amount of data vs. ODBC or the CLIv2 with insert statements, however, Teradata server does not support LOB column types as seen here in the official documentation. In the case that you are working with LOB column types you will have to use something like giraffez insert which uses Teradatas multi-statement mode to batch and send inserts. I've not dealt with LOB column types before so I'm unsure if this will work naively but is worth a shot (if a row of your data is greater than 64k, it will most likely fail). If it doesn't end up working, I think it could be explored to see how Teradata wants to insert LOB content and some changes could be made to giraffez that might facilitate LOB column types.

xevix commented 6 years ago

Ah, I see. Thanks for the swift and detailed reply, that's unfortunate load does not support LOB.

Is multi-statement mode less performant than multiload? If so, then I was thinking to go about this use case in a 2-step process:

  1. Load data to a target table where the originally JSON column is a varchar, which does appear to be supported according to Teradata.

In general, you cannot import or export LOB JSON data using FastLoad, MultiLoad, or FastExport protocols using either the legacy stand-alone load tools or Parallel Transporter. However, you can use these utilities to load or unload JSON data in the following cases: • You can load JSON data if it is less than 64 KB, and the target table defines the column as CHAR or VARCHAR. • If you set the transform group for the user to use a transform group that converts JSON to/from VARCHAR or VARBYTE.

  1. Do an INSERT into the final table, SELECTing from the table in step 1, but using NEW JSON() to convert the varchar over.

This isn't particularly ideal, but I'm hoping it should allow a way forward. Unfortunately I'm not too well versed in the C++ TPT API so I'm not sure what changes could be made in giraffez that would make the general process easier.

ChrisRx commented 6 years ago

Yes, multi-statement mode is far less performant than multiload. It is essentially a parallel stream of insert statements vs multiload which is a binary protocol designed specifically for insert/update performance.

I'm not familiar with some of Teradata's newer features. It appears that the newer versions of their libraries may support LOB data types with the Update driver (aka multiload), but still having the limitation of only being able to support individual row sizes less than 64K. So if you are ok with that limitation it should work with giraffez if you have the appropriate version for both Teradata server and the local libraries you have installed. What version of Teradata server are you using and what version of the CLIv2 and TPTAPI do you have installed? The TPTAPI seems to have started support for the JSON data types and LOB data types for the Update driver in version 15.10. For background on what these libraries are, there is short description here.