BritishGeologicalSurvey / etlhelper

ETL Helper is a Python ETL library to simplify data transfer into and out of databases.
https://britishgeologicalsurvey.github.io/etlhelper/
GNU Lesser General Public License v3.0
104 stars 25 forks source link

Make read_lob() available for fetchone, fetchmany and fetchall #110

Closed volcan01010 closed 1 year ago

volcan01010 commented 3 years ago

Summary

As an ETL user, I want read_lob to be available to all fetching methods so that I can read Oracle LOBs in results.

Description

Oracle LOBs are used to store geospatial data. They have to be explicitly read to be able to use them. The

The keyword is currently only available for copy_rows but should be usable more widely.

Acceptance criteria

volcan01010 commented 3 years ago

Actually, I've changed my mind. read_lob is a specific transform and should be carried out by the transform function. It should be deprecated and replaced with an example transform.

volcan01010 commented 3 years ago

Also, it is possible to add an extra connection setting to read the LOBs directly:

https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html

This makes things more than 10x faster for me.

volcan01010 commented 3 years ago

You can also get Oracle to do the conversion via the SQL request, but the resulting VARCHAR is limited to 4,000 characters so it won't work for large geometry objects with many vertices.

SELECT DBMS_LOB.SUBSTR( lob_column, 4000, 1 ) FROM table
volcan01010 commented 2 years ago

The simplest solution for this is to make all Oracle connections supplied by ETL Helper turn on the reading of CLOBs and BLOBs as strings and bytes by default. ETL Helper is aimed at spatial databases, where this is required. It brings the behaviour of Oracle in line with PostgreSQL text fields, too.

This will be a breaking change. If users need to to access LOBs as streaming objects then they will have to provide their own connection or drop the output_type_handler once the connection has been made.

volcan01010 commented 2 years ago

The new Oracle driver has a flag to read LOBs as Python objects: https://cjones-oracle.medium.com/open-source-python-thin-driver-for-oracle-database-e82aac7ecf5a

volcan01010 commented 1 year ago

Note: Now that #151 has been merged, we should test how the new oracledb driver behaves with CLOBS (these are often used when returning geometries as WKT). If it is signficantly faster to return them as strings (I think that it will be), then we should set that as the default and document how to change it.

https://python-oracledb.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes

cc: @sophie-taylor @leorudczenko

volcan01010 commented 1 year ago

Closing as this has been merged into for_v1.