chris-bishop-tfs / data_io

5 stars 0 forks source link

Update Oracle JDBC Defaults #4

Closed chris-bishop-tfs closed 3 years ago

chris-bishop-tfs commented 3 years ago

Turns out Oracle defaults to retrieving 10 rows at a time, applying an operation, then grabbing the next 10 rows. This is a key reason for the slow processing times.

This can be changed by providing a fetchsize option. See below for an example using our brand new data IO API.

from data_io import build_connection
# Define data locations
rom_targets_url = 'oracle://user@CDWPRD-rac-db.thermo.com:1521/cdwprd_users.cdwcmmo.recurring_orders'
# Start loading some data
#  Increasing fetchsize otherwise load is silly slow.
#  Reference: https://docs.oracle.com/cd/A87860_01/doc/java.817/a83724/resltse5.htm
rom_targets = build_connection(rom_targets_url).read(fetchsize=10000).cache()

Relevant references:

https://docs.oracle.com/cd/A87860_01/doc/java.817/a83724/resltse5.htm https://stackoverflow.com/questions/40007997/why-does-spark-query-load-from-oracle-is-so-slow-comparing-to-sqoop

chris-bishop-tfs commented 3 years ago

Added to defaults