Temporary CSV files generated have control M characters

embulk / embulk-input-jdbc

MySQL, PostgreSQL, Redshift and generic JDBC input plugins for Embulk

Other

102 stars 74 forks source link

Temporary CSV files generated have control M characters #175

Closed OmkarPathak closed 4 years ago

OmkarPathak commented 4 years ago

We are trying to load data from Oracle to Bigquery. However, temporary CSV files that are generated contain control M characters which results in newline. Bugquery hence cannot process this csv file. Are there any filters to get rid of such cases? Any help would be appreciated. Note: Specified tables are around 15+ GBs in size. Sample config.yml:

in:
  type: oracle
  driver_path: /ojdbc7-12.1.0.2.jar
  url: jdbc:oracle:thin:@something.com:1526/DB
  user: user
  password: "password"
  query: "SELECT * SCHEMA.TABLE"
  fetch_rows: 4000
  connect_timeout: 100
  formatter:
    type: csv
    delimiter: ","
    newline: CR
    newline_in_field: CR
    escape: "\\"
    null_string: "\\N"
out:
   type: bigquery
   mode: replace
   auth_method: service_account
   project: project
   dataset: "dataset"
   table: "table"
   location: europe-west2
   json_keyfile: credentials.json
   allow_quoted_newlines: true
   abort_on_error: false
   delete_from_local_when_job_end: false

hiroyuki-sato commented 4 years ago

Hello, @OmkarPathak

I'm not sure temporary CSV means. It seems that input data is an Oracle database.

Anyway, Have you ever tried the embulk-filter-ruby_proc plugin? If an input data contains CR, you can remove/replace it.

This is an example.

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub(",","") }

Change CR -> LF

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub("\r","\n") }

Trim CR

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub("\r","") }

Another idea is to use embulk-filter-to_json for example. Could you read this document? https://github.com/embulk/embulk-output-bigquery#formatter-performance-issue

sakama commented 4 years ago

I don't think embulk-input-oracle creates temporary CSV file. Additionally, embulk-input-oracle is a Input plugin, isn't a File input plugin. You can't use formatter: option with embulk-input-oracle. Formatter plugin only works with File input plugin.

The component that creates a CSV file is embulk-output-bigquery. https://github.com/embulk/embulk-output-bigquery/blob/a4b25bd4b814d9f7f5efb3df0af8fc19a48f0a8f/lib/embulk/output/bigquery/file_writer.rb#L29-L34

Why don't you use source_format: NEWLINE_DELIMITED_JSON ? So that file will be formatted with JSON. https://github.com/embulk/embulk-output-bigquery/#same-options-of-bq-command-line-tools-or-bigquery-jobs-property Although the JSON file also may contain a newline, the process itself would succeed.

OmkarPathak commented 4 years ago

@sakama @hiroyuki-sato thanks for such quick responses. I tried the solution provided by @sakama and it worked! Thanks a lot 😄