apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 984 forks source link

[DRILL-8457] Allow configuring csv parser in http storage plugin configuration #2840

Closed ztomanek-dw closed 9 months ago

ztomanek-dw commented 9 months ago

DRILL-8457: Allow configuring csv parser in http storage plugin configuration

Description

HttpApiConfiguration was extended with csvOptions field which allows setting a following properties:

{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}

this provides greater csv parsing flexibility since user can set different delimiters, number of columns or max column size.

Also backward compatibility is ensured and parser works same as before if csvOptions is null.

Documentation

Add a following paragraph into https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections

CSV parser options

CSV parser of HTTP Storage plugin can be configured using csvOptions.

{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}

E.g. to parse .tsv files you can use a following config:

{
  "csvOptions": {
    "delimiter": "\t"
  }
}

Testing

Create a following storage plugin with name github

{
  "type": "http",
  "connections": {
    "test-data": {
      "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
      "requireTail": false,
      "method": "GET",
      "authType": "none",
      "inputType": "csv",
      "xmlDataLevel": 1,
      "postParameterLocation": "QUERY_STRING",
      "csvOptions": {
        "delimiter": "\t",
        "quote": "\"",
        "quoteEscape": "\"",
        "lineSeparator": "\n",
        "numberOfRecordsToRead": -1,
        "lineSeparatorDetectionEnabled": true,
        "maxColumns": 512,
        "maxCharsPerColumn": 4096,
        "skipEmptyLines": true,
        "ignoreLeadingWhitespaces": true,
        "ignoreTrailingWhitespaces": true
      },
      "verifySSLCert": true
    }
  },
  "timeout": 5,
  "retryDelay": 1000,
  "proxyType": "direct",
  "authMode": "SHARED_USER",
  "enabled": true
}

Then query tsv file with

SELECT * from github.`test-data`

You should see a result set containing three columns

ztomanek-dw commented 9 months ago

@cgivre Thanks for your feedback!

According to your comments:

Let me know if you see anything else to cover :)

ztomanek-dw commented 9 months ago

@cgivre Thanks for the clarification, I was not sure if I could push multiple commits per one jira issue. I've applied your suggestions and made sure it's rebased to current master