DTStack / chunjun

A data integration framework
https://dtstack.github.io/chunjun/
Apache License 2.0

[Feature][s3] Add support for reading all document types supported by Apache Tika, plus Excel files #1918

Closed libailin closed 1 week ago

libailin commented 1 month ago

Search before asking

Description

Add support for reading every document type supported by Apache Tika, and add support for reading Excel files.

The two groups of parameters (Tika extraction and Excel reading) cannot be used at the same time.

Use case

CREATE TABLE source
(
    content String,
    metadata String
) WITH (
    'connector' = 's3-x',
    'accessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/pdf-source/20240528/.*"]',
    'endpoint' = 'http://10.x.x.x',
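    -- Whether to enable chunking (default: false)
    'tika-use-extract' = 'true'
    -- Chunk size; the default of -1 means no chunking, so the whole content is extracted
    ,'tika-chunk-size' = '40'
    -- Content overlap ratio between chunks, 0-100
    ,'tika-overlap-ratio' = '0'
    -- Disable injecting the bucket name as an endpoint prefix (default: false); set to true when the endpoint is a domain name
    ,'disableBucketNameInEndpoint' = 'true'
    -- Regular expression used to match objects
    ,'objectsRegex' = '.*\.doc'
    -- Read Excel files
    ,'use-excel-format' = 'true'
    -- Column indexes in the Excel sheet to read
    ,'column-index'='0,1,3'
    -- Specific worksheets in the Excel file to read
    ,'sheet-no'='0,2'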
);

CREATE TABLE sink
(
    content String,
    metadata String
) WITH (
      'connector' = 'stream-x',
      'print' = 'true'
      );

INSERT INTO sink SELECT * FROM source;
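
Since the description states that the Tika parameters and the Excel parameters cannot be used together, a real job would presumably configure only one group per source table. The following is a minimal sketch of that split, not a confirmed configuration. The table names tika_source and excel_source, the Excel column names, the Excel object path, and the positional mapping from table columns to 'column-index' are all assumptions made for illustration. The access key option is written as 'accessKey', which is assumed to be the spelling the connector expects.

```sql
-- Sketch 1: Tika extraction only (no Excel options set).
CREATE TABLE tika_source
(
    content  STRING,
    metadata STRING
) WITH (
    'connector' = 's3-x',
    'accessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/pdf-source/20240528/.*"]',
    'endpoint' = 'http://10.x.x.x',
    'tika-use-extract' = 'true',
    'tika-chunk-size' = '40',
    'tika-overlap-ratio' = '0'
);

-- Sketch 2: Excel reading only (no tika-* options set).
CREATE TABLE excel_source
(
    -- Hypothetical column names; assumed to map in order to
    -- Excel column indexes 0, 1 and 3.
    col_a STRING,
    col_b STRING,
    col_d STRING
) WITH (
    'connector' = 's3-x',
    'accessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    -- Hypothetical object path for Excel files.
    'objects' = '["/excel-source/.*"]',
    'endpoint' = 'http://10.x.x.x',
    'use-excel-format' = 'true',
    'column-index' = '0,1,3',
    'sheet-no' = '0,2'
);
```

Keeping each group in its own table definition makes the mutual exclusivity visible in the DDL itself, so a job cannot mix the two modes by accident.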

Related issues

No response

Are you willing to submit a PR?

Code of Conduct