Spark configured with JindoFS fails when querying a Hive ORC table #60

Closed urzeric closed 3 years ago

urzeric commented 3 years ago

We run self-managed Hadoop on Alibaba Cloud ECS, with Alibaba Cloud OSS as the storage layer.

Hadoop 3.1.1, Hive 3.1.0, Spark 2.3.2.

OSS reads and writes are configured through the JindoFS SDK (jindofs-sdk-3.4.0).

Queries run through Hive all work fine.

But with spark-sql, queries against Hive ORC tables fail (non-ORC tables are fine):

 spark-sql> select count(1) from dsp.log_bid_request_orc where dt=20210418; 

The WHERE clause restricts the query to partition dt=20210418, yet Spark scans every historical date of the table. I don't understand why the filter is not applied; could this be some other configuration problem? The error is as follows:

21/04/20 14:52:04 INFO FsStats: cmd=download, src=oss://dxxxh.oss-cn-beijing-internal.aliyuncs.com/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000000_0, dst=null, size=32790412, parameter=byteReaded:0,byteNeeded:0,readTimes:0, time-in-ms=0, version=3.4.0
21/04/20 14:52:04 ERROR SparkSQLDriver: Failed in [select count(1) from dsp.log_bid_request_orc where dt=20210418]
java.io.IOException: java.io.IOException: ErrorCode : 403 , ErrorMsg: HTTP/1.1 403 Forbidden: <?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>InvalidObjectState</Code>
  <Message>The operation is not valid for the object's state</Message>
  <RequestId>607E7A149F94E73735F40283</RequestId>
  <HostId>dxxxh.oss-cn-beijing-internal.aliyuncs.com</HostId>
  <ObjectName>dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000000_0</ObjectName>
</Error>
 ERROR_CODE : 1010
    at com.alibaba.jboot.JbootBlockletReader.read(JbootBlockletReader.java:58)
    at com.alibaba.jboot.JbootBlockletReader.read(JbootBlockletReader.java:47)
    at com.alibaba.jboot.JbootBlockletReader.randomRead(JbootBlockletReader.java:33)
    at com.aliyun.emr.fs.internal.ossnative.JindoOssInputStream.readFromPostion(JindoOssInputStream.java:98)
    at com.aliyun.emr.fs.internal.JindoInputStream.readFully(JindoInputStream.java:213)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:549)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:364)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:251)
    at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:59)
    at org.apache.spark.sql.execution.datasources.orc.OrcUtils$$anonfun$readSchema$3.apply(OrcUtils.scala:82)
    at org.apache.spark.sql.execution.datasources.orc.OrcUtils$$anonfun$readSchema$3.apply(OrcUtils.scala:82)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.TraversableOnce$class.collectFirst(TraversableOnce.scala:145)
    at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1334)
    at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:82)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:84)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$inferIfNeeded(HiveMetastoreCatalog.scala:239)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$4$$anonfun$5.apply(HiveMetastoreCatalog.scala:167)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$4$$anonfun$5.apply(HiveMetastoreCatalog.scala:156)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$4.apply(HiveMetastoreCatalog.scala:156)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$4.apply(HiveMetastoreCatalog.scala:148)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.withTableCreationLock(HiveMetastoreCatalog.scala:54)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:148)
    at org.apache.spark.sql.hive.RelationConversions.org$apache$spark$sql$hive$RelationConversions$$convert(HiveStrategies.scala:212)
    at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:239)
    at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:228)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.a

Configuration steps:

  1. https://github.com/aliyun/alibabacloud-jindofs/blob/master/docs/jindofs_sdk_how_to_hadoop.md

  2. https://github.com/aliyun/alibabacloud-jindofs/blob/master/docs/spark/jindosdk_on_spark.md

    A ticket has already been filed with Alibaba Cloud:

    Ticket number: 0006913TNY

Questions:

  1. Hi, queries on non-ORC tables work fine, but spark-sql queries on ORC tables fail.
  2. Why does spark-sql do a full scan by default when querying an ORC table?
urzeric commented 3 years ago

The issue of the filter not taking effect when spark-sql reads the ORC table has been resolved; it is not a jindofs-sdk problem.

It was resolved by changing the following setting: spark.sql.hive.convertMetastoreOrc true -> false

Reference: https://blog.csdn.net/u013332124/article/details/109359773
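
For reference, a minimal sketch of that workaround as a per-session setting in spark-sql (the property name is confirmed above; making it permanent means adding the same line to spark-defaults.conf):

    -- Workaround described above: fall back to Hive's own ORC SerDe/reader
    -- instead of Spark's built-in ORC data source (session-level setting).
    SET spark.sql.hive.convertMetastoreOrc=false;

    -- Re-run the failing query; the dt partition filter is now honored.
    SELECT count(1) FROM dsp.log_bid_request_orc WHERE dt=20210418;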

adrian-wang commented 3 years ago

Is it that you don't have permission for a full-table scan?

urzeric commented 3 years ago

Thanks for the reply.

It is not a lack of permission for a full-table scan. The query already restricts the partition, so logically there should be no full-table scan at all (it just happens that that older data has already been moved to cold storage on OSS).

On the purpose of spark.sql.hive.convertInsertingPartitionedTable: Hive has its own SerDe for writing Parquet/ORC tables, but Spark considers Hive's SerDe slow and ships its own, so when writing data to Parquet or ORC tables, Spark SQL uses its internal SerDe by default.

After changing the two parameters above to false, Hive's own SerDe is used by default, and reads and writes work normally.

There is also another compatibility approach that I have not tried, so I don't know whether it works: https://support.huaweicloud.com/cmpntguide-mrs/mrs_01_2028.html

adrian-wang commented 3 years ago

Using a different SerDe would not make the OSS server return a 403; a 403 is always a permission problem. By default, Spark lists the entire directory tree of ORC and Parquet tables. Disabling spark.sql.hive.convertMetastoreOrc skips that full directory listing for ORC tables, which is why you no longer hit the error. It should still be permission-related; the setting merely sidesteps a permission check that wasn't needed. It has nothing to do with the other compatibility approach you posted.

urzeric commented 3 years ago

Understood. Your explanation is probably the root cause of the problem, and I have merely worked around it. But is there any particular use case for this full directory-listing mechanism? It adds latency and seems of limited value.

Thanks.

urzeric commented 3 years ago

Additional information, from inspecting the OSS files:

  1. Comparing 20210201 and 20191201, the file permissions look identical.
  2. Why does hadoop fs -ls on OSS not show the owning user and group?
    root@Dsphdp001:/opt# hadoop fs -ls -h oss://dxxxh/dsp/log_bid_request_orc/dt=20210201/hour=00/adx=12/
    Found 1 items
    -rw-rw-rw-   1     44.3 M 2021-02-01 01:29 oss://dxxxh/dsp/log_bid_request_orc/dt=20210201/hour=00/adx=12/000000_0
    root@Dsphdp001:/opt# hadoop fs -ls -h oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/
    Found 4 items
    -rw-rw-rw-   1     31.3 M 2019-12-01 01:11 oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000000_0
    -rw-rw-rw-   1     31.3 M 2019-12-01 01:11 oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000001_0
    -rw-rw-rw-   1     31.3 M 2019-12-01 01:11 oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000002_0
    -rw-rw-rw-   1     31.2 M 2019-12-01 01:11 oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000003_0
    root@Dsphdp001:/opt# hadoop fs -ls -h /
    Found 13 items
    drwxrwxrwt   - yarn   hadoop          0 2020-12-11 10:16 /app-logs
    drwxr-xr-x   - hdfs   hdfs            0 2020-12-29 16:54 /apps
    drwxr-xr-x   - yarn   hadoop          0 2020-12-07 13:15 /ats
    drwxr-xr-x   - hdfs   hdfs            0 2020-12-07 13:15 /atsv2
    drwxr-xr-x   - hdfs   hdfs            0 2020-12-07 13:15 /hdp
    drwxr-xr-x   - hdfs   hdfs            0 2021-04-06 14:16 /kylin
    drwx------   - livy   hdfs            0 2020-12-28 11:53 /livy2-recovery
    drwxr-xr-x   - mapred hdfs            0 2020-12-07 13:15 /mapred
    drwxrwxrwx   - mapred hadoop          0 2020-12-07 13:15 /mr-history
    drwxrwxrwx   - spark  hadoop          0 2021-04-25 19:27 /spark2-history
    drwxrwxrwx   - hdfs   hdfs            0 2021-01-13 15:30 /tmp
    drwxr-xr-x   - root   hdfs            0 2021-04-20 16:28 /user
    drwxr-xr-x   - hdfs   hdfs            0 2020-12-07 13:36 /warehouse
    root@Dsphdp001:/opt# hadoop fs -ls -h oss://dxxxh/
    Found 5 items
    drwxrwxrwx   -          0 1970-01-01 08:00 oss://dxxxh/algo
    drwxrwxrwx   -          0 1970-01-01 08:00 oss://dxxxh/dmp
    drwxrwxrwx   -          0 2021-04-23 18:42 oss://dxxxh/dsp
    drwxrwxrwx   -          0 2020-06-05 11:44 oss://dxxxh/tem
    drwxrwxrwx   -          0 1970-01-01 08:00 oss://dxxxh/user
adrian-wang commented 3 years ago

The permissions on the OSS server side are not the same as the ACL bits you see from the hadoop fs command; the OSS permissions shown by hadoop fs are fake, and the OSS server side is authoritative.

adrian-wang commented 3 years ago

The full-table scan samples files to infer the table schema; this is required for datasource tables. If you turn off that convert setting, Spark stops treating the Hive ORC table as a datasource table and calls Hive's reader directly, so none of this happens.
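
For completeness, the inferIfNeeded frame in the stack trace above is this schema-inference step. Spark 2.3 also exposes a dedicated knob for it, spark.sql.hive.caseSensitiveInferenceMode; a hedged sketch (not verified on this cluster) of skipping file-based inference while leaving the convert path enabled:

    -- Sketch only: trust the metastore schema instead of sampling ORC file footers.
    -- The Spark 2.3 default is INFER_AND_SAVE; NEVER_INFER skips reading the files.
    SET spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
    SELECT count(1) FROM dsp.log_bid_request_orc WHERE dt=20210418;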

urzeric commented 3 years ago

The ACL permissions you mention should be these, the file read/write ACL settings. This RAM account has permission over the entire OSS bucket, so it should not be possible for it to have read/write access to 20210201 but not to 20191201. (screenshot: Screenshot_select-area_20210428142256)

adrian-wang commented 3 years ago

Yet that is what the OSS server actually returned; you can open a ticket with OSS.

urzeric commented 3 years ago

OK, thanks for the explanation!

adrian-wang commented 3 years ago

Closing the issue for now; feel free to reopen if there are further questions.

urzeric commented 3 years ago

OK

joint-song commented 1 year ago

https://www.alibabacloud.com/help/en/object-storage-service/latest/http-403-status-code

The InvalidObjectState status described in the Alibaba Cloud documentation link doesn't seem to be permission-related; it is mainly caused by reading data that has already been archived.
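
If archived objects are indeed the cause, they must be restored to a readable state before any reader (Hive, Spark, or the JindoFS SDK) can fetch them. A minimal sketch using the ossutil CLI, assuming ossutil is installed and configured with credentials for this bucket (check the exact flags against the OSS documentation for your ossutil version):

    # Check the storage class of one of the failing objects from the log above
    ossutil stat oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/hour=00/adx=12/000000_0

    # If X-Oss-Storage-Class is Archive, restore the affected prefix
    # (-r restores recursively; restored copies stay readable for a limited time)
    ossutil restore -r oss://dxxxh/dsp/log_bid_request_orc/dt=20191201/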