apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Hive Query failed if the data type is array<string> with parquet files #1542

Closed: asfimport closed this issue 9 years ago

asfimport commented 10 years ago

This issue was posted on the Parquet issues list a long time ago. Since it is related to the Parquet Hive SerDe, I have created the Hive issue here. The details and history are in the link here: https://github.com/Parquet/parquet-mr/issues/281.

Reporter: Sathish
Assignee: Ryan Blue / @rdblue

Related issues:

Note: This issue was originally created as PARQUET-83. Please see the migration documentation for further details.

asfimport commented 10 years ago

Sathish: This patch fixes the issue. Since we want to use this feature in the next release of Hive, could someone review the patch changes and merge them to the main branch?

asfimport commented 10 years ago

Sathish: Can someone look into this issue and provide comments or suggestions on the fix? I have provided the patch and am waiting for it to be merged to the main branch, as we want to use this Hive feature in our next release.

asfimport commented 10 years ago

Szehon Ho: Hi Sathish, can you please fix the formatting? Indents are two spaces (Hive code is like that), put a space after each comma, etc.

Otherwise it looks good to me. But granted, I'm not an expert on Parquet schemas, so my only question is: is it compatible with other tools? + [~jcoffey], @rdblue for comments (if any).

asfimport commented 10 years ago

Hive QA:

Overall: -1 no tests executed

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12663651/HIVE-7850.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/465/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/465/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-465/

Messages:

Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-465/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'hbase-handler/src/test/results/positive/hbase_custom_key3.q.out'
Reverted 'hbase-handler/src/test/results/positive/hbase_ppd_key_range.q.out'
Reverted 'hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestHBaseKeyFactory.java'
Reverted 'hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestHBaseKeyFactory2.java'
Reverted 'hbase-handler/src/test/queries/positive/hbase_ppd_key_range.q'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseScanRange.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/CompositeHBaseKeyFactory.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/DefaultHBaseKeyFactory.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseRowSerializer.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/AbstractHBaseKeyFactory.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/LazyHBaseRow.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/AbstractHBaseKeyPredicateDecomposer.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseInputFormatUtil.java'
Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/ColumnMappings.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/exec/vector/TestVectorizationContext.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/ppd/OpProcFactory.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStoragePredicateHandler.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBetween.java'
++ egrep -v '^X|^Performing status on external'
++ awk '{print $2}'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target hbase-handler/src/test/results/positive/hbase_ppd_or.q.out hbase-handler/src/test/queries/positive/hbase_ppd_or.q hbase-handler/src/java/org/apache/hadoop/hive/hbase/OrPredicateHBaseKeyFactory.java hbase-handler/src/java/org/apache/hadoop/hive/hbase/predicate hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseScanFactory.java testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target accumulo-handler/target hwi/target common/target common/src/gen contrib/target service/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target
+ svn update

Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1619922.

At revision 1619922.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'

This message is automatically generated.

ATTACHMENT ID: 12663651

asfimport commented 10 years ago

Sathish: New patch file submitted by correcting indentations.

asfimport commented 10 years ago

Hive QA:

Overall: -1 no tests executed

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12664109/HIVE-7850.1.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/487/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/487/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-487/

Messages:

Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-487/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'common/src/java/org/apache/hadoop/hive/conf/HiveConf.java'
Reverted 'common/src/java/org/apache/hadoop/hive/conf/Validator.java'
Reverted 'service/src/java/org/apache/hive/service/cli/OperationState.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSession.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSessionBase.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/SessionManager.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/Operation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/OperationManager.java'
++ awk '{print $2}'
++ egrep -v '^X|^Performing status on external'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-unit/target itests/hive-unit/src/test/java/org/apache/hive/jdbc/miniHS2/TestHiveServer2SessionTimeout.java itests/custom-serde/target itests/util/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target accumulo-handler/target hwi/target common/target common/src/gen service/target contrib/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target
+ svn update

Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1620279.

At revision 1620279.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'

This message is automatically generated.

ATTACHMENT ID: 12664109

asfimport commented 10 years ago

Ryan Blue / @rdblue: Looking at just the changes to the schema conversion, I'm not sure why the change to the list structure was done. Previously, lists were converted to:

// array<string> name
optional group name (LIST) {
  repeated group bag {
    optional string array_element;
  }
}

This allowed the list itself to be null and allowed null elements. This patch changes the conversion to:

// array<string> name
optional group name (LIST) {
  repeated string array_element;
}

This requires that the elements are non-null. Was this on purpose? The first one looks more correct to me, but the second would be correct if nulls aren't allowed in Hive lists. In addition, the HiveSchemaConverter#listWrapper method and the ParquetHiveSerDe.ARRAY static field are no longer used but were not removed.

The other change to schema conversion tests the Repetition and calls Types.required or Types.optional. It should instead call Types.primitive(type, repetition) to pass the repetition through to the Types API. That way, Repetition.REPEATED is supported as well; not handling it is a bug in the current patch.
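
For illustration, a minimal sketch of that suggestion, assuming the parquet-mr Types builder API (package names here follow current parquet-mr releases; older ones used the parquet.* prefix, and the convertString method name is made up for the example):

import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Type.Repetition;
import org.apache.parquet.schema.Types;

public class ConvertStringSketch {
  // Pass the caller's repetition straight to the Types API instead of branching
  // into Types.required(...) / Types.optional(...), so Repetition.REPEATED works too.
  static Type convertString(String name, Repetition repetition) {
    return Types.primitive(PrimitiveTypeName.BINARY, repetition)
        .as(OriginalType.UTF8)  // Hive string maps to Parquet binary annotated as UTF8
        .named(name);
  }
}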

asfimport commented 10 years ago

Ryan Blue / @rdblue: It looks like ArrayWritableGroupConverter is only used for maps and arrays, but the array handling was added mostly in this patch. Given that most of the methods check isMap and have completely different implementations for map and array, it makes more sense to separate this into two classes, ArrayGroupConverter and MapGroupConverter. Then HiveSchemaConverter should choose the correct one based on the OriginalType annotation. If there is no original type annotation, but the type is repeated, it should use an ArrayGroupConverter.
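
A rough sketch of the dispatch being proposed; ArrayGroupConverter and MapGroupConverter do not exist yet, so this only classifies the group type the way the suggested split would (parquet-mr schema classes, current package names):

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Type.Repetition;

class ConverterChoiceSketch {
  enum Kind { MAP, ARRAY, STRUCT }

  // MAP annotation -> map converter; LIST annotation or an un-annotated repeated
  // group -> array converter; anything else -> the usual struct converter.
  static Kind choose(GroupType group) {
    OriginalType annotation = group.getOriginalType();
    if (annotation == OriginalType.MAP || annotation == OriginalType.MAP_KEY_VALUE) {
      return Kind.MAP;
    }
    if (annotation == OriginalType.LIST || group.isRepetition(Repetition.REPEATED)) {
      return Kind.ARRAY;
    }
    return Kind.STRUCT;
  }
}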

asfimport commented 10 years ago

Sathish: Hi Ryan, I agree that Hive should support lists with null elements. But can you give some idea of the cases where lists without nulls are generated? Whenever Parquet files are generated from Avro files, most of them have the array schema shown below:

optional group name (LIST) {
  repeated string array_element;
}

Do you have any suggestions on how best we can support both kinds of arrays? This patch only fixes arrays with no null entries.

asfimport commented 10 years ago

Sathish: Used Types.primitive(type, repetition) as suggested by Ryan, and I am also working on separating the map and array group converters into two separate classes. I will update my patch once my changes are done.

Regarding the LIST structure, can you give your suggestions on how we can support both lists with NULL elements and normal lists without null elements in Hive? I am of the opinion that we should build a separate structure for NULL-element lists, such as (NULL_LIST) as shown below:

// array<string> name
optional group name (NULL_LIST) {
  repeated group bag {
    optional string array_element;
  }
}

Can you provide your suggestions on this?

asfimport commented 10 years ago

Sathish: New patch submitted based on comments and suggestions from Ryan.

asfimport commented 10 years ago

Ryan Blue / @rdblue: The array fix is something we need to do in the parquet-avro module. We know it isn't allowing null elements, but Hive was, so that's why I mentioned it. Whether or not a null element is allowed depends on the repetition of the "array_element" field. If it is repeated, then it doesn't allow null. But the field directly inside the LIST group has to be repeated, so to get a nullable element you have to wrap it: create a repeated group (named "bag") whose single "array_element" field is optional. The easy way to support both non-null and nullable array elements is to switch the "array_element" field between required and optional. But I don't think we need to support non-null array elements.

If Hive has an array<string> type, are the elements nullable? If they are, then we don't need to support the other case.
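
For reference, a small sketch of the two list shapes discussed in this thread, written with parquet-mr's MessageTypeParser (binary annotated as UTF8 stands in for the string shorthand used above; class and field names are made up):

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

class ListShapes {
  // Two-level form: the element field itself is repeated, so elements cannot be null.
  static final MessageType NON_NULL_ELEMENTS = MessageTypeParser.parseMessageType(
      "message m { optional group name (LIST) { repeated binary array_element (UTF8); } }");

  // Three-level form: the repeated level is the wrapper group "bag" and the element
  // inside it is optional, so both the list and its elements can be null.
  static final MessageType NULLABLE_ELEMENTS = MessageTypeParser.parseMessageType(
      "message m { optional group name (LIST) { repeated group bag {" +
      " optional binary array_element (UTF8); } } }");
}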

asfimport commented 10 years ago

Hive QA:

Overall: -1 no tests executed

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12664368/HIVE-7850.2.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/506/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/506/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-506/

Messages:

Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.6.0_34/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-506/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'common/src/java/org/apache/hadoop/hive/conf/HiveConf.java'
Reverted 'service/src/java/org/apache/hive/service/cli/ICLIService.java'
Reverted 'service/src/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java'
Reverted 'service/src/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java'
Reverted 'service/src/java/org/apache/hive/service/cli/CLIServiceClient.java'
Reverted 'service/src/java/org/apache/hive/service/cli/CLIService.java'
Reverted 'service/src/java/org/apache/hive/service/cli/EmbeddedCLIServiceClient.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSession.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/HiveSessionBase.java'
Reverted 'service/src/java/org/apache/hive/service/cli/session/SessionManager.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/Operation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/MetadataOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetSchemasOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/HiveCommandOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetTypeInfoOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetCatalogsOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/SQLOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetFunctionsOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetTablesOperation.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/OperationManager.java'
Reverted 'service/src/java/org/apache/hive/service/cli/operation/GetTableTypesOperation.java'
Reverted 'service/src/gen/thrift/gen-py/TCLIService/ttypes.py'
Reverted 'service/src/gen/thrift/gen-cpp/TCLIService_types.cpp'
Reverted 'service/src/gen/thrift/gen-cpp/TCLIService_types.h'
Reverted 'service/src/gen/thrift/gen-rb/t_c_l_i_service_types.rb'
Reverted 'service/src/gen/thrift/gen-javabean/org/apache/hive/service/cli/thrift/TFetchResultsReq.java'
Reverted 'service/if/TCLIService.thrift'
++ egrep -v '^X|^Performing status on external'
++ awk '{print $2}'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/hcatalog-pig-adapter/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target accumulo-handler/target hwi/target common/target common/src/gen service/target service/src/test/org/apache/hive/service/cli/operation service/src/java/org/apache/hive/service/cli/FetchType.java service/src/java/org/apache/hive/service/cli/operation/OperationLog.java service/src/java/org/apache/hive/service/cli/operation/LogDivertAppender.java contrib/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target
+ svn update
U    ql/src/test/queries/clientpositive/optimize_nullscan.q
U    ql/src/test/results/clientpositive/optimize_nullscan.q.out
U    ql/src/test/results/clientpositive/tez/optimize_nullscan.q.out

Fetching external item into 'hcatalog/src/test/e2e/harness'
Updated external to revision 1620682.

Updated to revision 1620682.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+ chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
+ /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'

This message is automatically generated.

ATTACHMENT ID: 12664368

asfimport commented 10 years ago

Sathish: Thanks Ryan. Based on your comments, it looks like no particular change is needed on the Hive SerDe side for handling non-nullable arrays; instead, the parquet-avro library needs to be fixed to convert the schema format properly. I am planning to work on fixing parquet-avro so that it generates Parquet files with a schema Hive can understand, and I will post my findings once my changes are done.
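
As a standalone way to see what parquet-avro currently produces for an Avro array<string> (a sketch only, not part of any patch; class and record names are made up, and package names follow current releases, while contemporary ones used the parquet.avro prefix):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class InspectAvroList {
  public static void main(String[] args) {
    // Avro record with a single array<string> field, built programmatically.
    Schema avro = SchemaBuilder.record("rec").fields()
        .name("name").type().array().items().stringType().noDefault()
        .endRecord();
    // Convert to a Parquet schema and print it; the output shows whether the element
    // is a bare repeated field (no nulls allowed) or wrapped in a repeated group
    // containing an optional element (nulls allowed).
    MessageType parquet = new AvroSchemaConverter().convert(avro);
    System.out.println(parquet);
  }
}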

asfimport commented 10 years ago

Sathish: Submitted my changes as a pull request against parquet-mr. The details of the changes made to parquet-avro are here: https://github.com/apache/incubator-parquet-mr/pull/47

Can you review and provide suggestions on this fix?

asfimport commented 9 years ago

Ryan Blue / @rdblue: This issue is fixed by HIVE-8909. That issue includes several tests that verify Hive can read existing data with arrays created by Avro and Thrift.

asfimport commented 9 years ago

Daniel Haviv: Hi Ryan, does that mean it's included in the nightly build?

Thanks,
Daniel

asfimport commented 9 years ago

Ryan Blue / @rdblue: [~danielil], which nightly build are you referring to? It would be in Hive nightly builds because it's in trunk, but this hasn't been backported to the parquet-hive module in Parquet-mr so it wouldn't be there.

asfimport commented 9 years ago

Daniel Haviv: Hi, it seems like something is broken now. I ran this query on a Parquet table from 0.13:

0: jdbc:hive2://hdname:10000/default> select count(*) from A1;
-----------
-----------
-----------
1 row selected (48.401 seconds)

and when I run it from 0.15 I get:

2014-11-26 02:01:54,556 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:312)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:259)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:386)
  at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:652)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:298)
  ... 11 more
Caused by: java.lang.IllegalStateException: All the offsets listed in the split should be found in the file.
expected: [4, 4] found: [BlockMetaData{1560100, 413986404 [
  ColumnMetaData{GZIP [adformat] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 4},
  ColumnMetaData{GZIP [adspaces] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 336814},
  ColumnMetaData{GZIP [age] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 589854},
  ColumnMetaData{GZIP [app_id] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 625872},
  ColumnMetaData{GZIP [app_name] BINARY [BIT_PACKED, RLE, PLAIN], 2112900},
  ColumnMetaData{GZIP [bs_height] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 2112949},
  ColumnMetaData{GZIP [bs_width] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 2162163},
  ColumnMetaData{GZIP [categories, array] BINARY [RLE, PLAIN_DICTIONARY], 2211377},
  ColumnMetaData{GZIP [computer_id] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 3835760},
  ColumnMetaData{GZIP [deviceIdType] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 3836706},
  ColumnMetaData{GZIP [deviceInfo, brand_name] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 4189159},
  ColumnMetaData{GZIP [deviceInfo, device_matching_result] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 5293759},
  ColumnMetaData{GZIP [deviceInfo, device_os] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 5434944},
  ColumnMetaData{GZIP [deviceInfo, is_opera] BOOLEAN [BIT_PACKED, RLE, PLAIN], 5801709},
  ColumnMetaData{GZIP [deviceInfo, is_tablet] BOOLEAN [BIT_PACKED, RLE, PLAIN], 5802020},
  ColumnMetaData{GZIP [deviceInfo, is_touch_screen] BOOLEAN [BIT_PACKED, RLE, PLAIN], 5966941},
  ColumnMetaData{GZIP [deviceInfo, model_name] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 6088909},
  ColumnMetaData{GZIP [deviceInfo, screen_height] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 8287356},
  ColumnMetaData{GZIP [deviceInfo, screen_width] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 9167938},
  ColumnMetaData{GZIP [deviceInfo, user_agent] BINARY [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 10078081},
  ColumnMetaData{GZIP [device_id_hash] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 30019984},
  ColumnMetaData{GZIP [gender] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 30155884},
  ColumnMetaData{GZIP [geoLocationInfo, carrier] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 30193050},
  ColumnMetaData{GZIP [geoLocationInfo, city] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 32588517},
  ColumnMetaData{GZIP [geoLocationInfo, connection_type] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 35068217},
  ColumnMetaData{GZIP [geoLocationInfo, country] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 35260730},
  ColumnMetaData{GZIP [geoLocationInfo, ip] BINARY [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 36396542},
  ColumnMetaData{GZIP [geoLocationInfo, ip_routing_type] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 45147363},
  ColumnMetaData{GZIP [geoLocationInfo, is_cache] BOOLEAN [BIT_PACKED, RLE, PLAIN], 45339937},
  ColumnMetaData{GZIP [geoLocationInfo, is_longlat_from_req] BOOLEAN [BIT_PACKED, RLE, PLAIN], 45340248},
  ColumnMetaData{GZIP [geoLocationInfo, latitude] DOUBLE [BIT_PACKED, RLE, PLAIN_DICTIONARY], 45533425},
  ColumnMetaData{GZIP [geoLocationInfo, longitude] DOUBLE [BIT_PACKED, RLE, PLAIN_DICTIONARY], 48382336},
  ColumnMetaData{GZIP [geoLocationInfo, mobile_operator] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 51375539},
  ColumnMetaData{GZIP [geoLocationInfo, postal_code] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 52219982},
  ColumnMetaData{GZIP [geoLocationInfo, region] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 53257246},
  ColumnMetaData{GZIP [handling_time] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 54739550},
  ColumnMetaData{GZIP [impression_id] BINARY [BIT_PACKED, RLE, PLAIN], 56296474},
  ColumnMetaData{GZIP [markup] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 75636083},
  ColumnMetaData{GZIP [mode] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 75903538},
  ColumnMetaData{GZIP [publisher_id] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 75905807},
  ColumnMetaData{GZIP [referrer] BINARY [BIT_PACKED, RLE, PLAIN], 76560085},
  ColumnMetaData{GZIP [request_date] INT64 [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 76560134},
  ColumnMetaData{GZIP [request_id] BINARY [BIT_PACKED, RLE, PLAIN], 79416424},
  ColumnMetaData{GZIP [request_source] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 95503559},
  ColumnMetaData{GZIP [request_type] INT32 [BIT_PACKED, RLE, PLAIN_DICTIONARY], 95770325},
  ColumnMetaData{GZIP [selected_campaigns] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 95770977},
  ColumnMetaData{GZIP [site_id] BINARY [BIT_PACKED, RLE, PLAIN_DICTIONARY], 96469724},
  ColumnMetaData{GZIP [source_ip] BINARY [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 97567391},
  ColumnMetaData{GZIP [udid] BINARY [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 102171524},
  ColumnMetaData{GZIP [user_id] BINARY [BIT_PACKED, RLE, PLAIN, PLAIN_DICTIONARY], 110665022}]}]
out of: [4, 111186045, 221785906, 332758529, 450099685, 566558359, 677625032] in range 0, 134217728
  at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:180)
  at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:71)
  at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat$VectorizedParquetRecordReader.<init>(VectorizedParquetInputFormat.java:63)
  at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:153)
  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:65)
  at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
  ... 16 more

and

2014-11-26 02:01:55,579 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator: 1 Close done
2014-11-26 02:01:55,579 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2014-11-26 02:01:55,579 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator: 11 Close done
2014-11-26 02:01:55,583 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: java.lang.NullPointerException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:273)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:183)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:198)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:184)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: java.lang.NullPointerException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
  at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
  at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:271)
  ... 11 more
Caused by: java.lang.NullPointerException
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:191)
  at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat$VectorizedParquetRecordReader.next(VectorizedParquetInputFormat.java:117)
  at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat$VectorizedParquetRecordReader.next(VectorizedParquetInputFormat.java:49)
  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
  ... 15 more

My parquet files were generated by Spark (I can share them if you need them for testing purposes).

Daniel


asfimport commented 9 years ago

Ryan Blue / @rdblue: [~danielil], I wouldn't expect this to be fixed in your environment if you're running Hive 0.13.x. This is currently in the Hive trunk, but hasn't been released. If you were using nightly builds of Hive, then you would see the fix.

asfimport commented 9 years ago

Pranav Singh: Can you please tell which version this fix is going to be included in? Cloudera's CDH 5.3 says there are a lot of Hive and Parquet fixes; did this one get included?

asfimport commented 9 years ago

Ryan Blue / @rdblue: [~pranavkrs], HIVE-8909 is in CDH 5.3. For that release, Hive should be able to read files created by parquet-avro, parquet-thrift, and parquet-hive that use the container types.