Closed oliverzy closed 9 years ago
Hi
Here is the code I used to test the handler
In Cassandra:
create keyspace DEMO;
use DEMO;
create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
set Users[1234][name] = scott;
set Users[1234][password] = tiger;
get Users[1234];
=> (column=name, value=scott, timestamp=1364223476937000)
=> (column=password, value=tiger, timestamp=1364223823037000)
Returned 2 results.
Elapsed time: 23 msec(s).
In Hive:
CREATE EXTERNAL TABLE cassandra_table (key string, colname string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,:column,:value",
  "cassandra.cf.name" = "Users",
  "cassandra.host" = "127.0.0.1",
  "cassandra.port" = "9160",
  "cassandra.partitioner" = "org.apache.cassandra.dht.RandomPartitioner"
)
TBLPROPERTIES ("cassandra.ks.name" = "DEMO");

select * from cassandra_table;
Here is the output:
1234 name scott
1234 password tiger
This made me believe that the handler was working the same way as the original storage handler found at
https://github.com/riptano/hive/tree/hive-0.8.1-merge/cassandra-handler
I am not sure why you see duplicate rows but I suspect that the original handler would likely behave the same way.
I will try to find some time later this week to take a closer look at this problem.
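For anyone following along, here is a small Python sketch of what the ":key,:column,:value" mapping is expected to do, judging only from the output above: each Cassandra column of a row becomes its own Hive row. This is an illustration of the expected result shape, not the handler's actual code; the function name `transpose` and the dictionary layout are made up for the example.

```python
def transpose(rows):
    """Flatten {row_key: {column: value}} into (key, column, value) tuples.

    Hypothetical model of the transposed mapping; one output tuple per
    Cassandra column, so a row with two columns yields two tuples.
    """
    return [
        (key, column, value)
        for key, columns in rows.items()
        for column, value in columns.items()
    ]


# The single Cassandra row from the test above.
users = {"1234": {"name": "scott", "password": "tiger"}}

for record in transpose(users):
    print(*record)
```

So two Hive rows for the single key 1234 is the expected result here; four rows would be the duplication being reported.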
Hi,
Thanks for the quick reply. I tried the same code you posted and still get duplicated results:
1234 name scott
1234 password tiger
1234 name scott
1234 password tiger
That's weird. I am using DataStax Cassandra 1.2.3 Community Edition with Hive 0.9 and Hadoop 1.1.2 in local mode. The operating system is Mac OS.
What about your hive-site.xml? Is there any special configuration in it?
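In case it helps narrow this down, here is a rough Python sketch that checks whether every row in the query output appears the same number of times. If the whole result set is exactly doubled, that might point at the same input being scanned twice, but that is only a guess on my part. The helper name `duplication_factors` is made up for this example, and the sample text is just the output pasted above.

```python
from collections import Counter


def duplication_factors(lines):
    """Return the set of distinct occurrence counts across non-empty lines.

    {1} means no duplicates, {2} means every row appears exactly twice,
    and a mixed set such as {1, 2} means only some rows are duplicated.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return set(counts.values())


# Output of "select * from cassandra_table" as pasted above.
output = """\
1234 name scott
1234 password tiger
1234 name scott
1234 password tiger
"""

print(duplication_factors(output.splitlines()))
```

For larger tables the query output could be saved to a file first and its lines passed to the same function.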
Yes, this is strange. I tried your example and I am still getting a single record:
In Cassandra:
Use DEMO;

CREATE COLUMN FAMILY users2
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
  {column_name: full_name, validation_class: UTF8Type}
  {column_name: email, validation_class: UTF8Type}
  {column_name: state, validation_class: UTF8Type}
  {column_name: gender, validation_class: UTF8Type}
  {column_name: birth_year, validation_class: LongType}
];

set users2['1234']['full_name'] = 'John Doe';
get users2[1234];

[default@DEMO] get users2[1234] ... ;
=> (column=full_name, value=John Doe, timestamp=1367853483058000)
Returned 1 results.
Elapsed time: 13 msec(s)
In Hive:
DROP TABLE cassandra_users2;

CREATE EXTERNAL TABLE cassandra_users2 (key string, full_name string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,full_name",
  "cassandra.cf.name" = "users2",
  "cassandra.host" = "127.0.0.1",
  "cassandra.port" = "9160",
  "cassandra.partitioner" = "org.apache.cassandra.dht.RandomPartitioner"
)
TBLPROPERTIES ("cassandra.ks.name" = "DEMO");
select * from cassandra_users2;
The output:
13/05/06 10:31:11 DEBUG hadoop.ColumnFamilyRecordReader: Finished scanning 1 rows (estimate was: 128)
1234 John Doe
13/05/06 10:31:11 INFO exec.TableScanOperator: 0 finished. closing...
My environment:
Apache Cassandra 1.2.3
Hadoop 0.20.2
Hive 0.10.0
Red Hat Enterprise Linux 6.2
My hive-site.xml looks like this (sensitive information marked out):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<configuration>
  <property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/var/XXXXXXXXXXXXXXX</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://XXXXXXXXXXXXXXX.com/hivemetastoredb?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.mapping.Schema</name>
    <value>HIVE</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>XXXXXXXXXXXXXXX</value>
  </property>
  <property>
    <name>org.jpox.autoCreateSchema</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://XXXXXXXXXXXXXXX:9091</value>
  </property>

  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://XXXXXXXXXXXXXXX.com:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Hi,
It looks like your Hadoop is running in a cluster setup, whereas I am running in local mode. I also tried Hive 0.10.0, still no luck.
Closed. Fixed upstream.
Hi Dmitry,
I tried your build with Cassandra 1.2.3 and Hive 0.9.0, and I have an issue: I always get duplicated records in Hive.

Cassandra column family:
CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
  {column_name: full_name, validation_class: UTF8Type}
  {column_name: email, validation_class: UTF8Type}
  {column_name: state, validation_class: UTF8Type}
  {column_name: gender, validation_class: UTF8Type}
  {column_name: birth_year, validation_class: LongType}
];
Hive Table:
CREATE EXTERNAL TABLE IF NOT EXISTS users (key string, full_name string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,users:full_name",
  "cassandra.cf.name" = "users"
)
TBLPROPERTIES ("cassandra.ks.name" = "ks33");
Hive Query:
select * from users;
always returns duplicated rows (one row appears twice).
select count(1) from users;
returns 2, but I only inserted one row. Do you have any idea why this happens?