TIBCOSoftware / snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
http://www.snappydata.io

Persisting Column table to HDFS in Snappy #400

Open shivani-gupta opened 8 years ago

shivani-gupta commented 8 years ago

I am trying to persist column tables in HDFS. I created an HDFSSTORE as described in the configuration docs, but when I try to use it while creating a column table through the snappy shell, the 'HDFSSTORE' keyword is not recognized. Following is the syntax I used to create the column table (I was able to create the HDFSSTORE successfully).

CREATE TABLE if not exists SnappyColumnTable USING column OPTIONS(PARTITION BY 'id', buckets '11', PERSISTENT 'SYNCHRONOUS', HDFSSTORE 'test')

AFAIK the GemFire XD HDFSSTORE documentation applies to row tables. Should we use the same syntax for the column store as well? Can we store column tables in HDFS? If yes, please point me to the correct documentation.

Environment

SnappyData 0.6 RC1
Node 1 -> Lead + Locator
Nodes 2, 3, 4 -> Servers

thanks in advance.

kneeraj commented 8 years ago

AFAIK the GemFire XD HDFSSTORE documentation applies to row tables.

Yes.

Should we use the same syntax for the column store as well? Can we store column tables in HDFS?

Column tables cannot be stored in HDFS by this scheme. Sincere apologies for the inconvenience. We are in the process of making our documentation clearer about the differences between row and column table support, and of throwing a proper UnsupportedOperationException in all such scenarios.

Alternative solution: how are you inserting the data into the column table? Is it through a Dataset that is then written into the column table, or through JDBC? If you are doing it through JDBC then we cannot move that data to HDFS as of now. But if you are creating a Dataset/DataFrame then you can use the standard Spark APIs to write it to HDFS, e.g. dataset.write.csv(hdfs_path); see the sketch below.
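For concreteness, here is a minimal sketch of that approach, not taken from the project itself: it assumes a Spark session is already in scope as `spark`, the table name matches the one in this issue, and the HDFS path is hypothetical.

```scala
import org.apache.spark.sql.SaveMode

// Read the column table back as a DataFrame.
val df = spark.table("SnappyColumnTable")

// Write it out to HDFS with the standard DataFrameWriter API
// (CSV as mentioned above; df.write.parquet(...) works the same way).
df.write
  .mode(SaveMode.Overwrite)
  .csv("hdfs://namenode:8020/export/snappy_column_table") // hypothetical path
```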

shivani-gupta commented 8 years ago

Thank you for your response. We are ingesting incoming events through a streaming table and appending the data, as DataFrames, to a column table (a sketch of this append step is below). We want low-latency writes (appends) to the column store, which is why we use a SnappyData column table and run queries immediately after appending the data. Where is the column table data stored in SnappyData? How safe is it to store larger volumes (100 million+ records) in a column table? Is there any replication feature available for column tables?
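For reference, a minimal sketch of that append step, under stated assumptions: `events` and `appendBatch` are hypothetical names for the DataFrame built from each streaming batch and the helper that writes it, and the column table already exists.

```scala
import org.apache.spark.sql.DataFrame

// Append one batch of events to the existing column table.
// insertInto appends rows by position, so the DataFrame's columns must
// match the table's column order.
def appendBatch(events: DataFrame): Unit = {
  events.write.insertInto("SnappyColumnTable")
}
```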

thanks in advance

kneeraj commented 8 years ago

The column data is stored in the memory of the SnappyData servers. It can be configured to be in memory only, or both in memory and on disk. There is an option called REDUNDANCY which can be provided during CREATE TABLE to configure redundant copies; setting it makes the table highly available, and adding 'PERSISTENT' makes it safe against unexpected JVM crashes and cluster shutdowns (a sketch follows below). Please refer to this: http://snappydatainc.github.io/snappydata/rowAndColumnTables/
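A minimal sketch of such a CREATE TABLE, assuming a SnappySession (SnappyData's extension of SparkSession) built from an existing SparkContext; the option names follow this issue and the comment above, and exact spellings can vary by release, so check the linked docs for your version.

```scala
import org.apache.spark.sql.SnappySession

val snappy = new SnappySession(spark.sparkContext)

// REDUNDANCY '1' keeps one extra copy of each bucket on another server;
// PERSISTENT 'SYNCHRONOUS' also writes the data to the local disk stores.
snappy.sql(
  """CREATE TABLE IF NOT EXISTS SnappyColumnTable
    |USING column
    |OPTIONS (PARTITION BY 'id',
    |         BUCKETS '11',
    |         REDUNDANCY '1',
    |         PERSISTENT 'SYNCHRONOUS')""".stripMargin)
```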

sumwale commented 7 years ago

Automatic persistence of column tables to Parquet with the Hadoop APIs is planned for the near future. For now the data can be persisted directly via the DataFrameWriter Spark APIs. Will update this ticket when work starts on automatic persistence.