datasalt / splout-db

A web-latency SQL spout for Hadoop.

How to run the pagecounts example inside? #38

Closed tongping closed 10 years ago

tongping commented 10 years ago

I found that there is a pagecounts example under src/main/java/com/splout/db/examples/. How do I compile and run this example? Is there a tutorial that shows this?

BTW, if we want to add some functionality to Splout, for example a wordcount, how should we add the code? I would appreciate an example of how to do this. Thanks so much!

pereferrera commented 10 years ago

Hi Tongping,

You can run the pagecounts example from the distribution folder with the following command, assuming you have uploaded the "examples" folder to HDFS:

hadoop jar splout-hadoop-0.2.6-SNAPSHOT-hadoop.jar pagecounts -i examples/pagecounts/pagecounts-sample -np 2 -o out-pagecounts

This will create the tablespace with 2 partitions.

The source code of this example is a good place to learn how to use the splout-hadoop Java API: https://github.com/datasalt/splout-db/blob/master/splout-hadoop/src/main/java/com/splout/db/examples/PageCountsExample.java

The splout-hadoop Java API allows you to define tablespaces programmatically. It also allows you to create custom RecordProcessors (as in the PageCountsExample). However, your input has to be prepared for indexing beforehand: if you want to index a WordCount, you should run the WordCount first and then the Splout indexing. You don't use Splout to compute things; you use Splout to index things you have already computed.
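
To illustrate the "compute first, index afterwards" split, here is a minimal sketch of the compute step only. It is a plain-Java, single-machine stand-in for a Hadoop WordCount job and does not call any Splout API. The class name WordCountToTsv and its methods are hypothetical, made up for this illustration. It produces tab-separated "word, count" rows, which is the kind of prepared TSV input that the indexing step would then take.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: NOT part of Splout. It only shows the "compute" step
// whose output (TSV rows) would later be handed to the indexing step.
public class WordCountToTsv {

    // Counts lower-cased words, splitting each line on runs of non-word characters.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word for stable output
        for (String line : lines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Renders each (word, count) pair as one tab-separated row.
    static List<String> toTsv(Map<String, Integer> counts) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows.add(e.getKey() + "\t" + e.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "To index, first compute");
        for (String row : toTsv(countWords(input))) {
            System.out.println(row);
        }
    }
}
```

In a real setup the counting would run as a Hadoop job and write its output to HDFS; that output folder would then be passed as the input of the Splout indexing job, in the same way the pagecounts sample is passed with -i in the command above.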

The most common input formats for Splout are CSV/TSV, but you can use other kinds of data as well. There is also some integration with Hive / Pig and Cascading (see the user guide for that). If you have binary data, you would need to use a custom input format that implements the interface InputFormat<ITuple, NullWritable>. For more clarity, see the "addX" methods in TableBuilder: https://github.com/datasalt/splout-db/blob/master/splout-hadoop/src/main/java/com/splout/db/hadoop/TableBuilder.java

Please don't hesitate to ask more questions. We know we have to work more on the documentation, examples and guides. Splout is nonetheless a production-proven system, as we have been running it successfully at various clients. It may just take new people a bit of time to make it work for their use case, but we're here to help.

tongping commented 10 years ago

Thanks!

This works.