USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Added a tool to dump sequence file records as raw files #153

Closed: thammegowda closed this 6 years ago

thammegowda commented 6 years ago

What changes were proposed in this pull request?

A tool to create raw files from sequence file records

Is this related to an already existing issue on sparkler?
Yes, #151

Will it close an existing issue?
Closes #151

How was this patch tested?

By exporting a few job directories

Usage:

bin/sparkler.sh dump -i $JOB_DIR -o $ROOT

Full usage:

$ bin/sparkler.sh dump
 -i (--job-dir, --in) VAL    : Sparkler Output Directory Containing Sequence
                               Files
 -o (--dump-root, --out) VAL : Directory to store raw files
 -sc (--skip-content)        : Writes the index file and skips content write
                               step
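
Since the patch was tested by exporting a few job directories, a batch export might look roughly like the sketch below. It is only an illustration: the helper names `dump_job` and `dump_all`, the `DRY_RUN` switch, and the choice of one output subdirectory per job are assumptions, not part of this pull request. It also assumes the job directories sit side by side under one parent directory. Only the `dump` command and its `-i` and `-o` options come from the usage shown above.

```shell
# Hypothetical helper: dump one job directory into its own subdirectory
# of the output root. With DRY_RUN=1 the command is printed, not run.
dump_job() {
  src="$1"
  dest="$2/$(basename "$src")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bin/sparkler.sh dump -i $src -o $dest"
  else
    bin/sparkler.sh dump -i "$src" -o "$dest"
  fi
}

# Hypothetical helper: dump every job directory found directly under
# a parent directory (assumed layout: one subdirectory per job).
dump_all() {
  jobs_root="$1"
  out_root="$2"
  for job_dir in "$jobs_root"/*/; do
    # Skip the unexpanded pattern when no subdirectory matches.
    [ -d "$job_dir" ] || continue
    # Strip the trailing slash so basename sees a clean path.
    dump_job "${job_dir%/}" "$out_root"
  done
}
```

Running it with `DRY_RUN=1` exported first prints the commands so they can be checked before anything is written. `dump_all "$JOBS_ROOT" "$ROOT"` would then perform the export, where `JOBS_ROOT` is a placeholder for the directory that holds the job directories.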

chrismattmann commented 6 years ago

We need this tool; it's great. Thank you, @thammegowda!