edwardcapriolo / filecrush

Remedy small files by combining them into larger ones.

Better Logging needed when fileformat is in question #9

Open alexmc6 opened 9 years ago

alexmc6 commented 9 years ago

I managed to run filecrush for the first time, and after everything seemed to finish successfully I got this error. In fact, although it reported loads of files to crush, it did not actually crush any...

Exception in thread "main" java.io.IOException: not a gzip file
	at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:496)
	at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:257)
	at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:186)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
	at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2281)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2304)
	at com.m6d.filecrush.crush.Crush.cloneOutput(Crush.java:769)
	at com.m6d.filecrush.crush.Crush.run(Crush.java:666)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.m6d.filecrush.crush.Crush.main(Crush.java:1330)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

My command line

hadoop jar ./filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush --info --clone --verbose --compress gzip --input-format text --output-format text /user/camus/tests/topics/ /user/camus/tests/topics_orig/ 20101121121212
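To double-check what the job was actually reading, dumping the first few bytes of one of the input files shows the on-disk format (the file name here is just a placeholder for one of the files under the input directory):

hadoop fs -cat /user/camus/tests/topics/part-00000.gz | head -c 4 | od -c

A gzip stream begins with the magic bytes 0x1f 0x8b (od prints them as 037 213), while a SequenceFile begins with the ASCII bytes S E Q followed by a version byte.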

Why does it say "SequenceFile"? I have gzipped JSON (i.e. text), soon to be snappy JSON.
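From the stack trace it appears the clone step (Crush.cloneOutput) iterates each file with a SequenceFile.Reader, and the gzip decompressor underneath it then rejects the bytes it is handed, which would explain why SequenceFile shows up in the trace even though the input is gzipped text. Per the issue title, a format probe with proper logging before the reader is constructed would make this far easier to diagnose. Here is a minimal sketch of such a check; FormatSniffer is a hypothetical helper written for illustration, not part of filecrush:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of filecrush: report what a file's header
// actually contains before committing to a particular reader for it.
public class FormatSniffer {

    public static String sniff(Configuration conf, Path file) throws IOException {
        byte[] header = new byte[3];
        try (FSDataInputStream in = file.getFileSystem(conf).open(file)) {
            in.readFully(0, header); // positioned read of the first three bytes
        }
        // SequenceFiles start with the ASCII bytes 'S' 'E' 'Q' (then a version byte).
        if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
            return "sequence file";
        }
        // Gzip streams start with the two magic bytes 0x1f 0x8b.
        if ((header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
            return "gzip stream";
        }
        return String.format("unknown format (header bytes %02x %02x %02x)",
                header[0] & 0xff, header[1] & 0xff, header[2] & 0xff);
    }
}

A log line such as file + " looks like a " + FormatSniffer.sniff(conf, file), emitted just before the reader is built, would point straight at a mismatch instead of dying with an opaque "not a gzip file" deep inside the decompressor.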

alexmc6 commented 9 years ago

Invoking MapReduce:

15/04/10 15:51:46 INFO impl.TimelineClientImpl: Timeline service address: http://bruathdp002.redacted.local:8188/ws/v1/timeline/
15/04/10 15:51:46 INFO client.RMProxy: Connecting to ResourceManager at bruathdp002.redacted.local/10.34.37.2:8050
15/04/10 15:51:46 INFO impl.TimelineClientImpl: Timeline service address: http://bruathdp002.redacted.local:8188/ws/v1/timeline/
15/04/10 15:51:46 INFO client.RMProxy: Connecting to ResourceManager at bruathdp002.redacted.local/10.34.37.2:8050
15/04/10 15:51:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 16477 for camus on ha-hdfs:uatcluster
15/04/10 15:51:46 INFO security.TokenCache: Got dt for hdfs://uatcluster; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:uatcluster, Ident: (HDFS_DELEGATION_TOKEN token 16477 for camus)
15/04/10 15:51:46 INFO mapred.FileInputFormat: Total input paths to process : 1
15/04/10 15:51:46 INFO mapred.FileInputFormat: Total input paths to process : 1
15/04/10 15:51:46 INFO mapreduce.JobSubmitter: number of splits:3
15/04/10 15:51:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427208025863_5476
15/04/10 15:51:46 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:uatcluster, Ident: (HDFS_DELEGATION_TOKEN token 16477 for camus)
15/04/10 15:51:47 INFO impl.YarnClientImpl: Submitted application application_1427208025863_5476
15/04/10 15:51:47 INFO mapreduce.Job: The url to track the job: http://bruathdp002.redacted.local:8088/proxy/application_1427208025863_5476/
15/04/10 15:51:47 INFO mapreduce.Job: Running job: job_1427208025863_5476
15/04/10 15:51:53 INFO mapreduce.Job: Job job_1427208025863_5476 running in uber mode : false
15/04/10 15:51:53 INFO mapreduce.Job: map 0% reduce 0%
15/04/10 15:52:00 INFO mapreduce.Job: map 33% reduce 0%
15/04/10 15:52:01 INFO mapreduce.Job: map 100% reduce 0%
15/04/10 15:52:10 INFO mapreduce.Job: map 100% reduce 68%
15/04/10 15:52:16 INFO mapreduce.Job: map 100% reduce 69%
15/04/10 15:52:28 INFO mapreduce.Job: map 100% reduce 70%
15/04/10 15:52:47 INFO mapreduce.Job: map 100% reduce 71%
15/04/10 15:53:02 INFO mapreduce.Job: map 100% reduce 72%
15/04/10 15:53:05 INFO mapreduce.Job: map 100% reduce 73%
15/04/10 15:53:11 INFO mapreduce.Job: map 100% reduce 74%
15/04/10 15:53:14 INFO mapreduce.Job: map 100% reduce 75%
15/04/10 15:53:17 INFO mapreduce.Job: map 100% reduce 76%
15/04/10 15:53:20 INFO mapreduce.Job: map 100% reduce 77%
15/04/10 15:53:26 INFO mapreduce.Job: map 100% reduce 78%
15/04/10 15:53:29 INFO mapreduce.Job: map 100% reduce 79%
15/04/10 15:53:32 INFO mapreduce.Job: map 100% reduce 80%
15/04/10 15:53:38 INFO mapreduce.Job: map 100% reduce 81%
15/04/10 15:53:51 INFO mapreduce.Job: map 100% reduce 82%
15/04/10 15:54:06 INFO mapreduce.Job: map 100% reduce 83%
15/04/10 15:54:09 INFO mapreduce.Job: map 100% reduce 84%
15/04/10 15:54:15 INFO mapreduce.Job: map 100% reduce 85%
15/04/10 15:54:34 INFO mapreduce.Job: map 100% reduce 86%
15/04/10 15:54:40 INFO mapreduce.Job: map 100% reduce 87%
15/04/10 15:54:46 INFO mapreduce.Job: map 100% reduce 88%
15/04/10 15:54:58 INFO mapreduce.Job: map 100% reduce 89%
15/04/10 15:55:13 INFO mapreduce.Job: map 100% reduce 90%
15/04/10 15:55:22 INFO mapreduce.Job: map 100% reduce 91%
15/04/10 15:55:31 INFO mapreduce.Job: map 100% reduce 92%
15/04/10 15:55:34 INFO mapreduce.Job: map 100% reduce 93%
15/04/10 15:55:37 INFO mapreduce.Job: map 100% reduce 94%
15/04/10 15:55:43 INFO mapreduce.Job: map 100% reduce 95%
15/04/10 15:55:49 INFO mapreduce.Job: map 100% reduce 96%
15/04/10 15:55:55 INFO mapreduce.Job: map 100% reduce 97%
15/04/10 15:56:01 INFO mapreduce.Job: map 100% reduce 98%
15/04/10 15:56:07 INFO mapreduce.Job: map 100% reduce 99%
15/04/10 15:56:10 INFO mapreduce.Job: map 100% reduce 100%
15/04/10 15:56:11 INFO mapreduce.Job: Job job_1427208025863_5476 completed successfully
15/04/10 15:56:11 INFO mapreduce.Job: Counters: 57
	File System Counters
		FILE: Number of bytes read=1835766
		FILE: Number of bytes written=4168355
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=560020818
		HDFS: Number of bytes written=554471025
		HDFS: Number of read operations=24674
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=918
	Job Counters
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=30502
		Total time spent by all reduces in occupied slots (ms)=995608
		Total time spent by all map tasks (ms)=15251
		Total time spent by all reduce tasks (ms)=248902
		Total vcore-seconds taken by all map tasks=15251
		Total vcore-seconds taken by all reduce tasks=248902
		Total megabyte-seconds taken by all map tasks=62468096
		Total megabyte-seconds taken by all reduce tasks=2039005184
	Map-Reduce Framework
		Map input records=8221
		Map output records=8220
		Map output bytes=1812988
		Map output materialized bytes=1835778
		Input split bytes=816
		Combine input records=0
		Combine output records=0
		Reduce input groups=916
		Reduce shuffle bytes=1835778
		Reduce input records=8220
		Reduce output records=8220
		Spilled Records=16440
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=735
		CPU time spent (ms)=158640
		Physical memory (bytes) snapshot=3666952192
		Virtual memory (bytes) snapshot=19778822144
		Total committed heap usage (bytes)=4359979008
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	com.m6d.filecrush.crush.MapperCounter
		DIRS_ELIGIBLE=916
		DIRS_FOUND=1619
		DIRS_SKIPPED=703
		FILES_ELIGIBLE=8220
		FILES_FOUND=8541
		FILES_SKIPPED=321
	com.m6d.filecrush.crush.ReducerCounter
		FILES_CRUSHED=8220
		RECORDS_CRUSHED=5426454
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=83317