Open zqhxuyuan opened 9 years ago
What compaction are you using? Size tiered, leveled?
As configured, your splits are bigger than your sstables, which is why you are seeing a 1:1 ratio between sstables and mappers. With files this small it's almost not worth splitting further. Aggregating the small files into larger ones to get fewer mappers is not something that is currently possible with hadoop-sstable, as it was designed to solve the opposite problem: very large sstables. MapReduce is generally not well suited to lots of small files. If your job is running slowly, is there any possibility of running with more nodes?
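For reference, here is a minimal sketch of setting the split size on a job. The property name `hadoop.sstable.split.mb` comes from this thread; the rest of the setup (job name, the 512 MB value) is just illustrative generic Hadoop boilerplate, not the project's prescribed usage:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With ~160 MB sstables, any split size >= 160 MB yields one mapper per file,
        // so lowering it below the sstable size is the only way to get more splits.
        conf.setInt("hadoop.sstable.split.mb", 512); // illustrative value
        Job job = Job.getInstance(conf, "sstable-export");
        // ... set input/output formats, mapper class, and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```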
Leveled compaction in C*. Most files are about 160 MB, as generated by C*, and we can't change that. The default is -D hadoop.sstable.split.mb=1000, so splitting would only help if the sstables were larger than 1 GB. I found the sstablesplit tool, which splits a large sstable into small ones, but no tool that merges small ones into large ones. More sadly, even if we add nodes, we can't go much beyond 50. So I'll look for another way.
Yeah, FTR we have jobs that run against sstables with leveled compaction, which results in many files and many mappers but still runs in a matter of hours. I don't know what your environment is, but ours looks like:
AWS EMR 64 nodes m2.4xlarge
Move those over to SSD instances and things go very fast.
Recently we want to migrate data from C* to HDFS. Here is the Mapper (see the sketch after the key format below):
Because we have two regular columns (event and sequence_id), the mapper outputs columns like this:
And we aggregate them into one row by:
PartitionKey:cluster-key-values:regularColumn1Value:regularColumn2Value
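A minimal sketch of the kind of mapper described above, assuming the input has already been decoded into a partition key (key) plus a tab-separated value of clustering-key values, event, and sequence_id; the actual hadoop-sstable input key/value types differ, so this is only illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RowAggregatingMapper extends Mapper<Text, Text, Text, NullWritable> {
    private final Text outKey = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed (hypothetical) input layout: clusteringKeys \t event \t sequence_id
        String[] cols = value.toString().split("\t");
        // Build PartitionKey:cluster-key-values:regularColumn1Value:regularColumn2Value
        outKey.set(key.toString() + ":" + cols[0] + ":" + cols[1] + ":" + cols[2]);
        // One output record per logical row, so the result resembles a CQL/DBMS row.
        context.write(outKey, NullWritable.get());
    }
}
```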
In this way, one row looks more like a CQL result or a single DBMS row. Each of our SSTable files from Cassandra is almost 160 MB, and they are put into HDFS (block size = 128 MB):
We have 1674 Data.db files (almost 300 GB of data from C*):
After running on the cluster (11 nodes), I see the number of map tasks is the same as the number of Data.db files:
And of course this job takes a long time. I set -D mapred.map.tasks=180, but the map task count is still 1674.
I guess the map task number can't be assigned directly, as it is derived from the HDFS InputSplits.
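A back-of-the-envelope sketch of why 1674 mappers appear under these assumptions (default 1000 MB split size, ~160 MB files): each file fits in a single split, so the InputFormat yields one split, and therefore one map task, per file, and mapred.map.tasks is only a hint:

```java
public final class MapperCountSketch {
    public static void main(String[] args) {
        long splitMb = 1000;  // default hadoop.sstable.split.mb
        long fileMb  = 160;   // typical Data.db size here
        int  files   = 1674;  // number of Data.db files

        // Each 160 MB file fits inside one 1000 MB split.
        long splitsPerFile = Math.max(1, (fileMb + splitMb - 1) / splitMb); // = 1
        System.out.println("map tasks = " + files * splitsPerFile);          // = 1674
    }
}
```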
Is there any way to run the MR job more quickly? Or would running it as a YARN application decrease the running time?