This is a minor bug; I can fix it. To clarify, it does not create one file per node. The number of files per node equals the total number of splits divided by the number of machines.
However, only one generator thread runs inside each machine, which is why it appears to create one file per node.
In a later version, we should refine this to fully utilize the parallelism of each machine.
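To make the arithmetic concrete, here is a minimal sketch (with hypothetical variable names and a hypothetical cluster size; this is not the project's actual code) of how splits map to files per node, and where the single-thread bottleneck sits:

```java
// Hypothetical illustration of the split-to-node mapping described above.
public class SplitDistributionSketch {
    public static void main(String[] args) {
        int totalSplits = 16;  // e.g. "Number of Files" from the log below
        int numMachines = 4;   // hypothetical cluster size

        // Each machine receives totalSplits / numMachines files.
        int filesPerNode = totalSplits / numMachines;
        System.out.println("Files per node: " + filesPerNode); // prints 4

        // Currently one generator thread per machine writes all of its
        // files in sequence; a later version could use a thread pool so
        // the filesPerNode files on a node are generated in parallel.
    }
}
```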
We need to set the number of files to be greater than scale factor / 2. By default, it seems to create one file per node. I got the following error when I tried to generate 100GB of data.
Generating TPCDS data
13/07/31 18:32:21 INFO datagen.DataGenerator: Starting data generation at: Wed Jul 31 18:32:21 PDT 2013
13/07/31 18:32:21 INFO datagen.DataGenerator:
13/07/31 18:32:21 INFO datagen.DataGenerator: Input Parameters:
13/07/31 18:32:21 INFO datagen.DataGenerator: Scale Factor: 33
13/07/31 18:32:21 INFO datagen.DataGenerator: Number of Files: 16
13/07/31 18:32:21 INFO datagen.DataGenerator: Host List: /home/kunjirm/hadoop-lava/conf/slaves
13/07/31 18:32:21 INFO datagen.DataGenerator: Local Directory: /data/spark/tpcds
13/07/31 18:32:21 INFO datagen.DataGenerator: HDFS Directory: relational_data
13/07/31 18:32:21 INFO datagen.DataGenerator:
13/07/31 18:32:21 INFO datagen.DataGenerator: ERROR: The number of files must be greater than half the scale factor
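The check that fires here can be reproduced with a minimal sketch (the condition below is assumed from the error message, not copied from the project's source): with Scale Factor 33 and 16 files, 16 is not greater than 33 / 2 = 16.5, so at least 17 files are required.

```java
// Assumed validation logic reconstructed from the error message above.
public class FileCountCheck {
    public static void main(String[] args) {
        int scaleFactor = 33; // from the log
        int numFiles = 16;    // from the log

        // 16 <= 16.5, so the check fails; 17 or more files would pass.
        if (numFiles <= scaleFactor / 2.0) {
            System.err.println("ERROR: The number of files must be greater "
                    + "than half the scale factor");
        }
    }
}
```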