RunningJon / TPC-DS

Greenplum TPC-DS benchmark

Could not allocate enough memory #7

Closed: dimon777 closed this issue 7 years ago

dimon777 commented 7 years ago

I got this error when running rollout.sh:

Started gpfdist on port 9005
executing on hadoop01.home.net ./start_gpfdist.sh 9006 /data/hawq/segmentdd/pivotalguru_6
Started gpfdist on port 9006
psql -v ON_ERROR_STOP=ON -f /home/hawq/TPC-DS/04_load/051.insert.call_center.sql | grep INSERT | awk -F ' ' '{print $3}'
psql:/home/hawq/TPC-DS/04_load/051.insert.call_center.sql:1: ERROR:  Could not allocate enough memory! bucket number of result hash table and external table should match each other (cdbdatalocality.c:4203)

I am using the latest HAWQ repository code on CentOS 7 with an 8-node CDH Hadoop distribution. Everything seems to run just fine, including HAWQ itself. Please advise.

RunningJon commented 7 years ago

This isn't an issue with the TPC-DS scripts but with your HAWQ configuration: you don't have enough memory configured for HAWQ.

How many nodes?
How much RAM per node?
How much swap per node?
How much RAM is being used by other processes?
Are you using YARN or the default resource manager?
Are you using randomly distributed tables?

What are the values for these GUCs?

hawq_rm_stmt_vseg_memory
hawq_rm_memory_limit_perseg
default_hash_table_bucket_number
hawq_rm_nvseg_perquery_perseg_limit

And your OS settings: vm.overcommit_ratio vm.overcommit
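
For reference, these can be inspected with something like the following (a minimal sketch, assuming the standard hawq config utility is available on the master; the kernel overcommit settings are exposed as vm.overcommit_memory and vm.overcommit_ratio):

# show the current values of the HAWQ GUCs listed above
hawq config -s hawq_rm_stmt_vseg_memory
hawq config -s hawq_rm_memory_limit_perseg
hawq config -s default_hash_table_bucket_number
hawq config -s hawq_rm_nvseg_perquery_perseg_limit

# show the kernel overcommit settings
sysctl vm.overcommit_memory vm.overcommit_ratio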

dimon777 commented 7 years ago

How many nodes? 6 datanodes, 8 nodes total.
How much RAM per node? 8GB.
How much swap per node? 16GB.
How much RAM is being used by other processes? 1.3GB is used by CDH processes on each node; 6.5GB is free.
Are you using YARN or the default resource manager? For HAWQ the default resource manager is used.
Are you using randomly distributed tables? I am not sure about this, since these are the TPC-DS generated tables.
What are the values for these GUCs?
hawq_rm_stmt_vseg_memory = 128mb
hawq_rm_memory_limit_perseg = 4GB
default_hash_table_bucket_number = 24
hawq_rm_nvseg_perquery_perseg_limit = 6

OS settings:
vm.overcommit_ratio = 50
vm.overcommit = this parameter doesn't exist in CentOS 7.1

Thank you.

dimon777 commented 7 years ago

I see the code says this:

if ((context.resultRelationHashSegNum < context.externTableForceSegNum
        && context.externTableForceSegNum != 0)
        || (context.resultRelationHashSegNum < context.externTableLocationSegNum)) {
    elog(ERROR, "Could not allocate enough memory! "
        "bucket number of result hash table and external table should match each other");
}

What should I adjust to avoid this error? I don't know how to translate these context parameters into GUCs (if my case is indeed a misconfiguration).

RunningJon commented 7 years ago

When you run tpcds.sh, it creates the tpcds_variables.sh file. One of the variables is "RANDOM_DISTRIBUTION", which is set to false by default. I use that default because this kit is used with Greenplum more frequently, and hash distribution works best in that database. For HAWQ, you'll want to set this variable to "true" so that the tables are created with random distribution (see the sketch below).

The default scale is also 3TB, which may not be what you want to use. Change GEN_DATA_SCALE to a value that makes sense for your cluster.
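
A minimal sketch of the two relevant lines in tpcds_variables.sh, using only the variable names mentioned above and assuming GEN_DATA_SCALE is expressed in GB (the value 100 is just an example):

# tpcds_variables.sh (generated by tpcds.sh; edit before re-running rollout.sh)
RANDOM_DISTRIBUTION="true"   # use randomly distributed tables for HAWQ
GEN_DATA_SCALE="100"         # roughly 100GB of generated data instead of the 3TB default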

CDH isn't supported. Cloudera isn't a member of ODPi, probably because they don't want to compete directly with HAWQ. CentOS 7.1 is also not supported, so you'll want to use RHEL 6. But vm.overcommit just needs to be added to /etc/sysctl.conf and set to 2, on all nodes. Hortonworks is a member of ODPi, and you can also use the Apache ODPi binaries.
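
The kernel parameter referred to here is presumably vm.overcommit_memory (a plain vm.overcommit does not exist on CentOS 7, which explains the earlier observation). A minimal /etc/sysctl.conf sketch along the lines of this recommendation:

# /etc/sysctl.conf (set on every node, then apply with: sysctl -p)
vm.overcommit_memory = 2    # strict overcommit accounting, as recommended for HAWQ
vm.overcommit_ratio = 50    # matches the value already in place in this environment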

You probably have more than just 1.3GB of memory being used by other processes: likely about 4GB allocated to the DataNode and NodeManager plus about 2GB for the OS. In that case, you should set hawq_rm_memory_limit_perseg to 2GB (8GB - 6GB = 2GB).

default_hash_table_bucket_number should be 48: 6 nvseg per segment * 8 nodes = 48.
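
A minimal sketch of applying both recommendations with the hawq config utility (property names are taken from this thread; run on the master as the HAWQ admin user, and note that the exact restart syntax can vary by version):

hawq config -c hawq_rm_memory_limit_perseg -v 2GB
hawq config -c default_hash_table_bucket_number -v 48

# restart the cluster so the new settings take effect
hawq restart cluster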

dimon777 commented 7 years ago

Thank you,

I will try these. Unfortunately, I have no control over whether we use CDH or Hortonworks, or over the CentOS version (v7 appears to be supported: https://cwiki.apache.org/confluence/display/HAWQ/Build+and+Install). In my tests I have not seen any issues with this configuration. My goal is to test HAWQ on the available Hadoop platform and compare it with Impala.

RunningJon commented 7 years ago

It won't be a valid test if you are using CDH, but you should try to run all 99 of the real TPC-DS queries in Impala. It can't do it; they cherry-pick and modify the queries to work in their database.

dimon777 commented 7 years ago

Yes, I am aware of Cloudera's adjustments to the queries for the Impala TPC-DS test, and I have also read the Pivotal article on this. I fully agree: the way Cloudera did this is unacceptable and misleading for those who rely on the TPC-DS benchmark to judge a platform.

dimon777 commented 7 years ago

The issue was that .bashrc in my environment didn't have the GREENPLUM_PATH variable set. After adding export GREENPLUM_PATH=/usr/local/hawq/greenplum_path.sh and changing the RANDOM_DISTRIBUTION flag to true for the ./rollout.sh call, the test started working fine.
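
A minimal sketch of the .bashrc change described above (the greenplum_path.sh location is the one reported here; adjust it to your installation):

# append to ~/.bashrc for the user running rollout.sh
export GREENPLUM_PATH=/usr/local/hawq/greenplum_path.sh
source $GREENPLUM_PATH    # picks up the HAWQ environment (PATH, libraries)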