LLNL / magpie

Magpie contains a number of scripts for running Big Data software in HPC environments, including Hadoop and Spark. There is support for Lustre, Slurm, Moab, Torque, LSF, Flux, and more.
GNU General Public License v2.0

Hadoop Intel Lustre Locking Error on Name Node #238

Open jlaura opened 7 years ago

jlaura commented 7 years ago

I am using your wonderful set of scripts on a Slurm cluster backed with a Lustre file system. Spark (without HDFS) is running quite well. I am struggling to get HDFS to work, and I believe it is due to the Intel Lustre plugin.

I have downloaded and built the plugin from this repo and installed the .jar into $HADOOP_HOME/share/hadoop/common/lib/ as per the linked installation instructions. The linked instructions also have modifications to core-site.xml that differ from the modifications in $MAGPIE_HOME/conf/hadoop/core-site-2.X-intellustre.xml. Perhaps this is the source of the issue?
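To double-check the jar placement, I listed the common lib directory (a quick sanity check; the grep pattern is just a guess at the jar name):

# Confirm the Intel Lustre plugin jar is in Hadoop's common lib directory
ls -l $HADOOP_HOME/share/hadoop/common/lib/ | grep -i lustre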

When submitting the hadoop (or hadoop with hbase) scripts, the nodes are allocated as expected and the lowest rank is assigned as the name node (success!). Unfortunately, the name node fails with:

2017-07-03 12:49:11,921 ERROR org.apache.hadoop.hdfs.server.common.Storage: Failed to acquire lock on /scratch/jlaura/hdfsoverlustre/node-0/dfs/name/in_use.lock. If this
java.io.IOException: Function not implemented

where /scratch/jlaura/hdfsoverlustre is being set as the HADOOP_HDFSOVERLUSTRE_PATH in the submission script. I have also enabled HADOOP_HDFSOVERLUSTRE_REMOVE_LOCKS=yes, but it appears that the locks are never being written.
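For reference, the relevant lines in my submission script look roughly like this (a sketch; other settings omitted):

# Excerpt from the Slurm submission script
export HADOOP_HDFSOVERLUSTRE_PATH="/scratch/jlaura/hdfsoverlustre"
export HADOOP_HDFSOVERLUSTRE_REMOVE_LOCKS=yes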

What additional modifications might be required to get HDFS over Lustre working with Hadoop?

Full name node log:

2017-07-03 12:49:10,875 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************                   
STARTUP_MSG: Starting NameNode                                                  
STARTUP_MSG:   host = neb1.ib.astrohpc/172.16.3.201                             
STARTUP_MSG:   args = []                                                        
STARTUP_MSG:   version = 2.7.3                                                  
STARTUP_MSG:   classpath = /tmp/hadoop/j_hbase/749160/conf:/home/jlaura/bigdata/hadoop-2.7.3/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/home/jlaura/bigdata/had
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff; compiled by 'root' on 2016-08-18T01:41Z
STARTUP_MSG:   java = 1.8.0_111                                                 
************************************************************/                   
2017-07-03 12:49:10,886 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2017-07-03 12:49:10,892 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []
2017-07-03 12:49:11,119 WARN org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-namenode.properties,hadoop-metrics2.properties
2017-07-03 12:49:11,225 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-03 12:49:11,225 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2017-07-03 12:49:11,228 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://neb1:54310
2017-07-03 12:49:11,228 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use neb1:54310 to access this namenode/service.
2017-07-03 12:49:11,365 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://neb1:50070
2017-07-03 12:49:11,432 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2017-07-03 12:49:11,440 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random
2017-07-03 12:49:11,449 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2017-07-03 12:49:11,458 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2017-07-03 12:49:11,460 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to co
2017-07-03 12:49:11,460 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to co
2017-07-03 12:49:11,460 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to co
2017-07-03 12:49:11,584 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2017-07-03 12:49:11,585 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.
2017-07-03 12:49:11,597 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50070
2017-07-03 12:49:11,597 INFO org.mortbay.log: jetty-6.1.26                      
2017-07-03 12:49:11,716 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@neb1:50070
2017-07-03 12:49:11,740 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due
2017-07-03 12:49:11,740 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of da
2017-07-03 12:49:11,770 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: No KeyProvider found.
2017-07-03 12:49:11,770 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair:true
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb2 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb3 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb4 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb5 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb6 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,803 INFO org.apache.hadoop.util.HostsFileReader: Adding neb7 to the list of included hosts from /tmp/hadoop/j_hbase/749160/conf/hosts-include
2017-07-03 12:49:11,816 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
2017-07-03 12:49:11,816 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2017-07-03 12:49:11,817 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2017-07-03 12:49:11,818 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2017 Jul 03 12:49:11
2017-07-03 12:49:11,821 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap
2017-07-03 12:49:11,821 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-03 12:49:11,823 INFO org.apache.hadoop.util.GSet: 2.0% max memory 958.5 MB = 19.2 MB
2017-07-03 12:49:11,823 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication         = 3
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication             = 512
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication             = 1
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams      = 2
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer        = false
2017-07-03 12:49:11,835 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2017-07-03 12:49:11,843 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner             = jlaura (auth:SIMPLE)
2017-07-03 12:49:11,843 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup          = jlaura
2017-07-03 12:49:11,844 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true
2017-07-03 12:49:11,844 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: false
2017-07-03 12:49:11,845 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2017-07-03 12:49:11,892 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap
2017-07-03 12:49:11,892 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-03 12:49:11,892 INFO org.apache.hadoop.util.GSet: 1.0% max memory 958.5 MB = 9.6 MB
2017-07-03 12:49:11,892 INFO org.apache.hadoop.util.GSet: capacity      = 2^20 = 1048576 entries
2017-07-03 12:49:11,894 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: ACLs enabled? false
2017-07-03 12:49:11,894 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: XAttrs enabled? true
2017-07-03 12:49:11,894 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: Maximum size of an xattr: 16384
2017-07-03 12:49:11,894 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2017-07-03 12:49:11,901 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks
2017-07-03 12:49:11,901 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-03 12:49:11,902 INFO org.apache.hadoop.util.GSet: 0.25% max memory 958.5 MB = 2.4 MB
2017-07-03 12:49:11,902 INFO org.apache.hadoop.util.GSet: capacity      = 2^18 = 262144 entries
2017-07-03 12:49:11,904 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2017-07-03 12:49:11,904 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
2017-07-03 12:49:11,904 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
2017-07-03 12:49:11,906 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2017-07-03 12:49:11,907 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2017-07-03 12:49:11,907 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2017-07-03 12:49:11,908 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled
2017-07-03 12:49:11,908 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2017-07-03 12:49:11,910 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache
2017-07-03 12:49:11,910 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-03 12:49:11,910 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 958.5 MB = 294.5 KB
2017-07-03 12:49:11,910 INFO org.apache.hadoop.util.GSet: capacity      = 2^15 = 32768 entries
2017-07-03 12:49:11,921 ERROR org.apache.hadoop.hdfs.server.common.Storage: Failed to acquire lock on /scratch/jlaura/hdfsoverlustre/node-0/dfs/name/in_use.lock. If this stor
java.io.IOException: Function not implemented    
chu11 commented 7 years ago

I suspect that the "intellustre" plugin I programmed Magpie against may be different from the plugin you are trying out. Unfortunately, there is some strange lineage there, and I think Intel released multiple plugins over the years. Could you try HADOOP_FILESYSTEM_MODE="hdfsoverlustre"? This more basic configuration should definitely work (you may want to rm -rf your current hdfsoverlustre directory to give it a clean start).
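Roughly what I have in mind (a sketch, using the path from your report):

# In the submission script: fall back to the plain hdfs-over-lustre mode
export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
# Clean start: remove the prior hdfs-over-lustre state before resubmitting
rm -rf /scratch/jlaura/hdfsoverlustre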

jlaura commented 7 years ago

Thanks for the suggestion. I removed the hdfsoverlustre directory on the lustre file system, ensured that the HADOOP_FILESYSTEM_MODE variable was set in the submit script (link to a gist with the script), and attempted to launch a job. The error is the same, with a lock acquisition issue. This is java version 1.8.0 and hadoop 2.7.3. Perhaps I have an incompatibility there?

Log:

STARTUP_MSG:   java = 1.8.0_111
************************************************************/
17/07/05 13:16:35 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
17/07/05 13:16:35 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-b5a22005-58ee-4fc0-ae8d-25996fd64bc6
17/07/05 13:16:35 INFO namenode.FSNamesystem: No KeyProvider found.
17/07/05 13:16:35 INFO namenode.FSNamesystem: fsLock is fair:true
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb2 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb3 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb4 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb5 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb6 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO util.HostsFileReader: Adding neb7 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
17/07/05 13:16:35 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
17/07/05 13:16:35 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
17/07/05 13:16:35 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
17/07/05 13:16:35 INFO blockmanagement.BlockManager: The block deletion will start around 2017 Jul 05 13:16:35
17/07/05 13:16:35 INFO util.GSet: Computing capacity for map BlocksMap
17/07/05 13:16:35 INFO util.GSet: VM type       = 64-bit
17/07/05 13:16:35 INFO util.GSet: 2.0% max memory 958.5 MB = 19.2 MB
17/07/05 13:16:35 INFO util.GSet: capacity      = 2^21 = 2097152 entries
17/07/05 13:16:35 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
17/07/05 13:16:35 INFO blockmanagement.BlockManager: defaultReplication         = 3
17/07/05 13:16:35 INFO blockmanagement.BlockManager: maxReplication             = 512
17/07/05 13:16:35 INFO blockmanagement.BlockManager: minReplication             = 1
17/07/05 13:16:35 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
17/07/05 13:16:35 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
17/07/05 13:16:35 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
17/07/05 13:16:35 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
17/07/05 13:16:35 INFO namenode.FSNamesystem: fsOwner             = jlaura (auth:SIMPLE)
17/07/05 13:16:35 INFO namenode.FSNamesystem: supergroup          = jlaura
17/07/05 13:16:35 INFO namenode.FSNamesystem: isPermissionEnabled = true
17/07/05 13:16:35 INFO namenode.FSNamesystem: HA Enabled: false
17/07/05 13:16:35 INFO namenode.FSNamesystem: Append Enabled: true
17/07/05 13:16:35 INFO util.GSet: Computing capacity for map INodeMap
17/07/05 13:16:35 INFO util.GSet: VM type       = 64-bit
17/07/05 13:16:35 INFO util.GSet: 1.0% max memory 958.5 MB = 9.6 MB
17/07/05 13:16:35 INFO util.GSet: capacity      = 2^20 = 1048576 entries
17/07/05 13:16:35 INFO namenode.FSDirectory: ACLs enabled? false
17/07/05 13:16:35 INFO namenode.FSDirectory: XAttrs enabled? true
17/07/05 13:16:35 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
17/07/05 13:16:35 INFO namenode.NameNode: Caching file names occuring more than 10 times
17/07/05 13:16:35 INFO util.GSet: Computing capacity for map cachedBlocks
17/07/05 13:16:35 INFO util.GSet: VM type       = 64-bit
17/07/05 13:16:35 INFO util.GSet: 0.25% max memory 958.5 MB = 2.4 MB
17/07/05 13:16:35 INFO util.GSet: capacity      = 2^18 = 262144 entries
17/07/05 13:16:35 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
17/07/05 13:16:35 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
17/07/05 13:16:35 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
17/07/05 13:16:35 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
17/07/05 13:16:35 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
17/07/05 13:16:35 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
17/07/05 13:16:35 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
17/07/05 13:16:35 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
17/07/05 13:16:35 INFO util.GSet: Computing capacity for map NameNodeRetryCache
17/07/05 13:16:35 INFO util.GSet: VM type       = 64-bit
17/07/05 13:16:35 INFO util.GSet: 0.029999999329447746% max memory 958.5 MB = 294.5 KB
17/07/05 13:16:35 INFO util.GSet: capacity      = 2^15 = 32768 entries
17/07/05 13:16:35 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1619985173-172.16.3.201-1499285795952
17/07/05 13:16:35 INFO common.Storage: Storage directory /scratch/jlaura/hdfsoverlustre/node-0/dfs/name has been successfully formatted.
17/07/05 13:16:36 INFO namenode.FSImageFormatProtobuf: Saving image file /scratch/jlaura/hdfsoverlustre/node-0/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
17/07/05 13:16:36 INFO namenode.FSImageFormatProtobuf: Image file /scratch/jlaura/hdfsoverlustre/node-0/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 337 bytes saved in 0 seconds.
17/07/05 13:16:36 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/07/05 13:16:36 INFO util.ExitUtil: Exiting with status 0
17/07/05 13:16:36 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at neb1.ib.astrohpc/172.16.3.201
************************************************************/
*******************************************************
* Post Setup Complete
*******************************************************
chu11 commented 7 years ago

There appears to be something more fundamentally wrong going on. Dunno if Java 1.8 is the problem, but couldn't hurt to try with Java 1.7.

Does the namenode log say anything particularly interesting? It'd be stored in something like /tmp/hadoop/jobname/log.

jlaura commented 7 years ago

No joy downgrading to Java 1.7.0. The namenode log is not particularly interesting outside of the error throwing Function not implemented. Copied below:

bash-4.2$ cat /tmp/hadoop/j_hbase/749198/log/hadoop-jlaura-namenode-neb1.log
2017-07-05 13:17:43,565 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = neb1.ib.astrohpc/172.16.3.201
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3
STARTUP_MSG:   classpath = ###TRUNCATED - I can add in a gist.
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff; compiled by 'root' on 2016-08-18T01:41Z
STARTUP_MSG:   java = 1.8.0_111
************************************************************/
2017-07-05 13:17:43,572 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2017-07-05 13:17:43,575 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []
2017-07-05 13:17:43,813 WARN org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-namenode.properties,hadoop-metrics2.properties
2017-07-05 13:17:43,919 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-05 13:17:43,919 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2017-07-05 13:17:43,922 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://neb1:54310
2017-07-05 13:17:43,923 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use neb1:54310 to access this namenode/service.
2017-07-05 13:17:44,096 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://neb1:50070
2017-07-05 13:17:44,201 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2017-07-05 13:17:44,210 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2017-07-05 13:17:44,219 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2017-07-05 13:17:44,241 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2017-07-05 13:17:44,243 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs
2017-07-05 13:17:44,243 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2017-07-05 13:17:44,243 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2017-07-05 13:17:44,372 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2017-07-05 13:17:44,373 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*
2017-07-05 13:17:44,386 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50070
2017-07-05 13:17:44,386 INFO org.mortbay.log: jetty-6.1.26
2017-07-05 13:17:44,547 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@neb1:50070
2017-07-05 13:17:44,575 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2017-07-05 13:17:44,575 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
2017-07-05 13:17:44,614 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: No KeyProvider found.
2017-07-05 13:17:44,614 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair:true
2017-07-05 13:17:44,649 INFO org.apache.hadoop.util.HostsFileReader: Adding neb2 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,649 INFO org.apache.hadoop.util.HostsFileReader: Adding neb3 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,649 INFO org.apache.hadoop.util.HostsFileReader: Adding neb4 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,650 INFO org.apache.hadoop.util.HostsFileReader: Adding neb5 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,650 INFO org.apache.hadoop.util.HostsFileReader: Adding neb6 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,650 INFO org.apache.hadoop.util.HostsFileReader: Adding neb7 to the list of included hosts from /tmp/hadoop/j_hbase/749198/conf/hosts-include
2017-07-05 13:17:44,661 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
2017-07-05 13:17:44,661 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2017-07-05 13:17:44,662 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2017-07-05 13:17:44,662 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2017 Jul 05 13:17:44
2017-07-05 13:17:44,664 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap
2017-07-05 13:17:44,664 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-05 13:17:44,665 INFO org.apache.hadoop.util.GSet: 2.0% max memory 958.5 MB = 19.2 MB
2017-07-05 13:17:44,665 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication         = 3
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication             = 512
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication             = 1
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams      = 2
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer        = false
2017-07-05 13:17:44,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2017-07-05 13:17:44,680 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner             = jlaura (auth:SIMPLE)
2017-07-05 13:17:44,680 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup          = jlaura
2017-07-05 13:17:44,680 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true
2017-07-05 13:17:44,680 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: false
2017-07-05 13:17:44,681 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2017-07-05 13:17:44,720 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap
2017-07-05 13:17:44,721 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-05 13:17:44,721 INFO org.apache.hadoop.util.GSet: 1.0% max memory 958.5 MB = 9.6 MB
2017-07-05 13:17:44,721 INFO org.apache.hadoop.util.GSet: capacity      = 2^20 = 1048576 entries
2017-07-05 13:17:44,722 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: ACLs enabled? false
2017-07-05 13:17:44,723 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: XAttrs enabled? true
2017-07-05 13:17:44,723 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: Maximum size of an xattr: 16384
2017-07-05 13:17:44,723 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2017-07-05 13:17:44,729 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks
2017-07-05 13:17:44,729 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-05 13:17:44,729 INFO org.apache.hadoop.util.GSet: 0.25% max memory 958.5 MB = 2.4 MB
2017-07-05 13:17:44,729 INFO org.apache.hadoop.util.GSet: capacity      = 2^18 = 262144 entries
2017-07-05 13:17:44,731 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2017-07-05 13:17:44,731 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
2017-07-05 13:17:44,731 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
2017-07-05 13:17:44,734 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2017-07-05 13:17:44,734 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2017-07-05 13:17:44,734 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2017-07-05 13:17:44,735 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled
2017-07-05 13:17:44,736 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2017-07-05 13:17:44,737 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache
2017-07-05 13:17:44,737 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2017-07-05 13:17:44,737 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 958.5 MB = 294.5 KB
2017-07-05 13:17:44,737 INFO org.apache.hadoop.util.GSet: capacity      = 2^15 = 32768 entries
2017-07-05 13:17:44,748 ERROR org.apache.hadoop.hdfs.server.common.Storage: Failed to acquire lock on /scratch/jlaura/hdfsoverlustre/node-0/dfs/name/in_use.lock. If this storage directory is mounted via NFS, ensure that the appropriate nfs lock services are running.
java.io.IOException: Function not implemented
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115)
    at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:711)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:679)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:499)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:322)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:215)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:796)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
2017-07-05 13:17:44,750 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: Function not implemented
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115)
    at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:711)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:679)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:499)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:322)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:215)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:796)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
2017-07-05 13:17:44,752 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@neb1:50070
2017-07-05 13:17:44,852 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2017-07-05 13:17:44,853 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2017-07-05 13:17:44,853 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2017-07-05 13:17:44,853 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException: Function not implemented
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115)
    at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:711)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:679)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:499)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:322)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:215)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:796)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
2017-07-05 13:17:44,856 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-07-05 13:17:44,859 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at neb1.ib.astrohpc/xxx.xx.xx.xxx
************************************************************/
chu11 commented 7 years ago

Just to make sure, have you removed the intel lustre plugin jar completely? By nothing more than dumb luck, I found this stackoverflow where someone was having problems with that plugin and got the same error message. https://stackoverflow.com/questions/43392133/unable-to-start-name-node-while-configuring-hadoop-for-lustre/43392558#43392558 Best to axe that plugin if it's doing something behind the scenes that we're not aware of.

My java-fu is weak, but Googling around it appears there's something subtle with some NFS/Lustre implementations not implementing some underlying locking mechanism that sun.nio.ch.FileDispatcherImpl.lock0 needs. Do you know what version of Lustre you are using? I'm at 2.5 and began developing Magpie on 2.2. I think /proc/fs/lustre/versions can give you the version number.

Unclear if some locking mechanism stuff in Lustre is configurable. I'd have to ping some far more knowledgeable people in that regard.
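One quick test that might narrow this down (a sketch; it assumes the util-linux flock tool is available, and the lock type it takes may differ from what Java's tryLock uses, so treat it as indicative only):

# Try a userspace file lock on the hdfs-over-lustre path; a "Function not
# implemented" (ENOSYS) failure suggests the mount lacks flock/localflock support
touch /scratch/jlaura/hdfsoverlustre/locktest
flock -n /scratch/jlaura/hdfsoverlustre/locktest -c 'echo lock acquired'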

jlaura commented 7 years ago

@chu11 - our dev ops person and I are finally back in the building at the same time. Another question about your setup - do you have Lustre mounted with locking enabled? If so, full flock or localflock?

We are on Lustre 2.8, so not so far ahead, and nothing leaped out of the Lustre change logs to indicate a major change to locking.
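For what it's worth, this is how we're checking the client mount options (looking for flock or localflock in the option list):

# List Lustre mounts and their options on a client node
mount -t lustre
grep lustre /proc/mounts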

Thanks! J

chu11 commented 7 years ago

For at least my test system, flock appears to be enabled. I didn't think about whether or not the filesystem itself had locking enabled, but I could see that being an issue.