bio-guoda / guoda-services

Services provided by GUODA, currently a container for tickets and wikis.

MIT License

marathon cannot be accessed and takes up mesos resources #73

Closed by jhpoelen 5 years ago

jhpoelen commented 5 years ago

to reproduce:

visit mesos02.acis.ufl.edu:8080

expected result: marathon management ui

actual result:

HTTP ERROR: 503

Problem accessing /. Reason:

    Could not determine the current leader

Powered by Jetty:// 9.3.z-SNAPSHOT

Also, note that the Marathon framework is holding an unexpectedly large amount of resources (156 CPUs; 271 GB memory) while running a single task, the Spark job dispatcher:

Active tasks:
    1
CPUs:
    156
GPUs:
    0
Mem:
    271.2 GB
Disk:
    8861.4 GB

jhpoelen commented 5 years ago

current state of zookeeper -

$ echo dump | nc mesos02 2181                                                    
SessionTracker dump:                                                                                        
org.apache.zookeeper.server.quorum.LearnerSessionTracker@2a955f14                                           
ephemeral nodes dump:                                                                                       
Sessions with Ephemerals (10):                                                                              
0x164c477c627fb13:                                                                                          
0x696f4d2d145520:                                                                                           
        /marathon/leader/member_0000000210                                                                  
        /marathon/leader/member_0000000209                                                                  
0x2696f4d2daf611b:                                                                                          
        /marathon/leader/member_0000000215                                                                  
0x696f4d2d140001:                                                                                           
        /mesos/log_replicas/0000001784                                                                      
0x696f4d2d140002:                                                                                           
        /mesos/json.info_0000001777                                                                         
0x2696f4d2daf03b9:                                                                                          
        /mesos/json.info_0000001779                                                                         
0x2696f4d2daf03b8:                                                                                          
        /mesos/log_replicas/0000001786                                                                      
0x696f4d2d140007:                                                                                           
        /mesos/json.info_0000001778                                                                         
0x696f4d2d140008:                                                                                           
        /mesos/log_replicas/0000001785                                                                      
0x16972fdc35e5d11:                                                                                          
        /hadoop-ha/hdfscluster/ActiveStandbyElectorLock
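
To look at only the Marathon leader-election entries in a dump like the one above, the output can be filtered. This is a minimal sketch; the function name `marathon_leader_nodes` is made up for illustration, and it assumes the znode paths follow the `/marathon/leader/member_NNN` pattern seen in the dump.

```shell
# Hypothetical helper: read a ZooKeeper "dump" listing on stdin and
# print only the Marathon leader-election znode paths, one per line.
marathon_leader_nodes() {
    # -o prints just the matching part of each line, dropping indentation.
    grep -o '/marathon/leader/member_[0-9]*'
}

# Usage against a live ensemble (host and port as used in this thread):
#   echo dump | nc mesos02 2181 | marathon_leader_nodes
```

Run against the dump above, this would list the three `member_` entries, which makes it easier to see how many election candidates are registered and compare that with the Marathon instances actually running.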

jhpoelen commented 5 years ago

I managed to remove the Marathon framework and its state from Mesos by creating request.txt containing:

frameworkId=[marathon framework id]

and posting it to the Mesos master with:

curl -d@request.txt -X POST http://mesos02:5050/master/teardown

After executing the command, the Marathon framework was no longer listed and the resources were available again. Marathon will have to be re-registered manually if it is needed again. To launch Spark jobs, you have to use spark-shell with an explicit ZooKeeper master URL, like:

/opt/spark/latest/bin/spark-shell --master mesos://zk://mesos01:2181,mesos02:2181,mesos03:2181/mesos --driver-memory 4G --executor-memory 4G #--total-executor-cores 4
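
For reference, the teardown step described earlier in this comment could be wrapped in a small helper. This is only a sketch: the function name `teardown_framework` and the `MESOS_MASTER` and `DRY_RUN` environment variables are assumptions introduced here, not part of Mesos. It passes the `frameworkId=...` form data inline rather than through request.txt, which should send the same request body.

```shell
# Hypothetical helper: tear down a Mesos framework by ID using the
# master's /master/teardown endpoint (the same endpoint used above).
# MESOS_MASTER and DRY_RUN are illustrative variables, not Mesos settings.
teardown_framework() {
    framework_id="$1"
    master="${MESOS_MASTER:-http://mesos02:5050}"

    if [ -z "$framework_id" ]; then
        echo "usage: teardown_framework FRAMEWORK_ID" >&2
        return 1
    fi

    # With DRY_RUN set, only print the command that would be executed.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "curl -X POST -d frameworkId=$framework_id $master/master/teardown"
        return 0
    fi

    curl -X POST -d "frameworkId=$framework_id" "$master/master/teardown"
}

# Usage (replace the placeholder with the real framework ID):
#   DRY_RUN=1 teardown_framework "[marathon framework id]"
```

Teardown shuts the framework down along with its tasks, so checking the command with `DRY_RUN=1` first is a cheap safeguard against passing the wrong ID.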

fyi @mielliott @roncanepa

jhpoelen commented 5 years ago

Marathon is no longer actively used. Closing issue.