douban / pymesos

A pure python implementation of Mesos scheduler and executor
BSD 3-Clause "New" or "Revised" License

Ability to run 10,000 tasks #128

Open mulongfu opened 4 years ago

mulongfu commented 4 years ago

Hi,

When I use pymesos to run 10, 100, or 1000 tasks at the same time, it runs perfectly. However, with 10000 tasks at the same time, some tasks end up with status TASK_LOST.

I'm not sure whether the problem is in pymesos or in my settings.

Mesos version: 1.9.0
pymesos: latest from git clone (2020/6/9)
Total resources: CPU 412, MEM 5.2TB, Disk 983.9
Per task: 0.01 cpu, 1M mem

For the tasks that end up as TASK_LOST, the Mesos master shows: Sending status update TASK_LOST for task task-xx of framework xxx 'Task launched with invalid offers: Offer xxx is no longer valid'

My guess is that two or more tasks use the same offer ID. When one of these tasks finishes, the offer is released, and the other tasks using the same offer ID cannot use it anymore.
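
If reuse of an offer ID across separate launches is indeed the cause (this is only the guess above, not something confirmed in this thread), one way to rule it out is to use every offer exactly once and pack as many tasks into it as its resources allow. The sketch below is a minimal illustration on plain dicts, not pymesos code: `plan_launches`, the offer dict layout, and the task IDs are hypothetical names made up for the example. Each batch it returns would then go into a single launch call for that offer (for example `driver.launchTasks(offer.id, tasks, filters)` in a pymesos scheduler).

```python
def plan_launches(offers, pending_task_ids, cpus_per_task=0.01, mem_per_task=1.0):
    """Assign every pending task to at most one offer.

    Returns (plan, leftover): plan maps offer id -> list of task ids,
    so each offer id appears once; leftover holds tasks that did not fit.
    Offers are hypothetical dicts: {"id": str, "cpus": float, "mem": float}.
    """
    plan = {}
    queue = list(pending_task_ids)
    # Work in integer milli-units to avoid float errors such as 0.03 / 0.01.
    cpu_unit = round(cpus_per_task * 1000)
    mem_unit = round(mem_per_task * 1000)
    for offer in offers:
        if not queue:
            break
        fit_cpu = round(offer["cpus"] * 1000) // cpu_unit
        fit_mem = round(offer["mem"] * 1000) // mem_unit
        capacity = min(fit_cpu, fit_mem)
        if capacity <= 0:
            continue  # offer too small; a real scheduler would decline it
        plan[offer["id"]] = queue[:capacity]
        del queue[:capacity]
    return plan, queue


# Example with made-up offers and the per-task size from this issue.
offers = [
    {"id": "o1", "cpus": 0.03, "mem": 10},
    {"id": "o2", "cpus": 1.0, "mem": 2},
]
tasks = ["task-%d" % i for i in range(6)]
plan, leftover = plan_launches(offers, tasks)
for offer_id, batch in plan.items():
    print(offer_id, batch)  # one launch call per offer id
print("waiting for next offers:", leftover)
```

Tasks left over simply wait for the next round of offers instead of being launched against an offer that has already been consumed.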

ja8zyjits commented 3 years ago

We recently had this issue.

We found a small correlation: when the CPU usage of the scheduler goes high, i.e. 50-60% (verified via docker stats), the invalid offer issue shoots up. Our scheduler runs in Docker with 1 CPU and 1 GB of RAM.

We isolated the code blocks with high CPU usage and either reworked them or moved them to an async framework like Celery. Most of these functions communicated with external microservices and not with Mesos, hence it was safe to rework them.

Soon the issue was no longer visible. Can you give it a try?
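
A minimal sketch of the same idea, assuming a thread pool from the standard library instead of Celery (a swapped-in technique, not what we actually run): the scheduler callback only queues the slow work and returns immediately. `notify_external_service` and `on_status_update` are hypothetical names, and the `gate` event exists only to simulate a slow remote call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def notify_external_service(task_id, state, gate=None):
    """Hypothetical slow call to an external microservice."""
    if gate is not None:
        gate.wait(timeout=5)  # stands in for network latency
    return (task_id, state)


def on_status_update(task_id, state, gate=None):
    """What a scheduler callback could do: queue the work and return."""
    return _pool.submit(notify_external_service, task_id, state, gate)


# Usage: the callback returns while the slow call is still pending.
gate = threading.Event()
future = on_status_update("task-1", "TASK_FINISHED", gate)
print("callback returned, work done yet?", future.done())
gate.set()  # let the simulated remote call finish
print(future.result(timeout=5))
```

Threads help here only because the offloaded work waits on the network; CPU-heavy work would still compete for the single CPU and is better moved to a separate process or worker, as we did with Celery.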