NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

Workflow throughput falls when submitting ~50 of them #29

Open jameshcorbett opened 1 year ago

jameshcorbett commented 1 year ago

Very general issue. Since the Jan 26th switch firmware update, we have not managed to crash the switch. However, when submitting ~50 workflows at a time, there is an enormous drop in performance. Most of the workflows are stuck in DriverWait, sometimes for ten minutes at a time. They seem to progress in fits and starts.

NateThornton commented 1 year ago

I looked into this and, as far as the Setup phase is concerned, there are a couple of optimizations we can make. One is a trivial change that improves setup performance slightly (~10%); the other is a bit more complex but would improve setup performance greatly. I'll work with Matt on the necessary changes in the coming days.

behlendorf commented 1 year ago

Setup timings for a series of 50 workflows, which were scheduled to 2 computes as they became available. Each job specified a single xfs filesystem: #DW jobdw capacity=1TiB type=xfs name=test1. It took about 24 minutes to work through all of the workflows, and the setup times ranged from as low as 9 seconds to as high as 146 seconds.

NAME                         STATE      READY   STATUS       AGE      DESIREDSTATE   DESIREDSTATECHANGE
fluxjob-173130553699075072   Setup      true    Completed    15s      Setup          11s       
fluxjob-173130559705318400   Setup      true    Completed    26s      Setup          22s               
fluxjob-173130565610898432   Setup      true    Completed    2m10s    Setup          73s               
fluxjob-173130571550032896   Setup      true    Completed    2m19s    Setup          84s               
fluxjob-173130583361192960   Setup      true    Completed    3m12s    Setup          14s               
fluxjob-173130577438835712   Setup      true    Completed    3m20s    Setup          23s               
fluxjob-173130593628849152   Setup      true    Completed    4m20s    Setup          9s
fluxjob-173130599601538048   Setup      true    Completed    4m30s    Setup          20s
fluxjob-173130611479806976   Setup      true    Completed    5m29s    Setup          12s
fluxjob-173130605557449728   Setup      true    Completed    5m39s    Setup          23s
fluxjob-173130617435718656   Setup      true    Completed    6m16s    Setup          12s
fluxjob-173130623475516416   Setup      true    Completed    7m28s    Setup          84s
fluxjob-173130629414650880   Setup      true    Completed    7m59s    Setup          11s
fluxjob-173130635337008128   Setup      true    Completed    8m10s    Setup          22s
fluxjob-173130641292919808   Setup      true    Completed    9m38s    Setup          64s
fluxjob-173130647181722624   Setup      true    Completed    9m43s    Setup          70s
fluxjob-173130653120857088   Setup      true    Completed    11m      Setup          65s
fluxjob-173130659043214336   Setup      true    Completed    11m      Setup          74s
fluxjob-173130664915239936   Setup      true    Completed    11m      Setup          15s
fluxjob-173131106910995456   Setup      true    Completed    11m      Setup          23s
fluxjob-173131451917665280   Setup      true    Completed    11m      Setup          14s
fluxjob-173131523589932032   Setup      true    Completed    12m      Setup          22s
fluxjob-173131680154911744   Setup      true    Completed    15m      Setup          2m8s              
fluxjob-173131599590720512   Setup      true    Completed    15m      Setup          2m15s             
fluxjob-173131765332837376   Setup      true    Completed    15m      Setup          11s               
fluxjob-173131855174829056   Setup      true    Completed    15m      Setup          21s               
fluxjob-173131949714441216   Setup      true    Completed    16m      Setup          10s               
fluxjob-173132049052337152   Setup      true    Completed    16m      Setup          20s               
fluxjob-173132153154962432   Setup      true    Completed    16m      Setup          9s                
fluxjob-173132262022317056   Setup      true    Completed    16m      Setup          19s               
fluxjob-173132375788618752   Setup      true    Completed    17m      Setup          10s               
fluxjob-173132494621639680   Setup      true    Completed    17m      Setup          20s               
fluxjob-173132618135503872   Setup      true    Completed    17m      Setup          11s               
fluxjob-173132745491350528   Setup      true    Completed    17m      Setup          22s               
fluxjob-173132878769554432   Setup      true    Completed    17m      Setup          11s               
fluxjob-173133017617794048   Setup      true    Completed    17m      Setup          22s               
fluxjob-173133438390371328   Setup      true    Completed    17m      Setup          10s               
fluxjob-173133580594054144   Setup      true    Completed    17m      Setup          19s               
fluxjob-173133727998673920   Setup      true    Completed    18m      Setup          16s               
fluxjob-173133880855888896   Setup      true    Completed    18m      Setup          24s               
fluxjob-173134034568741888   Setup      true    Completed    18m      Setup          17s               
fluxjob-173134188566807552   Setup      true    Completed    18m      Setup          24s               
fluxjob-173134643195806720   Setup      true    Completed    20m      Setup          79s               
fluxjob-173134800398320640   Setup      true    Completed    21m      Setup          2m27s             
fluxjob-173134957433062400   Setup      true    Completed    23m      Setup          2m21s             
fluxjob-173135115608654848   Setup      true    Completed    23m      Setup          2m26s             
fluxjob-173135273532589056   Setup      true    Completed    23m      Setup          15s               
fluxjob-173135743143642112   Setup      true    Completed    23m      Setup          26s               
fluxjob-173135905664533504   Setup      true    Completed    24m      Setup          74s               
fluxjob-173136068437083136   Setup      true    Completed    24m      Setup          86s       
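To compare runs like the one above, the DESIREDSTATECHANGE column can be converted to plain seconds and summarized. The following is a minimal sketch, not part of the original report: the helper name `to_seconds` is hypothetical, and the sample list is simply the first ten DESIREDSTATECHANGE values copied from the table.

```python
import re
from statistics import median

# Matches compact durations as printed in the table: "9s", "73s", "2m8s", "11m".
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def to_seconds(text: str) -> int:
    """Convert a compact duration such as '2m8s' to whole seconds.

    Hypothetical helper for illustration; it only handles the h/m/s
    units that appear in the table above.
    """
    match = _DURATION.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# First ten DESIREDSTATECHANGE values from the table above.
sample = ["11s", "22s", "73s", "84s", "14s", "23s", "9s", "20s", "12s", "23s"]
seconds = [to_seconds(value) for value in sample]

summary = {
    "min": min(seconds),
    "median": median(seconds),
    "max": max(seconds),
}
print(summary)
```

Feeding all 50 values through the same helper gives the range for the whole run; the AGE column uses the same notation, so it can be parsed the same way.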

ajfloeder commented 1 year ago

A few questions about the behavior of flux in the workflows here:

  1. Are both computes assigned to each workflow such that the workflows serialize on the computes at pre_run?
  2. When does flux progress a workflow to setup, immediately after proposal, or after the computes for that workflow become available, or some other timing?
  3. After post_run, does flux attempt to drive the workflow through data_out and teardown as quickly as possible, such that workflow n is going through the teardown while workflow n+1 is progressing to setup?

behlendorf commented 1 year ago

> Are both computes assigned to each workflow such that the workflows serialize on the computes at pre_run?

For the results above each workflow requests a single compute.

> When does flux progress a workflow to setup, immediately after proposal, or after the computes for that workflow become available, or some other timing?

The workflow is progressed to setup when the compute is scheduled.

> After post_run, does flux attempt to drive the workflow through data_out and teardown as quickly as possible, such that workflow n is going through the teardown while workflow n+1 is progressing to setup?

Currently Flux will progress the workflow through post_run and teardown as quickly as possible. However, once the compute has been released a new workflow can progress through setup while the previous workflow is still progressing to teardown.
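
The overlap described in this answer can be illustrated with a toy timeline. This is only a sketch under stated assumptions: it models a single compute, lumps everything between setup and the release of the compute into one `run` duration, ignores data_out, and uses made-up durations (10 s setup, 30 s run, 5 s teardown). It is not how Flux or the workflow controller is implemented.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two half-open intervals [start, end) share any time."""
    return a.start < b.end and b.start < a.end


def schedule(count: int, setup: float, run: float, teardown: float):
    """Lay out `count` workflows back to back on one compute.

    Assumption for illustration: a workflow enters setup when the compute
    is scheduled to it, and the compute is released once the run ends,
    so the next workflow's setup starts while this one tears down.
    Returns a list of (setup_interval, teardown_interval) pairs.
    """
    timeline = []
    compute_free_at = 0.0
    for _ in range(count):
        setup_iv = Interval(compute_free_at, compute_free_at + setup)
        released = setup_iv.end + run  # compute released after the run
        teardown_iv = Interval(released, released + teardown)
        timeline.append((setup_iv, teardown_iv))
        compute_free_at = released
    return timeline


# Hypothetical durations in seconds, not measurements from this issue.
timeline = schedule(count=2, setup=10, run=30, teardown=5)
first_teardown = timeline[0][1]
second_setup = timeline[1][0]
print(overlaps(first_teardown, second_setup))
```

In this model the teardown of workflow n and the setup of workflow n+1 begin at the same instant, so both are active at once and any contention between them would show up in the setup and teardown timings.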

matthew-richerson commented 1 year ago

Were there any other allocations on the system other than the 50 you made?

behlendorf commented 1 year ago

These were the only allocations on the system for the data in the comment above (https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/29#issuecomment-1435419842).

behlendorf commented 1 year ago

Setup and Teardown timings when running with https://github.com/NearNodeFlash/nnf-ec/pull/69:

fluxjob-187272966025774080   Setup      true    Completed    18s    187272966025774080   Setup     9s    9.188051s
fluxjob-187272975001584640   Setup      true    Completed    27s    187272975001584640   Setup     18s   18.016093s
fluxjob-187272983994172416   Setup      true    Completed    2m19s  187272983994172416   Setup     66s   1m6.388256s
fluxjob-187273408206078976   Setup      true    Completed    11m    187273408206078976   Setup     7s    7.891774s
fluxjob-187273500581430272   Setup      true    Completed    12m    187273500581430272   Setup     8s    8.616216s
fluxjob-187273602687566848   Setup      true    Completed    13m    187273602687566848   Setup     8s    8.466047s
fluxjob-187273714641929216   Setup      true    Completed    13m    187273714641929216   Setup     14s   14.052869s
fluxjob-187273836545180672   Setup      true    Completed    14m    187273836545180672   Setup     9s    9.311961s
fluxjob-187274069631042560   Setup      true    Completed    14m    187274069631042560   Setup     17s   17.070248s
fluxjob-187274201550291968   Setup      true    Completed    15m    187274201550291968   Setup     8s    8.256352s
fluxjob-187274343032554496   Setup      true    Completed    15m    187274343032554496   Setup     16s   16.914862s
fluxjob-187274494463706112   Setup      true    Completed    15m    187274494463706112   Setup     7s    7.883123s
fluxjob-187274656212845568   Setup      true    Completed    15m    187274656212845568   Setup     6s    6.13419s
fluxjob-187274828028314624   Setup      true    Completed    16m    187274828028314624   Setup     59s   59.901325s
fluxjob-187275009725563904   Setup      true    Completed    16m    187275009725563904   Setup     46s   46.498988s
fluxjob-187272993020314624   Setup      true    Completed    3m36s  187272993020314624   Setup     6s    6.659889s
fluxjob-187275201623360512   Setup      true    Completed    18m    187275201623360512   Setup     31s   31.452478s
fluxjob-187275403923031040   Setup      true    Completed    18m    187275403923031040   Setup     36s   36.757735s
fluxjob-187275615970264064   Setup      true    Completed    19m    187275615970264064   Setup     29s   29.95692s
fluxjob-187275838603920384   Setup      true    Completed    19m    187275838603920384   Setup     36s   36.57097s
fluxjob-187276274123670528   Setup      true    Completed    19m    187276274123670528   Setup     34s   34.818056s
fluxjob-187276507696071680   Setup      true    Completed    19m    187276507696071680   Setup     40s   40.170695s
fluxjob-187276751368356864   Setup      true    Completed    20m    187276751368356864   Setup     7s    7.181522s
fluxjob-187277005526402048   Setup      true    Completed    20m    187277005526402048   Setup     12s   12.798472s
fluxjob-187277524361806848   Setup      true    Completed    20m    187277524361806848   Setup     8s    8.36705s
fluxjob-187278045059482624   Setup      true    Completed    20m    187278045059482624   Setup     15s   15.104597s
fluxjob-187278311213237248   Setup      true    Completed    20m    187278311213237248   Setup     7s    7.368479s
fluxjob-187278588523840512   Setup      true    Completed    20m    187278588523840512   Setup     13s   13.480226s
fluxjob-187279154117346304   Setup      true    Completed    22m    187279154117346304   Setup     104s  1m44.442539s
fluxjob-187279443021005824   Setup      true    Completed    22m    187279443021005824   Setup     6s    6.400543s
fluxjob-187279743064736768   Setup      true    Completed    22m    187279743064736768   Setup     8s    8.352269s
fluxjob-187280054063989760   Setup      true    Completed    24m    187280054063989760   Setup     85s   1m25.055972s
fluxjob-187280989729326080   Setup      true    Completed    24m    187280989729326080   Setup     7s    7.544734s
fluxjob-187273002029679616   Setup      true    Completed    5m8s   187273002029679616   Setup     6s    6.350688s
fluxjob-187281303731700736   Setup      true    Completed    25m    187281303731700736   Setup     8s    8.634462s
fluxjob-187281628119172096   Setup      true    Completed    28m    187281628119172096   Setup     2m51s 2m51.277786s
fluxjob-187281964284249088   Setup      true    Completed    28m    187281964284249088   Setup     8s    8.928348s
fluxjob-187282627370157056   Setup      true    Completed    28m    187282627370157056   Setup     8s    8.505554s
fluxjob-187283293878617088   Setup      true    Completed    28m    187283293878617088   Setup     52s   52.739574s
fluxjob-187283633952785408   Setup      true    Completed    29m    187283633952785408   Setup     8s    8.238352s
fluxjob-187283974563824640   Setup      true    Completed    29m    187283974563824640   Setup     7s    7.707988s
fluxjob-187284646407439360   Setup      true    Completed    29m    187284646407439360   Setup     6s    6.299939s
fluxjob-187285305097716736   Setup      true    Completed    28m    187285305097716736   Setup     6s    6.333515s
fluxjob-187285641715778560   Setup      true    Completed    28m    187285641715778560   Setup     7s    7.048827s
fluxjob-187273011055821824   Setup      true    Completed    6m24s  187273011055821824   Setup     68s   1m8.809988s
fluxjob-187273020048409600   Setup      true    Completed    6m30s  187273020048409600   Setup     8s    8.682105s
fluxjob-187273091267691520   Setup      true    Completed    8m14s  187273091267691520   Setup     6s    6.441196s
fluxjob-187273100277056512   Setup      true    Completed    8m20s  187273100277056512   Setup     12s   12.478331s
fluxjob-187273109286421504   Setup      true    Completed    10m    187273109286421504   Setup     9s    9.101292s
fluxjob-187273200688694272   Setup      true    Completed    10m    187273200688694272   Setup     6s    6.243909s
fluxjob-187272966025774080   Teardown   true    Completed    38s    187272966025774080   Teardown  5s    3.375206s
fluxjob-187272975001584640   Teardown   true    Completed    38s    187272975001584640   Teardown  5s    3.387917s
fluxjob-187272983994172416   Teardown   true    Completed    2m32s  187272983994172416   Teardown  5s    1.889071s
fluxjob-187272993020314624   Teardown   true    Completed    3m47s  187272993020314624   Teardown  5s    1.827056s
fluxjob-187273002029679616   Teardown   true    Completed    5m21s  187273002029679616   Teardown  5s    1.860114s
fluxjob-187273011055821824   Teardown   true    Completed    6m44s  187273011055821824   Teardown  5s    4.126638s
fluxjob-187273020048409600   Teardown   true    Completed    6m44s  187273020048409600   Teardown  5s    4.112023s
fluxjob-187273091267691520   Teardown   true    Completed    8m24s  187273091267691520   Teardown  5s    2.869071s
fluxjob-187273100277056512   Teardown   true    Completed    8m44s  187273100277056512   Teardown  5s    1.83522s
fluxjob-187273109286421504   Teardown   true    Completed    10m    187273109286421504   Teardown  5s    2.551312s
fluxjob-187273200688694272   Teardown   true    Completed    10m    187273200688694272   Teardown  5s    2.314873s
fluxjob-187273408206078976   Teardown   true    Completed    12m    187273408206078976   Teardown  5s    1.670701s
fluxjob-187273500581430272   Teardown   true    Completed    12m    187273500581430272   Teardown  5s    1.825884s
fluxjob-187273602687566848   Teardown   true    Completed    13m    187273602687566848   Teardown  5s    2.027782s
fluxjob-187273714641929216   Teardown   true    Completed    13m    187273714641929216   Teardown  5s    1.987517s
fluxjob-187273836545180672   Teardown   true    Completed    14m    187273836545180672   Teardown  4s    4.514043s
fluxjob-187274069631042560   Teardown   true    Completed    14m    187274069631042560   Teardown  4s    4.615079s
fluxjob-187274201550291968   Teardown   true    Completed    15m    187274201550291968   Teardown  5s    5.478054s
fluxjob-187274343032554496   Teardown   true    Completed    15m    187274343032554496   Teardown  5s    2.747866s
fluxjob-187274494463706112   Teardown   true    Completed    15m    187274494463706112   Teardown  5s    1.691979s
fluxjob-187274656212845568   Teardown   true    Completed    17m    187274656212845568   Teardown  65s   1m1.896541s
fluxjob-187274828028314624   Teardown   true    Completed    18m    187274828028314624   Teardown  65s   1m2.67086s
fluxjob-187275009725563904   Teardown   true    Completed    17m    187275009725563904   Teardown  5s    1.828701s
fluxjob-187275201623360512   Teardown   true    Completed    19m    187275201623360512   Teardown  60s   59.949363s
fluxjob-187275403923031040   Teardown   true    Completed    19m    187275403923031040   Teardown  65s   1m4.943822s
fluxjob-187275615970264064   Teardown   true    Completed    20m    187275615970264064   Teardown  65s   1m4.13842s
fluxjob-187275838603920384   Teardown   true    Completed    19m    187275838603920384   Teardown  5s    1.90792s
fluxjob-187276274123670528   Teardown   true    Completed    20m    187276274123670528   Teardown  5s    1.81926s
fluxjob-187276507696071680   Teardown   true    Completed    20m    187276507696071680   Teardown  5s    1.749898s
fluxjob-187276751368356864   Teardown   true    Completed    20m    187276751368356864   Teardown  5s    3.627666s
fluxjob-187277005526402048   Teardown   true    Completed    20m    187277005526402048   Teardown  5s    3.642023s
fluxjob-187277524361806848   Teardown   true    Completed    21m    187277524361806848   Teardown  5s    4.753855s
fluxjob-187278045059482624   Teardown   true    Completed    20m    187278045059482624   Teardown  5s    4.740378s
fluxjob-187278311213237248   Teardown   true    Completed    23m    187278311213237248   Teardown  2m    1m57.282201s
fluxjob-187279154117346304   Teardown   true    Completed    22m    187279154117346304   Teardown  5s    1.870114s
fluxjob-187279443021005824   Teardown   true    Completed    22m    187279443021005824   Teardown  5s    2.049779s
fluxjob-187279743064736768   Teardown   true    Completed    25m    187279743064736768   Teardown  2m5s  2m3.19234s
fluxjob-187280054063989760   Teardown   true    Completed    25m    187280054063989760   Teardown  5s    1.977163s
fluxjob-187280989729326080   Teardown   true    Completed    25m    187280989729326080   Teardown  45s   42.970916s
fluxjob-187281303731700736   Teardown   true    Completed    26m    187281303731700736   Teardown  65s   1m2.157369s
fluxjob-187281628119172096   Teardown   true    Completed    28m    187281628119172096   Teardown  5s    1.674247s
fluxjob-187281964284249088   Teardown   true    Completed    28m    187281964284249088   Teardown  5s    2.040205s
fluxjob-187282627370157056   Teardown   true    Completed    29m    187282627370157056   Teardown  64s   1m2.171206s
fluxjob-187283293878617088   Teardown   true    Completed    29m    187283293878617088   Teardown  5s    2.090346s
fluxjob-187283633952785408   Teardown   true    Completed    29m    187283633952785408   Teardown  5s    1.864345s
fluxjob-187283974563824640   Teardown   true    Completed    29m    187283974563824640   Teardown  5s    1.809296s
fluxjob-187284646407439360   Teardown   true    Completed    29m    187284646407439360   Teardown  5s    1.995767s
fluxjob-187285305097716736   Teardown   true    Completed    28m    187285305097716736   Teardown  5s    1.844718s
fluxjob-187285641715778560   Teardown   true    Completed    29m    187285641715778560   Teardown  5s    1.835342s

matthew-richerson commented 1 year ago

Thanks, Brian. That latest fix seems to show an improvement: more of the setup and teardown times are now in the single-digit seconds. I checked the logs from the system, and it looks like the major outliers (roughly 60 seconds or more) are caused by a timeout error from the switch. We're able to see the same thing internally, and we're investigating.
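
One way to pull those outliers out of a listing like the one above is to parse the last column and split on a threshold. The sketch below is illustrative only: the function names and the 60-second cutoff are assumptions based on the "~60 seconds" remark, and the rows are five Setup entries copied from the table.

```python
import re

# Matches elapsed times as printed in the last column, e.g. "9.188051s"
# or "1m6.388256s" (fractional seconds, optional hours and minutes).
_ELAPSED = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?$")


def elapsed_seconds(text: str) -> float:
    """Convert an elapsed time such as '1m6.388256s' to seconds."""
    match = _ELAPSED.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def split_outliers(rows, threshold=60.0):
    """Partition (name, elapsed) rows into (normal, outliers).

    The 60-second default is an assumed cutoff, not a value taken from
    the controller or the switch.
    """
    normal, outliers = [], []
    for name, elapsed in rows:
        bucket = outliers if elapsed_seconds(elapsed) >= threshold else normal
        bucket.append(name)
    return normal, outliers


# Five Setup rows copied from the table above.
rows = [
    ("fluxjob-187272966025774080", "9.188051s"),
    ("fluxjob-187272983994172416", "1m6.388256s"),
    ("fluxjob-187274828028314624", "59.901325s"),
    ("fluxjob-187279154117346304", "1m44.442539s"),
    ("fluxjob-187281628119172096", "2m51.277786s"),
]
normal, outliers = split_outliers(rows)
print(outliers)
```

The 59.9-second row lands just under the cutoff, so a somewhat lower threshold may be more useful when matching the flagged jobs against switch timeout errors in the logs.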