GSA / data.gov

Main repository for the data.gov service
https://data.gov

improve catalog-web auto scale to handle crashed instances #4615

Open FuhuXia opened 6 months ago

FuhuXia commented 6 months ago

When crashed instances are present, the auto scale script does not handle them well. Changes can be made to:

How to reproduce

https://github.com/GSA/catalog.data.gov/actions/runs/7877938597/job/21495120826

Running command: datagov/bin/check-and-renew catalog-web scale
No job running for app catalog-web
Current total instances: 5
bc: divide by zero
bc: bad token at '> 320'
datagov/bin/scale_calculate.sh: line 34: [: : integer expression expected
bc: bad token at '< 250'
datagov/bin/scale_calculate.sh: line 37: [: : integer expression expected
Average CPU is . Just Right.
Remain at the same scale level 5
Scaling catalog-web to 5
Scaling app catalog-web in org gsa-datagov / space prod as ***...
...
Showing current scale of app catalog-web in org gsa-datagov / space prod as ***...

name:              catalog-web
requested state:   started
routes:            catalog-prod-datagov.apps.internal
last uploaded:     Mon 12 Feb 16:42:46 UTC 2024
stack:             cflinuxfs4
buildpacks:        
    name                                            version   detect output   buildpack name
    https://github.com/cloudfoundry/apt-buildpack   0.3.4                     apt
    python_buildpack                                1.8.19    python          python

type:           web
sidecars:       
instances:      1/5
memory usage:   850M
     state     since                  cpu      memory           disk           logging               details
#0   crashed   2024-02-12T21:05:11Z   0.0%     0 of 0           0 of 0         0/s of 0/s            
#1   crashed   2024-02-12T21:02:06Z   0.0%     0 of 0           0 of 0         0/s of 0/s            
#2   crashed   2024-02-12T21:02:35Z   0.0%     0 of 0           0 of 0         0/s of 0/s            
#3   crashed   2024-02-12T21:05:22Z   0.0%     0 of 0           0 of 0         0/s of 0/s            
#4   running   2024-02-12T21:09:18Z   229.2%   550.6M of 850M   861.6M of 2G   264B/s of unlimited   
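
One way to avoid the `bc: divide by zero` and empty-average failures in the log above is to average CPU over running instances only and to stop early when none are running. The following is a minimal sketch, not the actual contents of `scale_calculate.sh`: the function name `average_running_cpu` and its exit code are hypothetical, and it assumes the `cf app` instance rows keep the column layout shown above (state in the second column, CPU in the fourth).

```shell
#!/usr/bin/env bash
# Sketch only: average CPU across *running* instances from `cf app` output.
# The function name and exit code are hypothetical, not taken from scale_calculate.sh.

# Reads `cf app <name>` output on stdin.
# Prints the truncated integer average CPU of running instances,
# or returns 1 (printing nothing) when no instance is running.
average_running_cpu() {
  awk '
    # Instance rows look like: "#4   running   2024-...Z   229.2%   ..."
    $1 ~ /^#[0-9]+$/ && $2 == "running" {
      cpu = $4
      sub(/%$/, "", cpu)      # strip the trailing percent sign
      total += cpu
      count++
    }
    END {
      if (count == 0) exit 1  # nothing to average; avoid dividing by zero
      printf "%d\n", total / count
    }
  '
}
```

A caller could pipe `cf app catalog-web` into this function and skip the threshold comparison entirely when it returns non-zero, instead of passing an empty value to `bc` and `[`.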

Expected behavior

Actual behavior

Sketch


FuhuXia commented 6 months ago

It gives an error in https://github.com/GSA/catalog.data.gov/actions/runs/7910812915/job/21593986352

...
No job running for app catalog-web
Total instances is not a number. Exiting. Here is the output of the app status:
Showing health and status for app catalog-web in org gsa-datagov / space prod as ***...

name:              catalog-web
requested state:   started
routes:            catalog-prod-datagov.apps.internal
last uploaded:     Wed 14 Feb 19:43:24 UTC 2024
stack:             cflinuxfs4
buildpacks:        
    name                                            version   detect output   buildpack name
    https://github.com/cloudfoundry/apt-buildpack   0.3.4                     apt
    python_buildpack                                1.8.19    python          python

type:           web
sidecars:       
instances:      5/5
memory usage:   850M
     state     since                  cpu      memory           disk           logging               details
#0   running   2024-02-15T03:20:29Z   123.0%   766.8M of 850M   865.8M of 2G   288B/s of unlimited
#1   running   2024-02-15T03:21:06Z   187.6%   753M of 850M     865.8M of 2G   421B/s of unlimited   
#2   running   2024-02-15T03:21:41Z   159.3%   757.4M of 850M   865.8M of 2G   1.6K/s of unlimited   
#3   running   2024-02-15T03:22:14Z   240.4%   742.8M of 850M   865.8M of 2G   316B/s of unlimited
#4   running   2024-02-15T03:22:48Z   181.1%   740M of 850M     865.8M of 2G   202B/s of unlimited

type:           web
sidecars:       
instances:      0/1
memory usage:   850M
     state   since                  cpu    memory   disk     logging      details
#0   down    2024-02-15T03:45:07Z   0.0%   0 of 0   0 of 0   0/s of 0/s

+++++++++++++++++++++++++++ [update] This error is by design. We check that no other deployment is running before starting to scale, but in the split second between that check and the scaling, another deployment can still begin. The error reports this and stops the operation.
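
The status output above contains two `type: web` blocks, one at 5/5 and one at 0/1, which is consistent with a deployment starting mid-run. Assuming the script reads the total from an `instances:` line, a guard like the one below would reject that ambiguous output. This is a hedged sketch rather than the project's actual check, and the function name `total_instances` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: read the total instance count from `cf app` output and refuse
# to continue when the output is ambiguous. The function name is hypothetical.

# Reads `cf app <name>` output on stdin.
# Prints the total instance count, or returns 1 when the output has zero or
# more than one "instances:" line, or when the total is not a plain integer.
total_instances() {
  awk '
    # Lines look like: "instances:      5/5" (running/total)
    $1 == "instances:" {
      n++
      split($2, parts, "/")
      total = parts[2]
    }
    END {
      if (n != 1 || total !~ /^[0-9]+$/) exit 1
      print total
    }
  '
}
```

Treating more than one `instances:` line as a hard stop matches the behavior described in the update: the script reports the condition and exits rather than guessing which process block to scale.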