actions / runner-images

GitHub Actions runner images

Python being slower than usual or execution paused for 2 hours #533

Closed: gimenete closed this issue 4 years ago

gimenete commented 4 years ago

Describe the bug

A customer reported that a step in their workflow now takes much longer to run than it did in the past. In January 2020 the step took about 20 minutes; now it takes more than one hour.

cc @BrightRan

Associated ticket: https://github.community/t5/GitHub-Actions/Python-workflow-runs-much-longer-now-than-in-Jan-27-2020/td-p/49428

Area for Triage:

Question, Bug, or Feature?:

Virtual environments affected

Expected behavior

Same speed as before.

Actual behavior

Taking a closer look at the most recent log, we noticed that execution had been paused for 2 hours between the following lines:

2020-03-03T17:56:19.4378710Z codalab_worker_1
2020-03-03T19:04:26.1814197Z [CodaLab] Running tests
nikita-bykov commented 4 years ago

Hello @gimenete, @BrightRan! We double-checked that there were no changes in the images that could affect the customer's build. We also built the same commit the customer used on Jan 27, 2020 and got the same result as the client (the same build time, about 20 minutes). The issue is probably related to changes in the user's repository rather than to changes in our images.

candicegjing commented 4 years ago

Hello @nikita-bykov, thank you very much for helping us investigate this issue. We double-checked our commits after Jan 27, 2020; they shouldn't cause the build to pause for 2 hours in the middle. We currently also run our build on Travis, where it has been stable at around 30 minutes.

One thing I noticed while experimenting with the GitHub Actions platform: if I build the Docker images locally and then follow the rest of the steps, the run still fits in the 20-minute window. However, if I pull the Docker images instead of building them locally and then run the rest of the tests, the run randomly pauses for some time in the middle of the tests (not during the image-pulling stage). Not sure if this information is helpful, though.

candicegjing commented 4 years ago

Actually, I just reverted our code to the commit from Jan 27, 2020 (https://github.com/candicegjing/codalab-worksheets/runs/518107689), and it now runs for more than 1 hour. I think this issue still persists in our repo. Please let us know if we can provide any information that would help investigate it.

al-cheb commented 4 years ago

Hello @candicegjing, after some investigation I have reproduced the issue on a clean Azure Ubuntu 18.04 VM without any pre-installed software.

The process freezes at various places during this stage: python3 codalab_service.py test --version master default

process: >> cl info -f state 0x4184042dde0f4b4788df617b80e67ef9
process:  (exit code 0, expected 0)
process: 
process: running
process: 
process: >> cl kill 0x4184042dde0f4b4788df617b80e67ef9
process:  (exit code 0, expected 0)
process: 
process: 0x4184042dde0f4b4788df617b80e67ef9
process: 
process: >> cl wait 0x4184042dde0f4b4788df617b80e67ef9

The places where it freezes most often are cl wrm and cl info:

process: >> cl wrm --force 0x51e6882ec1294b9ebe89a149059a204b
process:  (exit code 0, expected 0)
process: 
process: 
process: >> cl info -f state 0x4184042dde0f4b4788df617b80e67ef9
process:  (exit code 0, expected 0)
candicegjing commented 4 years ago

@al-cheb Thank you very much for helping us check this issue. I also saw the 2-hour pause in the build mentioned at https://github.com/actions/virtual-environments/issues/533#issuecomment-600951994. It seems like jobs can be paused at random stages. Do you by any chance know what could cause this? When we run this build locally, on Travis, and on GitHub Actions (before the end of January 2020), we don't see this issue and jobs run seamlessly. I have the following thoughts; please correct me if I am wrong, given that I am not an expert on this platform:

  1. Could this be affected by GitHub Actions billing, even though our repository is public and hence there shouldn't be any restrictions?
  2. Could any network configuration changes affect this?
  3. Considering what @nikita-bykov mentioned in https://github.com/actions/virtual-environments/issues/533#issuecomment-599927087, if there is no change to the image itself, what is the next layer on top of the image that could cause this (e.g. the job scheduling)? It looks as though the compute resources owned by our build are somehow switched away in the middle for 2 hours and then switched back later.
al-cheb commented 4 years ago

@candicegjing I have tried a Europe location with different Hyper-V virtualization and a fresh Ubuntu 18.04.4 (without any network limits), and the build got stuck at the beginning with the same symptoms.

[attached image: docker]

candicegjing commented 4 years ago

@al-cheb Thanks for helping us check on this issue again! I reran our build several times today and found that the freezing point is quite random; e.g., this build got paused at:

Thu, 19 Mar 2020 23:52:52 GMT process: [*][*] BEGIN MODULE: basic
Thu, 19 Mar 2020 23:52:52 GMT process: 
Thu, 19 Mar 2020 23:52:52 GMT process: [*][*] SWITCHING TO TEMPORARY WORKSHEET
Fri, 20 Mar 2020 02:03:39 GMT process: >> cl work -u
Fri, 20 Mar 2020 02:03:39 GMT process:  (exit code 0, expected 0)

Please advise us on what we should do next to migrate to GitHub Actions.

al-cheb commented 4 years ago

@candicegjing, would you have a chance to run your build in Azure Pipelines and compare the results?

candicegjing commented 4 years ago

@al-cheb I haven't tried Azure Pipelines, but I can definitely set that up and compare the results. However, I still think something changed over the past months that causes the build to pause in the middle of its life cycle, given that I ran it several times before the end of January 2020 and didn't see this issue.

candicegjing commented 4 years ago

@al-cheb I set up our build in Azure Pipelines and it wasn't able to finish there either, e.g. the Azure Build raw log. What our build does is basically start up several Docker container services and run tests against them. We have run our build regularly on native Ubuntu and on Travis for a long time and hadn't seen this issue until moving to GitHub Actions.

  1. I was wondering whether there is something specific to the Actions platform that we might have missed configuring on our end?

  2. Also, I found that this user seems to be in the same situation as us: Unusual-Minutes-Usage-and-Run-Durations, although they are using a private repo.

  3. Do you think our issue could be related to actions/runner?

al-cheb commented 4 years ago

@candicegjing, I think there isn't enough free space on the image. Could you please try the pipeline with a cleanup step?

jobs:
  build:
    runs-on: [ubuntu-latest]
    steps:
    - name: Clear freespace
      # Remove large pre-installed toolchains to free disk space
      run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          df -h
    - name: clone
      # Clone the repository and install the Python dependencies
      run: |
          git clone https://github.com/codalab/codalab-worksheets
          cd codalab-worksheets
          python3 -m pip install --upgrade pip
          python3 -m pip install setuptools
          python3 -m pip install -r requirements.txt
    - name: func
      # Build the service images, start the services, and run the test suites
      run: |
          cd codalab-worksheets
          python3 codalab_service.py build services --version master --pull
          python3 codalab_service.py start --services default monitor --version master
          python3 codalab_service.py test --version master default
          python3 codalab_service.py test run --version master
candicegjing commented 4 years ago

@al-cheb I tried cleaning up space and it does the trick and solves our problem! Thank you very much for pointing this out. One question I have is whether the directories we remove at the beginning of each build are shared with other builds as well. What I was worried about are the following two cases:

  1. Will there be any problem when master is running a build and a PR is also running a build, if both try to write to or clear the same directory?
  2. Will there be any problem if this deletion operation affects other people's builds? I guess I am trying to understand the structure of the build system so that we don't break others' jobs. If you have any suggestions, please let me know.
al-cheb commented 4 years ago

@candicegjing, Each job and build runs in a fresh instance of a virtual environment.

https://help.github.com/en/actions/getting-started-with-github-actions/core-concepts-for-github-actions

Job: A set of steps that execute on the same runner. You can define the dependency rules for how jobs run in a workflow file. Jobs can run at the same time in parallel or run sequentially depending on the status of a previous job. For example, a workflow can have two sequential jobs that build and test code, where the test job is dependent on the status of the build job. If the build job fails, the test job will not run. For GitHub-hosted runners, each job in a workflow runs in a fresh instance of a virtual environment.
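
As a small illustration of the dependency rules described in that quote, a workflow with two sequential jobs could look like the sketch below (the job names and echo commands are placeholders, not part of the CodaLab build):

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - run: echo "building"     # placeholder build step
  test:
    needs: build               # runs only after the build job succeeds
    runs-on: ubuntu-latest     # still a fresh virtual environment for this job
    steps:
    - run: echo "testing"      # placeholder test step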

maxim-lobanov commented 4 years ago

@al-cheb Could you please check how much free space we currently have on the Ubuntu / Windows images? Based on the documentation, all images should have at least 14 GB of free space.

al-cheb commented 4 years ago

@maxim-lobanov, to pass the customer's build we would need at least 25-30 GB of free space, roughly twice what we have now (14-16 GB).

maxim-lobanov commented 4 years ago

@al-cheb, I see, thank you for checking! @candicegjing, please let us know if you have any additional questions.

candicegjing commented 4 years ago

Thanks @al-cheb and @maxim-lobanov! I was also wondering: if we want to run our build on a different OS, e.g. Windows or macOS, should we clean up different directories?

maxim-lobanov commented 4 years ago

@candicegjing, I think it should not be a problem for macOS and Windows; we need the cleanup only on Ubuntu. The macOS image contains about 80 GB of free space, and the Windows images about 125 GB (as Alex mentioned). So I would suggest setting a condition on the cleanup step so that it runs only on Ubuntu.
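
For example, a sketch of such a conditional cleanup step, reusing the commands from the workflow above (runner.os is the built-in context value and equals 'Linux' on the Ubuntu images):

    steps:
    - name: Clear freespace
      if: runner.os == 'Linux'    # skip the cleanup on macOS and Windows
      run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          df -h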

al-cheb commented 4 years ago

@candicegjing, closing the issue, but feel free to reopen if you have any concerns.

epicfaace commented 4 years ago

@al-cheb @maxim-lobanov this cleanup step takes a while -- see that it takes 6 minutes and 21 seconds in this case -- https://github.com/codalab/codalab-worksheets/runs/696582920?check_suite_focus=true

Is it possible to make this faster, or alternatively, just increase the default disk space available on Ubuntu?

maxim-lobanov commented 4 years ago

Hello @epicfaace, sorry, but for now we have no way to extend our Ubuntu images. Sometimes we deprecate old software and disk space increases, but then we add new software and it decreases again. Based on our documentation, our machines contain at least 14 GB of free space (currently, 16 GB), but I would say you should not rely on having more than 14 GB.

As for the time taken by the folder-removal workaround: how much space do you need for your build? rm -rf /usr/share/dotnet is quite slow but releases ~20 GB of free space; rm -rf /opt/ghc releases 8 GB. So if you do only rm -rf /opt/ghc, you will have 22 GB and it will take about 20 seconds.
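
Based on those numbers, a trimmed-down cleanup step could look like this sketch, assuming the extra 8 GB is enough for the build:

    - name: Clear freespace (fast variant)
      run: |
          sudo rm -rf /opt/ghc   # frees ~8 GB in roughly 20 seconds
          df -h                  # confirm the available space afterwards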

al-cheb commented 4 years ago

@epicfaace It should be faster to remove these folders:

sudo find /usr/share/dotnet -delete
sudo find /opt/ghc -delete
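
If it helps to measure how long the cleanup takes on a given runner, the removal commands can be prefixed with time inside the step (shown here for the find variant; the same works for rm -rf):

    - name: Clear freespace (timed)
      run: |
          time sudo find /usr/share/dotnet -delete
          time sudo find /opt/ghc -delete
          df -h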
epicfaace commented 4 years ago

Thanks for your help! Just doing rm -rf /opt/ghc did the trick for me.