Closed — gimenete closed this issue 4 years ago
Hello @gimenete, @BrightRan! We double-checked that there were no changes in the images that might affect the customer's build. We also ran a build with the same commit the customer used on Jan 27, 2020 and got the same result as the client (the same build time, about 20 minutes). The issue is probably related to the user's changes in the repository, not to changes in our images.
Hello @nikita-bykov, thank you very much for helping us investigate this issue. We double-checked our commits after Jan 27, 2020; they shouldn't cause the build to pause for 2 hours in the middle. We currently run our build on Travis, and it has been stable at around 30 minutes.
One thing I noticed while experimenting with the GitHub Actions platform: if I build the Docker images locally and then follow the rest of the steps, the build can still finish within the 20-minute window. However, if I pull the Docker images and run the rest of the tests instead of building the images locally, the build randomly pauses for some time in the middle of the tests (not during the image-pulling stage). Not sure if this information is helpful, though.
Actually, I just reverted our code to the commit from Jan 27, 2020 at https://github.com/candicegjing/codalab-worksheets/runs/518107689, and it has been running for more than 1 hour now. I think this issue still persists in our repo. Please let us know if we can provide any information that would help investigate this issue.
Hello @candicegjing, after some investigation I have reproduced the issue on a clean Azure Ubuntu 18.04 VM without any pre-installed software.
The process freezes at various places during this stage:
python3 codalab_service.py test --version master default
process: >> cl info -f state 0x4184042dde0f4b4788df617b80e67ef9
process: (exit code 0, expected 0)
process:
process: running
process:
process: >> cl kill 0x4184042dde0f4b4788df617b80e67ef9
process: (exit code 0, expected 0)
process:
process: 0x4184042dde0f4b4788df617b80e67ef9
process:
process: >> cl wait 0x4184042dde0f4b4788df617b80e67ef9
The most frequent freeze points are `cl wrm` and `cl info`:
process: >> cl wrm --force 0x51e6882ec1294b9ebe89a149059a204b
process: (exit code 0, expected 0)
process:
process:
process: >> cl info -f state 0x4184042dde0f4b4788df617b80e67ef9
process: (exit code 0, expected 0)
@al-cheb Thank you very much for helping us check this issue. I also saw the 2-hour pause in the build mentioned at https://github.com/actions/virtual-environments/issues/533#issuecomment-600951994. It seems like jobs can pause at random stages. Do you by any chance know what could cause this issue? When we run this build locally, on Travis, and on GitHub Actions (before the end of Jan 2020), we don't see this issue and jobs run seamlessly. I have the following thoughts; please correct me if I am wrong, given that I am not an expert on this platform:
@candicegjing I have tried a Europe location with a different Hyper-V virtualization host and a fresh Ubuntu 18.04.4 (without any network limits), and the build got stuck at the beginning with the same symptoms.
@al-cheb Thanks for helping us check on this issue again! I reran our build several times today and found the freezing point is quite random. For example, this build paused at:
Thu, 19 Mar 2020 23:52:52 GMT process: [*][*] BEGIN MODULE: basic
Thu, 19 Mar 2020 23:52:52 GMT process:
Thu, 19 Mar 2020 23:52:52 GMT process: [*][*] SWITCHING TO TEMPORARY WORKSHEET
Fri, 20 Mar 2020 02:03:39 GMT process: >> cl work -u
Fri, 20 Mar 2020 02:03:39 GMT process: (exit code 0, expected 0)
Please advise what we should do next to migrate to GitHub Actions.
@candicegjing, would you have a chance to run your build in Azure Pipelines and compare results?
@al-cheb I haven't tried Azure Pipelines, but I can definitely set that up and compare the results. However, I still think something changed over the past months that causes the build to pause in the middle of its life cycle, given that I ran it several times before the end of Jan 2020 and didn't see this issue.
@al-cheb I set up our build in Azure Pipelines and it wasn't able to finish there either, e.g. Azure Build raw log. What our build does is basically start several Docker container services and run tests against them. We have run our build regularly on native Ubuntu and on Travis for a long time and hadn't seen this issue until moving to GitHub Actions.
I was wondering if there is something special about the Actions platform that we may have missed configuring on our end?
Also, I found that this user seems to have the same situation as us: Unusual-Minutes-Usage-and-Run-Durations, although they are using a private repo.
Do you think our issue could be related to actions/runner?
@candicegjing, I think we don't have enough free space on the image. Could you please try the pipeline with a cleanup step?
```yaml
jobs:
  build:
    runs-on: [ubuntu-latest]
    steps:
      - name: Clear freespace
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          df -h
      - name: clone
        run: |
          git clone https://github.com/codalab/codalab-worksheets
          cd codalab-worksheets
          python3 -m pip install --upgrade pip
          python3 -m pip install setuptools
          python3 -m pip install -r requirements.txt
      - name: func
        run: |
          cd codalab-worksheets
          python3 codalab_service.py build services --version master --pull
          python3 codalab_service.py start --services default monitor --version master
          python3 codalab_service.py test --version master default
          python3 codalab_service.py test run --version master
```
@al-cheb I tried cleaning up space and it did the trick and solved our problem! Thank you very much for pointing this out. One question I have is whether the directories we remove at the beginning of each build are shared with other builds as well. What I was worried about are the following two cases
@candicegjing, each job in each build runs in a fresh instance of a virtual environment.
Job: a set of steps that execute on the same runner. You can define dependency rules for how jobs run in a workflow file. Jobs can run at the same time in parallel or run sequentially depending on the status of a previous job. For example, a workflow can have two sequential jobs that build and test code, where the test job depends on the status of the build job. If the build job fails, the test job will not run. For GitHub-hosted runners, each job in a workflow runs in a fresh instance of a virtual environment.
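A minimal sketch of the sequential-jobs pattern described above (the job names and echo commands are illustrative, not from the thread):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build runs in its own fresh VM"
  test:
    needs: build   # runs only after the build job succeeds
    runs-on: ubuntu-latest
    steps:
      - run: echo "test runs in a separate fresh VM"
```

Because each job gets a fresh VM, directories removed by a cleanup step in one job do not affect any other job or build.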
@al-cheb Could you please check how much free space we currently have on the Ubuntu / Windows images? Based on the documentation, all images should have at least 14 GB of free space.
@maxim-lobanov, to pass the customer's build we would need at least 25-30 GB of free space, about twice as much as we have now (14-16 GB).
@al-cheb , I see, thank you for checking! @candicegjing , Please let us know if you have any additional questions
Thanks @al-cheb and @maxim-lobanov! I was also wondering: if we want to run our build on different OSes, e.g. Windows and macOS, should we clean up different directories?
@candicegjing, I think it should not be a problem for macOS and Windows; we need the cleanup only on Ubuntu. The macOS image contains about 80 GB of free space, and the Windows images about 125 GB (as Alex mentioned). So I would suggest adding a condition to the cleanup step so it runs only on Ubuntu.
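One way to express that condition (a sketch; the step body mirrors the cleanup step suggested earlier in this thread):

```yaml
steps:
  - name: Clear freespace
    if: runner.os == 'Linux'   # skip on macOS and Windows, which have ample free space
    run: |
      sudo rm -rf /usr/share/dotnet
      sudo rm -rf /opt/ghc
      df -h
```

The `runner.os` context evaluates to `Linux`, `macOS`, or `Windows`, so the step is simply skipped on the other images.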
@candicegjing , Closing the issue, but feel free to reopen if you have any concerns.
@al-cheb @maxim-lobanov this cleanup step takes a while -- see that it takes 6 minutes and 21 seconds in this case -- https://github.com/codalab/codalab-worksheets/runs/696582920?check_suite_focus=true
Is it possible to make this faster, or alternatively, just increase the default disk space available on Ubuntu?
Hello @epicfaace, sorry, but for now we have no way to extend our Ubuntu images. Sometimes we deprecate old software and gain disk space, but then we add new software and lose it again. Based on our documentation, our machines contain at least 14 GB of free space (currently, 16 GB), but I would say you should not rely on having more than 14 GB.
As for the duration of the folder-removal workaround: how much space do you need for your build?
`rm -rf /usr/share/dotnet` takes quite a while but releases ~20 GB of free space.
`rm -rf /opt/ghc` releases 8 GB.
So if you run only `rm -rf /opt/ghc`, you will have 22 GB free and it will take about 20 seconds.
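To see how much each removal actually frees on a given image, one can compare `df` output before and after the deletion. A rough sketch (assumes GNU coreutils `df`; the helper names `free_mb` and `reclaim` are mine, not from the thread; the deletion is irreversible, so use this only on disposable CI runners):

```shell
free_mb() {
  # Available space, in 1 MB blocks, on the filesystem containing $1.
  df -BM --output=avail "$1" | tail -n 1 | tr -dc '0-9'
}

reclaim() {
  # Delete directory $1 and report how much space that freed.
  dir="$1"
  parent=$(dirname "$dir")
  before=$(free_mb "$parent")
  rm -rf "$dir"
  after=$(free_mb "$parent")
  echo "Removing $dir freed $((after - before)) MB (now ${after} MB free)"
}
```

In a workflow step you would run it with root privileges, e.g. `sudo rm -rf` inside `reclaim`, against `/opt/ghc` or `/usr/share/dotnet`.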
@epicfaace It should be faster to remove these folders:
sudo find /usr/share/dotnet -delete
sudo find /opt/ghc -delete
Thanks for your help! Just doing `rm -rf /opt/ghc` did the trick for me.
Describe the bug
A customer reported that a step in a workflow now takes much longer to run than it did in the past: in Jan 2020 this step took about 20 minutes, but now it takes more than one hour.
cc @BrightRan
Associated ticket: https://github.community/t5/GitHub-Actions/Python-workflow-runs-much-longer-now-than-in-Jan-27-2020/td-p/49428
Area for Triage:
Question, Bug, or Feature?:
Virtual environments affected
Expected behavior
Same speed as before.
Actual behavior
If we take a closer look at the recent log, we notice that execution was paused for 2 hours between the following actions: