ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

ci: Migrate to GitHub Actions #571

Closed kyleam closed 3 years ago

kyleam commented 3 years ago

With the pending travis.org shutdown, we need to migrate to travis.com or another CI service. This series switches to GitHub Actions.

All the old .travis.yml jobs should have a corresponding job in .github/workflows/test.yml. As noted in the commit that adds the condor job, there was initially a condor_q-related core dump, but the latest runs haven't triggered it. Assuming it doesn't pop up again in this PR, that's still something to keep an eye on and ideally figure out.

codecov[bot] commented 3 years ago

Codecov Report

Merging #571 (91872de) into master (8277a6d) will increase coverage by 0.09%. The diff coverage is 90.00%.

@@            Coverage Diff             @@
##           master     #571      +/-   ##
==========================================
+ Coverage   89.11%   89.20%   +0.09%     
==========================================
  Files         149      149              
  Lines       12734    13035     +301     
==========================================
+ Hits        11348    11628     +280     
- Misses       1386     1407      +21     
| Impacted Files | Coverage Δ | |
|---|---|---|
| reproman/distributions/tests/test_debian.py | 98.12% <83.33%> (-0.56%) | :arrow_down: |
| reproman/distributions/tests/test_venv.py | 95.72% <100.00%> (+0.23%) | :arrow_up: |
| reproman/log.py | 64.28% <0.00%> (-3.58%) | :arrow_down: |
| reproman/support/tests/test_external_versions.py | 93.81% <0.00%> (-3.03%) | :arrow_down: |
| reproman/ui/dialog.py | 38.59% <0.00%> (-2.64%) | :arrow_down: |
| reproman/support/jobs/submitters.py | 69.36% <0.00%> (-2.26%) | :arrow_down: |
| reproman/distributions/redhat.py | 94.51% <0.00%> (-0.61%) | :arrow_down: |
| reproman/ui/tests/test_base.py | 23.07% <0.00%> (-0.46%) | :arrow_down: |
| reproman/tests/fixtures.py | 100.00% <0.00%> (ø) | |
| reproman/tests/test_api.py | 100.00% <0.00%> (ø) | |
| ... and 59 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8277a6d...c3b9436.

kyleam commented 3 years ago

> assuming [the condor_q core dump] doesn't pop up again in this PR

It did: https://github.com/ReproNim/reproman/pull/571/checks?check_run_id=1856173988


Update: Downgrading to ubuntu-16.04 for that job seems to sidestep the issue.

https://github.com/ReproNim/reproman/runs/1856702977


Update 2: Never mind, it reappeared on the next ubuntu-16.04 run, so it seems to be flaky.

https://github.com/ReproNim/reproman/pull/571/checks?check_run_id=1857298830

yarikoptic commented 3 years ago

Reporting: I keep pestering @mih to give the new condor a try somehow ;)

yarikoptic commented 3 years ago

I have just uploaded 8.8.6~dfsg.1-1 to /debian-devel of NeuroDebian for the releases where it built:

neurodebian@smaug ~/deb/builds/htcondor/8.8.6~dfsg.1-1 % grep -v OLD summary.build | grep OK
condor_8.8.6~dfsg.1-1~nd90+1_i386.build      OK  4:43.09 real, 278.53 user, 97.98 sys, 4980264 out
condor_8.8.6~dfsg.1-1~nd90+1_amd64.build     OK  5:01.60 real, 298.86 user, 130.28 sys, 5919456 out
condor_8.8.6~dfsg.1-1~nd100+1_i386.build     OK  6:51.86 real, 2159.03 user, 250.44 sys, 13854248 out
condor_8.8.6~dfsg.1-1~nd100+1_amd64.build    OK  6:58.10 real, 1974.24 user, 318.46 sys, 15727608 out
condor_8.8.6~dfsg.1-1~nd18.04+1_i386.build   OK  7:04.83 real, 2281.48 user, 247.67 sys, 11432416 out
condor_8.8.6~dfsg.1-1~nd18.04+1_amd64.build  OK  7:06.03 real, 2066.80 user, 321.39 sys, 12692800 out

Note that the 16.04 build failed, so we would need to run the condor job on 18.04, where the build seems to be OK.

$ sed -e 's,/debian ,/debian-devel ,g' < /etc/apt/sources.list.d/neurodebian.sources.list | grep -v data > /etc/apt/sources.list.d/neurodebian-devel.sources.list
$ apt-get update
$ apt-cache policy htcondor
htcondor:
  Installed: (none)
  Candidate: 8.8.6~dfsg.1-1~nd18.04+1
  Version table:
     8.8.6~dfsg.1-1~nd18.04+1 500
        500 http://neuro.debian.net/debian-devel bionic/main amd64 Packages
     8.6.8~dfsg.1-2 500
        500 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
kyleam commented 3 years ago

> I have just uploaded 8.8.6~dfsg.1-1

Thanks!

kyleam commented 3 years ago

Hmm, it's at ~40 minutes now, so it seems likely that something got stuck (perhaps reproman run waiting around because the newer version isn't giving the output it expects). Will try to upgrade locally and debug.

kyleam commented 3 years ago

> so it seems likely that something got stuck

Based on the log [1], it's stuck waiting on a held job:

2021-02-09T21:29:31.1422549Z 2021-02-09 21:29:31,141 [INFO] Waiting on job 2: held. Next heartbeat in 766 seconds

Debugging in an Ubuntu 18.04 VM:

I installed condor 8.8.6 into an Ubuntu 18.04 VM. Running from a new dataset in "/tmp/test", this simple job gets held:

```
reproman run -r local --follow --sub condor sh -c 'echo hey'
```

Here's the reason given by `condor_q --held`:

```
-- Schedd: kestrel : <127.0.0.1:9618?... @ 02/10/21 16:07:18
 ID      OWNER          HELD_SINCE  HOLD_REASON
   1.0   kyle            2/10 16:06 Error from slot1@kestrel: Failed to execute '/home/kyle/.reproman/run-root/20210210-160634-98b7/.reproman/jobs/local/20210210-160634-98b7/runscript' with arguments 0: Cannot access initial working directory "/tmp/test" (errno=2: 'No such file or directory')

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for kyle: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
```

So, from the job's point of view, /tmp/test doesn't exist. pytest would be operating from /tmp, so that's consistent with the CI getting stuck. The same `reproman run` command launched from ~/scratch/test instead of /tmp/test completes fine.

Here's how /tmp/test looks outside of the condor job:

```
drwxr-xr-x 5 kyle kyle 4096 Feb 10 16:04 /tmp/test
```

And here's what the condor job sees:

```sh
condor_version
condor_run 'ls -ld /tmp'
condor_run 'ls -l /tmp'
```

```
$CondorVersion: 8.8.6 Feb 09 2021 BuildID: Debian-8.8.6~dfsg.1-1~nd18.04+1 Debian-8.8.6~dfsg.1-1~nd18.04+1 $
$CondorPlatform: X86_64-Ubuntu_18.04 $
drwx------ 2 kyle kyle 4096 Feb 10 16:00 /tmp
total 0
```

Going back to the old, non-devel condor version (8.6.8), condor sees the regular /tmp directory, and /tmp/test _is_ there:

```
$CondorVersion: 8.6.8 Apr 06 2018 BuildID: Debian-8.6.8~dfsg.1-2 Debian-8.6.8~dfsg.1-2 $
$CondorPlatform: X86_64-Ubuntu_ $
drwxrwxrwt 16 root root 4096 Feb 10 16:02 /tmp
total 536
-rw------- 1 kyle kyle      0 Feb 10 13:55 config-err-PU1Ts6
drwx------ 2 kyle kyle   4096 Feb 10 13:55 ssh-7vProW84d1AH
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-bolt.service-HPDtLV
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-colord.service-UhRVFK
drwx------ 3 root root   4096 Feb 10 13:56 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-fwupd.service-Wtz62q
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-ModemManager.service-xwK4Y8
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-rtkit-daemon.service-sl9u5b
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-systemd-resolved.service-sqWSGj
drwx------ 3 root root   4096 Feb 10 13:55 systemd-private-b5c0ddacf32343c2b6c5ab1cc710c97c-systemd-timesyncd.service-JiTPUI
drwxr-xr-x 4 kyle kyle   4096 Feb 10 14:26 test
-rw------- 1 kyle kyle 511815 Feb 10 13:59 tmpaddon
```
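
Because the CI job sat for roughly 40 minutes waiting on a held job, a small check that recognizes this particular hold reason could make the failure visible sooner. The following is a minimal sketch, not part of reproman: the function name is made up for illustration, and the pattern assumes the exact "Cannot access initial working directory" wording from the `condor_q` output above, which other condor versions may phrase differently.

```sh
#!/bin/sh
# Hypothetical helper: read HTCondor hold reasons on stdin and print the
# working directory that the job could not access, one per matching line.
# Lines with a different hold reason produce no output.
missing_workdir_from_hold_reason() {
    sed -n 's/.*Cannot access initial working directory "\([^"]*\)".*/\1/p'
}
```

Fed the hold reason shown above, this prints `/tmp/test`. It could presumably be paired with something like `condor_q -constraint 'JobStatus == 5' -af HoldReason` to list hold reasons only, but that pairing is untested here.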

The solution likely involves tweaking condor's MOUNT_UNDER_SCRATCH.

[1] https://github.com/ReproNim/reproman/pull/571/checks?check_run_id=1866468976 (Sadly, but I think unrelatedly, those logs are filled up with many BlockingIOErrors.)
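
For the MOUNT_UNDER_SCRATCH idea, one way to try it is a config drop-in that clears the setting so jobs see the host's /tmp again. This is only a sketch under assumptions that need verifying: that the packaged configuration is what sets MOUNT_UNDER_SCRATCH, that an empty value in a later-read file is enough to override it, and that the drop-in directory is something like /etc/condor/config.d. The function and file names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: write an HTCondor config drop-in that clears
# MOUNT_UNDER_SCRATCH. Whether an empty value is the right override for
# this package is an assumption, not something confirmed in this thread.
write_shared_tmp_override() {
    config_dir=$1    # assumed drop-in location, e.g. /etc/condor/config.d
    printf '%s\n' \
        '# Let jobs see the host /tmp instead of a per-job private one.' \
        'MOUNT_UNDER_SCRATCH =' \
        > "$config_dir/99-shared-tmp"
}
```

`condor_config_val MOUNT_UNDER_SCRATCH` should show the value before and after, and the daemons would need a reconfigure or restart to pick up the change.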

yarikoptic commented 3 years ago

> [1] https://github.com/ReproNim/reproman/pull/571/checks?check_run_id=1866468976 (Sadly, but I think unrelatedly, those logs are filled up with many BlockingIOErrors.)

oh no -- is that "from" datalad?

kyleam commented 3 years ago

> oh no -- is that "from" datalad?

Yes, I think so (based on not spotting them in builds without datalad)

kyleam commented 3 years ago

First non-stuck job with newer condor was green (2090aba). I pushed another commit switching an unrelated py 3.6 job to 3.9, mostly to see whether the condor job turns up green again (because with the older version the core dump is flaky). Hopefully that doesn't surface a 3.9-specific error, though.

kyleam commented 3 years ago

There have been four non-stuck runs with condor 8.8, and none of them triggered the core dump. With condor 8.6, it wasn't reliably triggered, though there wasn't a streak of four runs without it. So, not conclusive, but I think there's a good chance using 8.8 sidesteps the issue.
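
To put a rough number on "not conclusive": if each run under 8.6 hit the core dump independently with some fixed probability p, then n clean runs in a row would happen by chance with probability (1 - p)^n. The thread doesn't give a failure rate, so the values of p below are purely illustrative, and the independence assumption is a simplification. The function name is hypothetical.

```sh
#!/bin/sh
# Probability of n consecutive clean runs when each run fails
# independently with probability p (p is a made-up input, not measured).
clean_streak_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}

clean_streak_probability 0.5 4     # prints 0.0625
clean_streak_probability 0.25 4    # prints 0.3164
```

So if the old failure rate were around one in two, four clean runs would be fairly unlikely to be luck; at one in four, they would still happen by chance about a third of the time.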

yarikoptic commented 3 years ago

And no immediate ideas about coverage ;-)