GoogleCloudPlatform / artifact-registry-apt-transport

Apache License 2.0

Missing timeouts result in apt remaining forever locked #23

Open chrisboulton opened 4 months ago

chrisboulton commented 4 months ago

From time to time, we notice that apt updates never complete and stay locked forever, presumably because of a missing timeout combined with some kind of underlying network issue. Each time we've looked at this, the AR transport binary (ar+https) was still running, which makes me think the missing timeout is somewhere within it.

$ apt-get update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/
$ ps aux | grep -i apt
root     18760  0.0  0.0  37564  7040 ?        S    Mar31   6:06 /usr/bin/apt-get update
_apt     18768  0.0  0.1  45420  9180 ?        S    Mar31   0:00 /usr/lib/apt/methods/https
_apt     18769  0.0  0.1  45420  9108 ?        S    Mar31   0:00 /usr/lib/apt/methods/https
root     18770  0.0  0.1 108624 10468 ?        Sl   Mar31   0:48 /usr/lib/apt/methods/ar+https
_apt     18774  0.0  0.0  42388  6624 ?        S    Mar31   0:00 /usr/lib/apt/methods/http
_apt     18775  0.0  0.0  42396  6596 ?        S    Mar31   0:00 /usr/lib/apt/methods/http
_apt     18780  0.0  0.0  36412  5680 ?        S    Mar31   0:00 /usr/lib/apt/methods/gpgv

$ pstree -ap 18760
apt-get,18760 update
  ├─ar+https,18770
  │   ├─{ar+https},18771
  │   ├─{ar+https},18772
  │   ├─{ar+https},18773
  │   ├─{ar+https},18776
  │   ├─{ar+https},18777
  │   └─{ar+https},18778
  ├─gpgv,18780
  ├─http,18774
  ├─http,18775
  ├─https,18768
  └─https,18769

Unfortunately, we don't have any logs or anything else available, as we mostly notice this when it's triggered via OSConfigAgent, which only seems to collect the resulting apt "exited uncleanly" error when you eventually kill ar+https.

I haven't looked through the code to see whether there are missing timeouts or missing context propagation. It might also be worth adding a retry mechanism.