buildstream-migration / bst-staging

GNU Lesser General Public License v2.1
0 stars 0 forks source link

CAS server doesn't seem to automatically remove artifacts so it gets fill up #609

Closed Cynical-Optimist closed 4 years ago

Cynical-Optimist commented 4 years ago

See original issue on GitLab In GitLab by [Gitlab user @jjardon] on Aug 28, 2018, 08:19

What is the current bug behavior?

This is a continuation of #136:

Even with the new cache server implementation (CAS), the cache server gets fill up and It doesn't clean the artifacts automatically

Bugs in downstream projects:

What is the expected correct behavior?

The server should automatically remove artifacts to have enough space

Relevant logs and/or screenshots

https://gitlab.gnome.org/GNOME/gnome-build-meta/-/jobs/92168

From the journal on the server (not messages at all from bst-artifact-server before that):

Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: Exception calling application: Failed to hash object: [Errno 28] No space left on device
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: Traceback (most recent call last):
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 398, in add_object
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     out.write(chunk)
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/lib64/python3.6/tempfile.py", line 483, in func_wrapper
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     return func(*args, **kwargs)
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: OSError: [Errno 28] No space left on device
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: During handling of the above exception, another exception occurred:
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: Traceback (most recent call last):
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 411, in add_object
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     os.link(out.name, objpath)
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/lib64/python3.6/tempfile.py", line 501, in __exit__
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     result = self.file.__exit__(exc, value, tb)
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: OSError: [Errno 28] No space left on device
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: The above exception was the direct cause of the following exception:
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: Traceback (most recent call last):
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/local/lib64/python3.6/site-packages/grpc/_server.py", line 389, in _call_behavior
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     return behavior(argument, context), True
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/casserver.py", line 186, in Write
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     digest = self.cas.add_object(path=out.name)
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:   File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 418, in add_object
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]:     raise ArtifactError("Failed to hash object: {}".format(e)) from e
Aug 17 21:30:52 cache-server-fedora-s-2vcpu-2gb-ams3-01 bst-artifact-server[1067]: buildstream._exceptions.ArtifactError: Failed to hash object: [Errno 28] No space left on device

Other relevant information

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @coldtom] on Aug 28, 2018, 09:11

You don't seem to be using the cache option in build.conf, so the cache size defaults to infinity. See https://buildstream.gitlab.io/buildstream/using_config.html#default-configuration

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 28, 2018, 09:13

[Gitlab user @coldtom] thanks; I would say the defaults are incorrect then: It should be the maximum available storage, not infinity

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @Qinusty] on Aug 28, 2018, 16:29

[Gitlab user @coldtom], [Gitlab user @jjardon]: The cache: quota configuration within build.conf is for Buildstream locally. The cache server is configured separately.

I'd imagine we need some information regarding the actual CASServer maybe some logs. But perhaps [Gitlab user @juergbi] could be of more help in this area.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on Aug 29, 2018, 09:26

assigned to [Gitlab user @jjardon]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 09:28

Looks like the server is not configurable. It cleans up to leave 2 GB free.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @Qinusty] on Aug 29, 2018, 09:31

casserver.py makes calls to python logging during cache cleaning. I'm not certain that these logs are stored to file.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:19

I think the problem is that the prune method is not called.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:20

Something like that?

diff --git a/buildstream/_artifactcache/casserver.py b/buildstream/_artifactcache/casserver.py
index 0af65729..e1e2d9d0 100644
--- a/buildstream/_artifactcache/casserver.py
+++ b/buildstream/_artifactcache/casserver.py
[[Gitlab user @]](https://gitlab.com/)[[Gitlab user @]](https://gitlab.com/) -369,4 +369,6 [[Gitlab user @]](https://gitlab.com/)[[Gitlab user @]](https://gitlab.com/) def _clean_up_cache(cas, object_size):
     else:
         logging.info("No artifacts were removed from the cache.")

+    cas.prune()
+
     return removed_size
Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:24

Scrap that. I misread. defer_prune is false.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:21

changed the description

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:21

[Gitlab user @Qinusty] I have attached the only relevant logs I can find in the journal

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:22

unassigned [Gitlab user @jjardon]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:22

changed the description

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tristanvb] on Aug 29, 2018, 16:27

Ah these logs are interesting, unhandled OSError which leads to crash instead of handled OSError leading to immediate pruning of the cache.

I presume that this might be happening due to the incoming artifacts being larger than the headroom set aside for incoming artifacts.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:33

The cache server is relatively small (70GB), so we will hit this again in around 15 days; let us know if we have to activate some debug mode or something to help you with the issue

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Sep 1, 2018, 11:21

OK, it filled up earlier than I though: https://gitlab.gnome.org/GNOME/gnome-build-meta/-/jobs/94897

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tristanvb] on Sep 3, 2018, 09:45

marked this issue as related to #623

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tlater] on Sep 3, 2018, 10:25

[Gitlab user @tristanvb] is probably right, [Gitlab user @jennis] can probably make a good guess too.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @Qinusty] on Sep 13, 2018, 15:34

Bump, do we have an update on this? This issue is marked as urgent and has no current assignee.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 18, 2018, 11:04

mentioned in merge request website!88

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tlater] on Sep 18, 2018, 15:47

assigned to [Gitlab user @tlater]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tlater] on Sep 18, 2018, 15:53

I will write a proper update soon, for now know that I am somewhat time-constrained and that this issue might take me a while.

If anyone else wants to take on the issue, feel free to - I can assist in setting up a debugging environment, if need-be.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Sep 18, 2018, 22:34

changed the description

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Sep 22, 2018, 12:10

marked this issue as related to freedesktop-sdk/freedesktop-sdk#393

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Sep 22, 2018, 12:13

changed the description

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on Sep 24, 2018, 09:10

assigned to [Gitlab user @valentindavid]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on Sep 24, 2018, 09:10

unassigned [Gitlab user @tlater]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 24, 2018, 12:54

One big issue I have found is that we prune the cache before we set a ref. So when we are full, we just clean up everything we upload right away, then set a ref that has no object. I have not managed yet to reproduce the exact message. But I still get stuck with a full cache that just adds refs, but no data.

If we want to call the clean of cache during the upload of objects rather than by refs, then we need to make the LRU on objects (and then clean up refs when at least one object needed is gone).

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tlater] on Sep 25, 2018, 11:45

I've found this while building with two separate BuildStream instances, both configured to push to the same cache:

Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/grpc/_server.py", line 389, in _call_behavior
    return behavior(argument, context), True
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/casserver.py", line 196, in Write
    _clean_up_cache(self.cas, client_digest.size_bytes)
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/casserver.py", line 414, in _clean_up_cache
    removed_size += cas.remove(to_remove, defer_prune=False)
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 600, in remove
    pruned = self.prune()
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 622, in prune
    self._reachable_refs_dir(reachable, tree)
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
    self._reachable_refs_dir(reachable, dirnode.digest)
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
    self._reachable_refs_dir(reachable, dirnode.digest)
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
    self._reachable_refs_dir(reachable, dirnode.digest)
  [Previous line repeated 1 more times]
  File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 769, in _reachable_refs_dir
    with open(self.objpath(tree), rb) as f:
FileNotFoundError: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c1974024857433d84b569

This also shows up on the client end as such:

--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] START   freedesktop-sdk-bootstrap/linux-headers/3ea4ad7f-push.21899.log
--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] START   Pushing artifact 3ea4ad7f
--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] INFO    Pushing 3ea4ad7f -> https://localhost:1102
00:07][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] FAILURE Pushing artifact 3ea4ad7f
00:07][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] FAILURE Failed to push artifact [freedesktop-sdk-bootstrap/linux-headers/3ea4ad7fdeaa1e71e10755f2de8be7b7813973167dbfa57c852d372a4082, freedesktop-sdk-bootstrap/linux-headers/45b163793d88f0442e532310835a41bd5fea395f92b552135012b0bb91b44083_Rendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception calling application: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c197402433d84b569"
    debug_error_string = "{"created":"[[Gitlab user @1537871374]](https://gitlab.com/1537871374).097158235","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":10grpc_message":"Exception calling application: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c197402433d84b569","grpc_status":2}"

What happened is obviously that the server committed an artifact while it was removing one - the same issue we had with the client side when we did not consider concurrency.

Looking at cascache.py we don't seem to consider this sort of scenario at all, artifact deletion and addition aren't exclusive. We need to fix that, and perhaps re-consider the client side exception shown. Maybe errors on the remote cache side should only show up as warnings, because the cache is just a cache and should not affect local builds?

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Sep 25, 2018, 16:08

[Gitlab user @tlater] I think you have actually found a different issue, as the original one show up like this in the client side (from https://gitlab.com/freedesktop-sdk/freedesktop-sdk/issues/393):

        [00:00:01] FAILURE desktop/mesa.bst: Failed to push artifact ['freedesktop-sdk/desktop-mesa/e8693573eec1fb5e4b294de92fe0013a74abe601edce95a1610ba3ba0b8fd6e8', 'freedesktop-sdk/desktop-mesa/fc162ce547b84b9f29b4b2449eff059220c94dd176b42903c7e2c86c3c76bff9']: <_Rendezvous of RPC that terminated with:
            status = StatusCode.UNKNOWN
            details = "Exception calling application: [Errno 28] No space left on device"
            debug_error_string = "{"created":"[[Gitlab user @1537577376]](https://gitlab.com/1537577376).788592799","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Exception calling application: [Errno 28] No space left on device","grpc_status":2}"
        >

Do you mind to open a new issue until we are sure they are actually the same?

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25

mentioned in commit fa84d4b7a88a41060bbb8e8bf1e3827ba5275852

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25

mentioned in commit 1fa0412f2a920bc4e4afffec7d6eda5b15ae4e6b

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25

mentioned in commit 33411e7cb2dbc63b93f80c2d698ee11de5169292

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:27

mentioned in merge request !830

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28

mentioned in commit 8045e2324ec29f2530cd7eeb80fea81885a40768

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28

mentioned in commit 064774a196ffa5ee7ef99b0167e95d0732b207fb

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28

mentioned in commit 30de0d0b4dd6fb5b957b8f3a7157af98444a9489

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 18:06

I pushed !830 with my fixes.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42

mentioned in commit 0a332520539d34c8711c23757d4eaa994b5c0e57

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42

mentioned in commit e298af537899a5e43754209d4fe8db273c9148f5

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42

mentioned in commit 4096277263f3f7796de62240b224d002048b910f

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Oct 3, 2018, 16:29

[Gitlab user @adds68] has set it up on the fix from !830 on the Freedesktop SDK infra. We will wait and see what happens when it fills up.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Oct 3, 2018, 16:34

The branch deployed on Freedesktop SDK infra is https://gitlab.com/freedesktop-sdk/infrastructure/infrastructure/tree/adamjones/cas-server

It uses commit 2e5ca36fe478de6ce0b15602b1041add1560dea1 of BuildStream

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 9, 2018, 13:27

marked this issue as related to freedesktop-sdk/infrastructure/infrastructure#23

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 10, 2018, 12:53

[Gitlab user @valentindavid] fdsk have been tested that branch for a while and we have several jobs getting stuck:

I have checked the CAS server and I notice that:

This is making the jobs to stuck forever, instead failing, which I think is worse. So we are coming back to install bst-1.2.3 in the server for now

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 10, 2018, 16:38

mentioned in merge request freedesktop-sdk/freedesktop-sdk!582

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 11, 2018, 16:22

mentioned in commit freedesktop-sdk/infrastructure/infrastructure[Gitlab user @85569b6ac54a8b3192de3bb91ec0abd9686df1ee]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 11, 2018, 16:22

mentioned in merge request freedesktop-sdk/infrastructure/infrastructure!25

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Oct 22, 2018, 21:10

[Gitlab user @jjardon] Do you have logs from the server?

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Oct 22, 2018, 21:13

Found some logs:

Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: ERROR:root:Exception calling application: Method not implemented!
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: Traceback (most recent call last):
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]:   File "/usr/local/lib64/python3.6/site-packages/grpc/_server.py", line 385, in _call_behavior
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]:     return behavior(argument, context), True
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]:   File "/usr/local/lib/python3.6/site-packages/buildstream/_protos/build/bazel/remote/execution/v2/remote_execution_pb2_grpc.py", line 461, in BatchUpdateBlobs
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]:     raise NotImplementedError('Method not implemented!')
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: NotImplementedError: Method not implemented!