Closed Cynical-Optimist closed 4 years ago
In GitLab by [Gitlab user @coldtom] on Aug 28, 2018, 09:11
You don't seem to be using the cache
option in build.conf
, so the cache size defaults to infinity. See https://buildstream.gitlab.io/buildstream/using_config.html#default-configuration
In GitLab by [Gitlab user @jjardon] on Aug 28, 2018, 09:13
[Gitlab user @coldtom] thanks; I would say the defaults are incorrect then: It should be the maximum available storage, not infinity
In GitLab by [Gitlab user @Qinusty] on Aug 28, 2018, 16:29
[Gitlab user @coldtom], [Gitlab user @jjardon]: The cache: quota
configuration within build.conf
is for Buildstream locally. The cache server is configured separately.
I'd imagine we need some information regarding the actual CASServer maybe some logs. But perhaps [Gitlab user @juergbi] could be of more help in this area.
In GitLab by [Gitlab user @toscalix] on Aug 29, 2018, 09:26
assigned to [Gitlab user @jjardon]
In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 09:28
Looks like the server is not configurable. It cleans up to leave 2 GB free.
In GitLab by [Gitlab user @Qinusty] on Aug 29, 2018, 09:31
casserver.py makes calls to python logging
during cache cleaning. I'm not certain that these logs are stored to file.
In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:19
I think the problem is that the prune
method is not called.
In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:20
Something like that?
diff --git a/buildstream/_artifactcache/casserver.py b/buildstream/_artifactcache/casserver.py
index 0af65729..e1e2d9d0 100644
--- a/buildstream/_artifactcache/casserver.py
+++ b/buildstream/_artifactcache/casserver.py
[[Gitlab user @]](https://gitlab.com/)[[Gitlab user @]](https://gitlab.com/) -369,4 +369,6 [[Gitlab user @]](https://gitlab.com/)[[Gitlab user @]](https://gitlab.com/) def _clean_up_cache(cas, object_size):
else:
logging.info("No artifacts were removed from the cache.")
+ cas.prune()
+
return removed_size
In GitLab by [Gitlab user @valentindavid] on Aug 29, 2018, 10:24
Scrap that. I misread. defer_prune
is false.
In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:21
changed the description
In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:21
[Gitlab user @Qinusty] I have attached the only relevant logs I can find in the journal
In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:22
unassigned [Gitlab user @jjardon]
In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:22
changed the description
In GitLab by [Gitlab user @tristanvb] on Aug 29, 2018, 16:27
Ah these logs are interesting, unhandled OSError
which leads to crash instead of handled OSError
leading to immediate pruning of the cache.
I presume that this might be happening due to the incoming artifacts being larger than the headroom set aside for incoming artifacts.
In GitLab by [Gitlab user @jjardon] on Aug 29, 2018, 16:33
The cache server is relatively small (70GB), so we will hit this again in around 15 days; let us know if we have to activate some debug mode or something to help you with the issue
In GitLab by [Gitlab user @jjardon] on Sep 1, 2018, 11:21
OK, it filled up earlier than I though: https://gitlab.gnome.org/GNOME/gnome-build-meta/-/jobs/94897
In GitLab by [Gitlab user @tristanvb] on Sep 3, 2018, 09:45
marked this issue as related to #623
In GitLab by [Gitlab user @tlater] on Sep 3, 2018, 10:25
[Gitlab user @tristanvb] is probably right, [Gitlab user @jennis] can probably make a good guess too.
In GitLab by [Gitlab user @Qinusty] on Sep 13, 2018, 15:34
Bump, do we have an update on this? This issue is marked as urgent and has no current assignee.
In GitLab by [Gitlab user @valentindavid] on Sep 18, 2018, 11:04
mentioned in merge request website!88
In GitLab by [Gitlab user @tlater] on Sep 18, 2018, 15:47
assigned to [Gitlab user @tlater]
In GitLab by [Gitlab user @tlater] on Sep 18, 2018, 15:53
I will write a proper update soon, for now know that I am somewhat time-constrained and that this issue might take me a while.
If anyone else wants to take on the issue, feel free to - I can assist in setting up a debugging environment, if need-be.
In GitLab by [Gitlab user @jjardon] on Sep 18, 2018, 22:34
changed the description
In GitLab by [Gitlab user @jjardon] on Sep 22, 2018, 12:10
marked this issue as related to freedesktop-sdk/freedesktop-sdk#393
In GitLab by [Gitlab user @jjardon] on Sep 22, 2018, 12:13
changed the description
In GitLab by [Gitlab user @toscalix] on Sep 24, 2018, 09:10
assigned to [Gitlab user @valentindavid]
In GitLab by [Gitlab user @toscalix] on Sep 24, 2018, 09:10
unassigned [Gitlab user @tlater]
In GitLab by [Gitlab user @valentindavid] on Sep 24, 2018, 12:54
One big issue I have found is that we prune the cache before we set a ref. So when we are full, we just clean up everything we upload right away, then set a ref that has no object. I have not managed yet to reproduce the exact message. But I still get stuck with a full cache that just adds refs, but no data.
If we want to call the clean of cache during the upload of objects rather than by refs, then we need to make the LRU on objects (and then clean up refs when at least one object needed is gone).
In GitLab by [Gitlab user @tlater] on Sep 25, 2018, 11:45
I've found this while building with two separate BuildStream instances, both configured to push to the same cache:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/grpc/_server.py", line 389, in _call_behavior
return behavior(argument, context), True
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/casserver.py", line 196, in Write
_clean_up_cache(self.cas, client_digest.size_bytes)
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/casserver.py", line 414, in _clean_up_cache
removed_size += cas.remove(to_remove, defer_prune=False)
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 600, in remove
pruned = self.prune()
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 622, in prune
self._reachable_refs_dir(reachable, tree)
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
self._reachable_refs_dir(reachable, dirnode.digest)
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
self._reachable_refs_dir(reachable, dirnode.digest)
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 776, in _reachable_refs_dir
self._reachable_refs_dir(reachable, dirnode.digest)
[Previous line repeated 1 more times]
File "/usr/local/lib/python3.6/site-packages/buildstream/_artifactcache/cascache.py", line 769, in _reachable_refs_dir
with open(self.objpath(tree), rb) as f:
FileNotFoundError: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c1974024857433d84b569
This also shows up on the client end as such:
--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] START freedesktop-sdk-bootstrap/linux-headers/3ea4ad7f-push.21899.log
--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] START Pushing artifact 3ea4ad7f
--:--][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] INFO Pushing 3ea4ad7f -> https://localhost:1102
00:07][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] FAILURE Pushing artifact 3ea4ad7f
00:07][3ea4ad7f][ push:bootstrap-junction.bst:linux-headers.bst] FAILURE Failed to push artifact [freedesktop-sdk-bootstrap/linux-headers/3ea4ad7fdeaa1e71e10755f2de8be7b7813973167dbfa57c852d372a4082, freedesktop-sdk-bootstrap/linux-headers/45b163793d88f0442e532310835a41bd5fea395f92b552135012b0bb91b44083_Rendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c197402433d84b569"
debug_error_string = "{"created":"[[Gitlab user @1537871374]](https://gitlab.com/1537871374).097158235","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":10grpc_message":"Exception calling application: [Errno 2] No such file or directory: /data/cas/objects/a1/1cb0b10179740c3ebf7c875d08a672e60aaa01393c197402433d84b569","grpc_status":2}"
What happened is obviously that the server committed an artifact while it was removing one - the same issue we had with the client side when we did not consider concurrency.
Looking at cascache.py
we don't seem to consider this sort of scenario at all, artifact deletion and addition aren't exclusive. We need to fix that, and perhaps re-consider the client side exception shown. Maybe errors on the remote cache side should only show up as warnings, because the cache is just a cache and should not affect local builds?
In GitLab by [Gitlab user @jjardon] on Sep 25, 2018, 16:08
[Gitlab user @tlater] I think you have actually found a different issue, as the original one show up like this in the client side (from https://gitlab.com/freedesktop-sdk/freedesktop-sdk/issues/393):
[00:00:01] FAILURE desktop/mesa.bst: Failed to push artifact ['freedesktop-sdk/desktop-mesa/e8693573eec1fb5e4b294de92fe0013a74abe601edce95a1610ba3ba0b8fd6e8', 'freedesktop-sdk/desktop-mesa/fc162ce547b84b9f29b4b2449eff059220c94dd176b42903c7e2c86c3c76bff9']: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: [Errno 28] No space left on device"
debug_error_string = "{"created":"[[Gitlab user @1537577376]](https://gitlab.com/1537577376).788592799","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Exception calling application: [Errno 28] No space left on device","grpc_status":2}"
>
Do you mind to open a new issue until we are sure they are actually the same?
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25
mentioned in commit fa84d4b7a88a41060bbb8e8bf1e3827ba5275852
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25
mentioned in commit 1fa0412f2a920bc4e4afffec7d6eda5b15ae4e6b
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:25
mentioned in commit 33411e7cb2dbc63b93f80c2d698ee11de5169292
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:27
mentioned in merge request !830
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28
mentioned in commit 8045e2324ec29f2530cd7eeb80fea81885a40768
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28
mentioned in commit 064774a196ffa5ee7ef99b0167e95d0732b207fb
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 17:28
mentioned in commit 30de0d0b4dd6fb5b957b8f3a7157af98444a9489
In GitLab by [Gitlab user @valentindavid] on Sep 26, 2018, 18:06
I pushed !830 with my fixes.
In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42
mentioned in commit 0a332520539d34c8711c23757d4eaa994b5c0e57
In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42
mentioned in commit e298af537899a5e43754209d4fe8db273c9148f5
In GitLab by [Gitlab user @valentindavid] on Sep 28, 2018, 15:42
mentioned in commit 4096277263f3f7796de62240b224d002048b910f
In GitLab by [Gitlab user @valentindavid] on Oct 3, 2018, 16:29
[Gitlab user @adds68] has set it up on the fix from !830 on the Freedesktop SDK infra. We will wait and see what happens when it fills up.
In GitLab by [Gitlab user @valentindavid] on Oct 3, 2018, 16:34
The branch deployed on Freedesktop SDK infra is https://gitlab.com/freedesktop-sdk/infrastructure/infrastructure/tree/adamjones/cas-server
It uses commit 2e5ca36fe478de6ce0b15602b1041add1560dea1 of BuildStream
In GitLab by [Gitlab user @jjardon] on Oct 9, 2018, 13:27
marked this issue as related to freedesktop-sdk/infrastructure/infrastructure#23
In GitLab by [Gitlab user @jjardon] on Oct 10, 2018, 12:53
[Gitlab user @valentindavid] fdsk have been tested that branch for a while and we have several jobs getting stuck:
I have checked the CAS server and I notice that:
This is making the jobs to stuck forever, instead failing, which I think is worse. So we are coming back to install bst-1.2.3 in the server for now
In GitLab by [Gitlab user @jjardon] on Oct 10, 2018, 16:38
mentioned in merge request freedesktop-sdk/freedesktop-sdk!582
In GitLab by [Gitlab user @jjardon] on Oct 11, 2018, 16:22
mentioned in commit freedesktop-sdk/infrastructure/infrastructure[Gitlab user @85569b6ac54a8b3192de3bb91ec0abd9686df1ee]
In GitLab by [Gitlab user @jjardon] on Oct 11, 2018, 16:22
mentioned in merge request freedesktop-sdk/infrastructure/infrastructure!25
In GitLab by [Gitlab user @valentindavid] on Oct 22, 2018, 21:10
[Gitlab user @jjardon] Do you have logs from the server?
In GitLab by [Gitlab user @valentindavid] on Oct 22, 2018, 21:13
Found some logs:
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: ERROR:root:Exception calling application: Method not implemented!
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: Traceback (most recent call last):
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: File "/usr/local/lib64/python3.6/site-packages/grpc/_server.py", line 385, in _call_behavior
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: return behavior(argument, context), True
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: File "/usr/local/lib/python3.6/site-packages/buildstream/_protos/build/bazel/remote/execution/v2/remote_execution_pb2_grpc.py", line 461, in BatchUpdateBlobs
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: raise NotImplementedError('Method not implemented!')
Oct 09 04:24:16 CAS-server bst-artifact-server[17482]: NotImplementedError: Method not implemented!
See original issue on GitLab In GitLab by [Gitlab user @jjardon] on Aug 28, 2018, 08:19
What is the current bug behavior?
This is a continuation of #136:
Even with the new cache server implementation (CAS), the cache server gets fill up and It doesn't clean the artifacts automatically
Bugs in downstream projects:
What is the expected correct behavior?
The server should automatically remove artifacts to have enough space
Relevant logs and/or screenshots
https://gitlab.gnome.org/GNOME/gnome-build-meta/-/jobs/92168
From the journal on the server (not messages at all from bst-artifact-server before that):
Other relevant information