Open BuildStream-Migration-Bot opened 3 years ago
In GitLab by [Gitlab user @tristanvb] on Feb 16, 2017, 07:57
Agreed, this is a bit of a corner case and needs a bit of thought.
Currently what will happen is that if two git sources are fetched simultaneously and dont have a cache yet (i.e. initial clone is needed), then they will download into separate temporary directories and the first one to complete will "win".
This should not be fixed with flock(), it's just not worth the possibility of dealing with stale locks to cater to this corner case.
Probably one of the following needs to happen:
In GitLab by [Gitlab user @tristanvb] on Feb 16, 2017, 08:06
Actually scratch the above, another simpler and better solution now comes to mind.
Instead of cloning into a temp directory, we can reduce the race period significantly so that the initial clone would just be:
git init --bare
git remote add origin <url>
git fetch
nowWhat I expect this to achieve is that no time or bandwidth is lost, the second process calling git fetch
on the same repo will rely on git's own locking mechanisms, rendering the second fetch a noop.
In GitLab by [Gitlab user @palvarez89] on Feb 16, 2017, 09:16
I'd need to try this second option (relying on git fetch
) to see how it behaves. This one, though, relies on the filesystem for the state, but I think we can trust git
on this.
In GitLab by [Gitlab user @palvarez89] on Apr 8, 2017, 12:50
I've quickly investigated the "atomic git init
" and then multiple git fetch
, and it didn't work as you thought it would. The second git fetch
will do exactly the same thing as the already-running one, and it won't save any download time.
In GitLab by [Gitlab user @tristanvb] on Apr 11, 2017, 09:16
That's too bad, I did run into this today and it was indeed a bit painful, but this really is a corner case too.
That said, it would still be interesting to handle this better, but I dont think this should be an area of focus for the time being :)
In GitLab by [Gitlab user @franred] on Jun 26, 2018, 16:45
mentioned in commit baserock/definitions[Gitlab user @8e8a7a29dd0d5efd4e8f23f03ce5377e09c75942]
In GitLab by [Gitlab user @franred] on Jun 27, 2018, 13:28
mentioned in commit baserock/definitions[Gitlab user @3e4fe53804019f3918b28a20e10f447e05ff3189]
In GitLab by [Gitlab user @franred] on Jun 28, 2018, 12:50
mentioned in commit baserock/definitions[Gitlab user @8450b79c705d25af217702164ef11fd6c5974b3c]
In GitLab by [Gitlab user @richardmaw-codethink] on Aug 9, 2018, 17:10
This can also be triggered by fetches that include submodules of the same repository.
This is common with Google projects including googletest, and GNU projects including gnulib, so it can happen more likely than you might think.
In GitLab by [Gitlab user @gokcennurlu] on Jul 10, 2019, 16:59
I think potential race conditions for Source.fetch()
still exists. Description in the ticket focused on git
plugin and mentioned the gain in time spent, but this is also very important for plugins that doesn't necessarily have builtin locking mechanisms. It also looks like buildstream can't provide utils for such locking until we make separately started bst processes to have a consensus mechanism. Maybe this ticket is now a wontfix?
If this is correct, should we close it and document this corner case (limitations of source plugins / guarantees provided by buildstream) together with:
- It must be relatively rare that more than one element requires the same source
- Once you have downloaded it once, it should stay cached and pretty much never be fetched twice again
?
If agreed, I suggest https://buildstream.gitlab.io/buildstream/buildstream.source.html#buildstream.source.Source.fetch to add the documentation.
In GitLab by [Gitlab user @tristanvb] on Jul 10, 2019, 20:59
It also looks like buildstream can't provide utils for such locking until we make separately started bst processes to have a consensus mechanism.
It is currently the source plugin's responsibility to either rely on host tooling which supports parallelism or to use atomic operations and temporary subdirectories to ensure there are no clashes.
I think that this should definitely be documented in Source.fetch()
documentation yes, I think we may have some utilities for atomic file writing (and maybe directory swapping ?) which such docs could link to also ?
If this is correct, should we close it and document this corner case...
I'd much prefer to keep it open, even if this bug never gets fixed, it is of value to keep open in the issue database (a bug being open should not be perceived as a negative thing).
I would be happy to tag it as "minor" or such to reflect a low severity, but I rather favor keeping as many issues open as possible, to have a richer bug database of improvements which can potentially be made, no matter how minor they are.
Separately from the above, I would also note that while this appears to be a corner case, it is a particularly annoying one because:
What this adds up to, is that when you start building a minimal full OS from scratch with no sources cached the first time around, you will usually run into a dual fetch of the linux kernel.
This would be a lot less painful of course with #261 resolved (use of shallow git clones).
In GitLab by [Gitlab user @nanonyme] on Oct 7, 2020, 19:18
Regarding this being corner-case: this is actually so severe issue that freedesktop-sdk ended up mirroring basically everything to avoid getting throttled by its sources. Note, this is just about bst1. With bst 2 remote source caching this is indeed a corner case.
I still think this would make sense to resolve but not on the level of lockable parallel fetches. Instead, BuildStream should deduplicate identical sources to fetch only once. I don't consider this a bug report but a feature request in the context of BuildStream 2.
@gtristan it's not common but also not unheard of to define shared yml file for sources and stage that into say five elements each of which build parts of the solution. Eg with gstreamer this is very useful. It would be nice if BuildStream was smart enough to fetch the sources only once.
See original issue on GitLab In GitLab by [Gitlab user @palvarez89] on Feb 14, 2017, 16:57
I've noticed when building https://gitlab.com/tristanvb/buildstream-tests/commits/build-essential examples that
linux.git
was being fetched twice at the same time. Making BuildStream handle this case will save some download time