conan-io / conan

Conan - The open-source C and C++ package manager
https://conan.io

[feature] In build-order.json, provide build_args even when binary="Download" #16798

Open · nextsilicon-itay-bookstein opened this issue 1 month ago

nextsilicon-itay-bookstein commented 1 month ago

What is your suggestion?

Hi! I wrote some automation around conan graph build-order. I found myself needing special handling for the case where binary="Download" (as opposed to "Build"). I would like to handle that case and have the package installed in response to that node in the build order, but for that to work properly I find I have to construct build_args myself. In particular, I need to encode all the options, because otherwise the package_id wouldn't match (e.g. fPIC). I think it would be useful to encode build_args in this case as well, unless you have a better practice to suggest.

Thanks, Itay


nextsilicon-itay-bookstein commented 1 month ago

For the record, I ended up doing something along these lines:

from typing import Any


def _build_args_for_download(ref: str, package: Any) -> str:
    """
    Adapted from conans.client.graph.install_graph._InstallPackageReference._build_args.
    """
    assert package["binary"] == "Download"

    context: str = package["context"]
    cmd = f"--requires={ref}" if context == "host" else f"--tool-requires={ref}"
    cmd += " --build=never"

    options: list[str] = package["options"]
    if options:
        cmd += " " + " ".join(f"-o {o}" for o in options)

    overrides: dict[str, list[str]] = package["overrides"]
    if overrides:
        cmd += f' --lockfile-overrides="{overrides}"'
    return cmd

This is basically the same as the "Build" case, except that I changed it to use --build=never.

memsharded commented 1 month ago

Hi @nextsilicon-itay-bookstein

Thanks for your suggestion.

I am not sure I understood the flow. What operation are you doing on the packages that are marked with Download? Are you executing a conan download? Or a conan install ...? For what purpose? I'd like to understand the root cause a bit better; it feels a bit incorrect to have build_args for something that is not being built. In any case, shouldn't it be install_args?

nextsilicon-itay-bookstein commented 1 month ago

Sure, I'll elaborate. I have a workflow where I'm configuring both conancenter and myremote as remotes, with the latter starting out empty or partial. I then run conan graph build-order, which downloads all recipes from conancenter as a side-effect. Then I iterate over its output to build the packages in parallel. Some packages are not Build, but are only Download (those I have already built, or header-only libraries, or whatever else). So, when constructing the arguments for the conan install command for each package, I do the following:

async def install_recipe_package(
    self, params: ConanParams, ref: str, recipe_dir: Path, package: Any
):
    binary: str = package["binary"]
    match binary:
        case "Build":
            build_args: str = package["build_args"]

        case "Download":
            build_args = self._build_args_for_download(ref, package)

        case "Cache":
            return

        case _:
            raise ValueError(f"Unknown binary type: {binary}")

    assert build_args, "build_args missing"

    pkg_id: str = package["package_id"]
    pkg_dir = recipe_dir / pkg_id
    pkg_dir.mkdir()

    icmd = [
        "conan",
        "install",
        *params.args,
        "--format=json",
        *shlex.split(build_args),
        f"--output-folder={str(pkg_dir)}",
    ]

Where _build_args_for_download is as above. This is required as otherwise the package_id won't match.

memsharded commented 1 month ago

Then I iterate over its output to build the packages in parallel. Some packages are not Build, but are only Download (those I have already built, or header-only libraries, or whatever else). So, when constructing the arguments for the conan install command for each package, I do the following:

I see, you are building those packages in parallel on the same machine, using the same Conan cache, and then you were being hit by the fact that the Conan cache is not concurrent-safe at the moment, is this the issue? Because if the cache were concurrent, or you were using blank caches for every job, then the download would be automatic and safe, with no need for such an extra manual conan install to download that binary without building it.

If that is the case, I understand now your approach. We'd generally recommend having independent blank caches for parallel builds in CI, but if you can control the way things are downloaded, then at least I understand the intention.

Maybe there are some other more effective approaches? As the graph build-order contains the necessary information, it would be relatively easy to assemble a "Package List" with all the references and then conan download --list=mypkglist.json before launching the concurrent builds. I think with a conan graph info --format=json > mygraph.json plus conan list --graph=mygraph.json --graph-binaries=Download, it should be possible to do it in a couple of steps.
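
Roughly something like this (an untested sketch; the conanfile path and the remote name are just placeholders for whatever your pipeline already uses):

    conan graph info . --format=json > mygraph.json
    conan list --graph=mygraph.json --graph-binaries=Download --format=json > mypkglist.json
    conan download --list=mypkglist.json -r=conancenter

After that, all the "Download" binaries are already in the cache before any build starts.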

nextsilicon-itay-bookstein commented 1 month ago

Wait, I'm not 100% sure I follow you. I do use build-order to parallelize, but I parallelize both builds and downloads (i.e. I parallelize installs). And I do use a shared cache, so I only parallelize inside a single "level" within the build-order - the packages for each recipe are built serially, but all recipes in a single "level" are processed in parallel. In code this is:

async def install_recipe_packages(
    self, sem: asyncio.Semaphore, g: ConanGraphParams, recipe_ref: Any
):
    async with sem:
        rref = recipe_ref["ref"]
        recipe_dir = self._tmpdir / rref.replace("/", "_")
        recipe_dir.mkdir(exist_ok=True)

        # We forgo parallelizing across levels per recipe; we don't expect
        # to be building many different packages for each RREV (mostly 
        # one for the host context and sometimes an additional one for
        # the build context).
        for level in recipe_ref["packages"]:
            for package in level:
                await self.install_recipe_package(g, rref, recipe_dir, package)

async def build_level_in_parallel_async(
    self, g: ConanGraphParams, level: list[Any]
):
    pjobs = self._package_jobs if self._package_jobs else _DEFAULT_PACKAGE_JOBS
    sem = asyncio.Semaphore(pjobs)
    tasks = [
        self.install_recipe_packages(sem, g, recipe_ref) for recipe_ref in level
    ]
    await asyncio.gather(*tasks)

def build_level_in_parallel(self, g: ConanGraphParams, level: list[Any]):
    asyncio.run(self.build_level_in_parallel_async(g, level))

def build_graph_in_parallel(self, g: ConanGraphParams):
    lock_args = (
        ["--update=", "--lockfile=", f"--lockfile-out={str(g.lockfile)}"]
        if self._update
        else [f"--lockfile={str(g.lockfile)}"]
    )
    build_order_cmd = [
        "conan",
        "graph",
        "build-order",
        "--format=json",
        "--order-by=recipe",
        f"--build={self._build}",
        *lock_args,
        *g.args,
        str(g.conanfile),
    ]
    build_order_file = self._tmpdir / f"{g.lockfile.stem}_build_order.json"
    with build_order_file.open("wb") as bof:
        _print_cmd(
            cmd=build_order_cmd,
            env_modifications=g.env_modifications,
            output_path=build_order_file,
        )
        subprocess.check_call(build_order_cmd, env=g.env, stdout=bof)

    with build_order_file.open("rb") as bof:
        build_order = json.load(bof)

    # The json is already subdivided into "levels". We can forgo understanding
    # the dependencies and parallelize according to the levels at the cost of
    # potentially lower utilization.
    for level in build_order["order"]:
        self.build_level_in_parallel(g, level)

The end goal is that at the end of running this code, the Local Cache is fully populated with everything we require, even if everything was tagged "Download" and not "Build", and even if some header-only packages were downloaded from conancenter and not our local repository. At this point, building our consuming project should be possible completely within the cache.

Optionally, afterwards, the packages in the Local Cache are uploaded to our local repository, whether they were built, downloaded from conancenter, or downloaded from the local repository itself (in which case the upload will be elided). That is done via conan graph info + conan list --graph, to create a pkglist that corresponds to the build-order. This ensures that our local repository always has all packages required for the build, even if they were originally downloaded from conancenter.
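
Concretely, the upload step is roughly this (a sketch of what I do, with placeholder file names; myremote is our local repository):

    conan graph info . --format=json > graph.json
    conan list --graph=graph.json --format=json > pkglist.json
    conan upload --list=pkglist.json -r=myremote --confirm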

We'll eventually drop conancenter from this flow using the recommended approach of forking conan-center-index and having CI upload all recipes from that fork to our local repository, but until then, this approximates that quite well.

I gather from what you're saying that the extra logic of synthesizing conan install args for the "Download" cases could instead be done via a single large conan download on the above pkglist, which would then allow me to pass --reduce to conan graph build-order.

Reading your reply, am I to understand that there's some conflict for downloads vs builds in terms of parallelization within the same cache? Is that true even when I restrict the parallelism as I described above? Just to clarify: I didn't write the convert-download-to-conan-install logic because of parallelism problems; I made the handling of each recipe serial to solve that, only parallelizing between recipes. That also made me use --order-by=recipe instead of --order-by=configuration (although I don't think I fully understand the differences between the two, and the documentation didn't explain them).

nextsilicon-itay-bookstein commented 3 weeks ago

Also, I notice I'm spuriously getting an sqlite3 "Database is locked" error, with low probability, with this implementation strategy. Is there something I should do to avoid it, or is the only solution to break everything out to separate caches?

nextsilicon-itay-bookstein commented 3 weeks ago

I will add that I ended up throwing away the per-package download logic. I ended up using conan graph info + conan list + conan download for all remotes that aren't the remote that I intend to upload to.

memsharded commented 3 weeks ago

The end goal is that at the end of running this code, the Local Cache is fully populated with everything we require, even if everything was tagged "Download" and not "Build", and even if some header-only packages were downloaded from conancenter and not our local repository. At this point, building our consuming project should be possible completely within the cache.

Take into account that for many production environments, using ConanCenter directly in prod is not very recommended, see https://docs.conan.io/2/devops/using_conancenter.html

We'll eventually drop conancenter from this flow using the recommended approach of forking conan-center-index and having CI upload all recipes from that fork to our local repository, but until then, this approximates that quite well.

Ok, that is what I meant, good to know.

Reading your reply, am I to understand that there's some conflict for downloads vs builds in terms of parallelization within the same cache? Is that true even when I restrict the parallelism as I described above?

Not a conflict between downloads and builds, but the builds need the dependencies downloaded first. Downloading the same package in parallel can cause race conditions, so building independent packages in parallel can still produce race conditions if they need to download the same dependency. This is why I was thinking about removing the possibility of a race condition by downloading everything that needs to be downloaded first, then starting the builds, without interleaving downloads and builds.

Just to clarify: I didn't write the convert-download-to-conan-install logic because of parallelism problems; I made the handling of each recipe serial to solve that, only parallelizing between recipes. That also made me use --order-by=recipe instead of --order-by=configuration (although I don't think I fully understand the differences between the two, and the documentation didn't explain them).

Order by recipe is designed to build every different binary configuration of the same package before proceeding to the next level. That is, if openssl depends on zlib, and our build covers Windows, Linux and Mac, the recipe order builds the Windows, Linux and Mac binaries of zlib first, and only after all of them are done does it proceed to build openssl.

The order by configuration is designed to build different configurations in parallel. In this openssl -> zlib example, it would be possible that the Linux builds of both zlib and openssl finished before the build of zlib on Windows even started. If the builds are generally successful, this approach can make better use of resources, as it has fewer "waiting for the other configuration to finish" bottlenecks. If the builds fail too often in one configuration but not the others, it could waste more resources.
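
To make it more concrete, a rough sketch of how the two orderings group this openssl -> zlib example (illustrative only; the field names follow the snippets above, this is not the exact build-order schema, and the versions are made up):

    # --order-by=recipe: one level per recipe; every zlib configuration
    # must finish before any openssl build starts.
    order_by_recipe = [
        [  # level 0
            {"ref": "zlib/1.3.1", "packages": [[
                {"binary": "Build"},  # Windows
                {"binary": "Build"},  # Linux
                {"binary": "Build"},  # Mac
            ]]},
        ],
        [  # level 1
            {"ref": "openssl/3.2.2", "packages": [[
                {"binary": "Build"},  # Windows
                {"binary": "Build"},  # Linux
                {"binary": "Build"},  # Mac
            ]]},
        ],
    ]

    # --order-by=configuration: each configuration advances independently,
    # so zlib(Linux) -> openssl(Linux) can finish before zlib(Windows) starts.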

memsharded commented 3 weeks ago

Also, I notice I'm spuriously getting an sqlite3 "Database is locked" error, with low probability, with this implementation strategy. Is there something I should do to avoid it, or is the only solution to break everything out to separate caches?

We increased the sqlite3 timeout limits to something very high, and we didn't expect this to keep happening; it might be that the concurrency on the sqlite3 database is too high. You might try to patch Conan and increase the timeouts in:

    def db_connection(self):
        assert self._lock.acquire(timeout=10), "Conan failed to acquire database lock"
        connection = sqlite3.connect(self.filename, isolation_level=None, timeout=10)

From 10 seconds to something higher. But I think that if the DB is already blocked for 10 seconds by an operation, increasing the limits will only make the traffic jams larger and maybe not solve it. We still need to look into possible optimization patterns in the code accessing the DB, but I don't know if there is much optimization room there.

memsharded commented 3 weeks ago

I will add that I ended up throwing away the per-package download logic. I ended up using conan graph info + conan list + conan download for all remotes that aren't the remote that I intend to upload to.

That sounds like what I was suggesting. Is this approach working well, then?

nextsilicon-itay-bookstein commented 3 weeks ago

That sounds like what I was suggesting. Is this approach working well, then?

Yup, seems that way, thanks. It was a bit cumbersome to regenerate the graph+list after these downloads, but now that I did it, it seems to be working fine.

We increased the sqlite3 timeout limits to something very high, and we didn't expect this to keep happening; it might be that the concurrency on the sqlite3 database is too high. You might try to patch Conan and increase the timeouts in:

    def db_connection(self):
        assert self._lock.acquire(timeout=10), "Conan failed to acquire database lock"
        connection = sqlite3.connect(self.filename, isolation_level=None, timeout=10)

From 10 seconds to something higher. But I think that if the DB is already blocked for 10 seconds by an operation, increasing the limits will only make the traffic jams larger and maybe not solve it. We still need to look into possible optimization patterns in the code accessing the DB, but I don't know if there is much optimization room there.

Odd, we are still encountering that with low probability on contended runs.

Not a conflict between downloads and builds, but the builds need the dependencies downloaded first. Downloading the same package in parallel can cause race conditions, so building independent packages in parallel can still produce race conditions if they need to download the same dependency. This is why I was thinking about removing the possibility of a race condition by downloading everything that needs to be downloaded first, then starting the builds, without interleaving downloads and builds.

This sounds reminiscent of errors I encountered when using --order-by=configuration. Running each recipe serially may have hedged against that, but probably only coincidentally -- the same recipe in multiple configurations is highly likely to require the same dependency in those configurations, but nothing guarantees that different recipes won't still require the same dependency.

So, to summarize, it seems to me that the recommended solution for both issues (download conflicts and sqlite3 timeouts), until parallelization is implemented first-class by Conan, is to break each build out to a separate cache? Still, I'm not sure how to populate each of these separate caches with the non-downloaded dependencies that were locally built just beforehand? Do I have to do the following?

Is there something better to do here with respect to disk IO load, CPU compression load, and overall simplicity of implementation?

nextsilicon-itay-bookstein commented 3 weeks ago

It occurs to me that conan graph + conan list + the filter arguments to conan cache save/restore can be used to minimize data transport (at the cost of having a separate save of the "main" cache for the requirements of each package being built), but I wonder if there's something simpler I'm missing.
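
Roughly what I have in mind (an untested sketch; the graph/pkglist/tarball paths and the CONAN_HOME location are placeholders):

    # In the "main" cache: capture the requirements needed by the package about to be built
    conan list --graph=graph.json --format=json > deps_pkglist.json
    conan cache save --list=deps_pkglist.json --file=deps.tgz

    # In the per-build cache: restore them before launching the build
    CONAN_HOME=/tmp/per-build-cache conan cache restore deps.tgz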

memsharded commented 3 weeks ago

Odd, we are still encountering that with low probability on contended runs.

I think some distros like RH6 or the like had something in the Python sqlite installation that increased the likelihood, but we thought the new 10-second timeout was more than enough for the vast majority of cases.

I don't know enough about your use case; I guess the CI and the dependencies would have to be pretty large, otherwise maybe this is a bit too much of an optimization? I mean, if the server is located quite close to the CI (or even in the same cloud), then the transfer is super fast, so this wouldn't be much of an issue, and uploading packages to servers is the easiest approach. There is also the "download cache" mechanism, which can be shared across different caches on the same machine and saves most of the extra transfers, at the cost of more disk space and unzipping time.

It occurs to me that conan graph + conan list + the filter arguments to conan cache save/restore can be used to minimize data transport (at the cost of having a separate save of the "main" cache for the requirements of each package being built), but I wonder if there's something simpler I'm missing.

Not really; conan cache save/restore requires good knowledge of what is being built and where, and of what needs to be moved from one cache to another, so it makes sense that, to be efficient, the lists of packages to be saved/restored have to be captured correctly from other operations.

memsharded commented 3 weeks ago

See the conf core.download:download_cache for the download cache. I think it lacks more explicit docs, but it is there and can be used.
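
Roughly, it is a single line in the global.conf of each cache pointing at a shared directory (the path below is just an example):

    # global.conf (one per Conan cache, e.g. ~/.conan2/global.conf)
    core.download:download_cache=/shared/conan-download-cache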