flatpak / flatpak-builder-tools

Various helper tools for flatpak-builder

Allow using binary wheels instead of building from source #296

Open real-yfprojects opened 2 years ago

real-yfprojects commented 2 years ago

flatpak-builder version

1.0.10

Linux distribution and version

Ubuntu 20.04

Affected flatpak-builder tool

pip/flatpak-pip-generator

flatpak-builder tool cli args

No response

Source repository URL

No response

Flatpak-builder manifest URL

No response

Description

When generating a dependency file for the Python package cryptography from pypi.org, flatpak-pip-generator recognised that the package only supplies platform-dependent wheel distributions and therefore included the source distribution instead. However, the source distribution can only be built by compiling complicated Rust dependencies, so building the app with flatpak-builder fails. I solved this by replacing the source distribution with all the different wheel packages available, so that the right one is always there. Why can't flatpak-pip-generator do something similar?

PS: Which wheel distributions do I actually need for building the flatpak on aarch64 and x86?

TingPing commented 2 years ago

The tool is just very conservative. Source builds, when they actually build, will always work. Wheel binaries depend on host packages which are unknown and may not work.

This has been discussed elsewhere but never implemented. A flag such as --use-wheels-for=cryptography that lets you override this would be reasonable.

johannesjh commented 2 years ago

This would be awesome for installing packages like numpy, scikit-learn or pandas as binary wheels, since these are difficult to build and install from source packages (because of their many compile-time dependencies).

What would the output of flatpak-pip-generator look like? In case the wheels are platform-specific, I guess the output would have to specify multiple wheels, one for each target architecture, right?

How could this output be generated? The flatpak-pip-generator script currently calls pip to resolve packages. It could be possible to resolve packages for a specific target architecture by passing the --platform parameter to pip.
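
For illustration, a minimal sketch of such a pip invocation (the package name, platform tag and download directory are just examples, not what the script currently does):

import subprocess

# Hypothetical: download wheels for a foreign target platform with pip.
# pip requires --only-binary=:all: (or --no-deps) whenever --platform is given.
subprocess.run(
    [
        "pip", "download", "cryptography",
        "--dest", "downloads/",
        "--platform", "manylinux2014_x86_64",  # the target platform, not the host's
        "--only-binary", ":all:",
    ],
    check=True,
)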

real-yfprojects commented 2 years ago

What would the output of flatpak-pip-generator look like? In case the wheels are platform-specific, I guess the output would have to specify multiple wheels, one for each target architecture, right?

You simply add all the wheels needed to sources and specify their platform using only-arches.

real-yfprojects commented 2 years ago

How could this output be generated? The flatpak-pip-generator script currently calls pip to resolve packages. It could be possible to resolve packages for a specific target architecture by passing the --platform parameter to pip.

It is also easy to query PyPI for the list of wheels available for a given package. In fact, this is already implemented in the script.
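
For reference, a minimal sketch of such a query against PyPI's JSON API (the helper name is made up for illustration):

import json
import urllib.request

def get_release_files(package: str, version: str) -> list:
    """Return the list of files (wheels and sdist) that PyPI knows for one release."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # each entry carries "filename", "url" and "digests"["sha256"]
    return data["urls"]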

johannesjh commented 1 year ago

What would the output of flatpak-pip-generator look like? In case the wheels are platform-specific, I guess the output would have to specify multiple wheels, one for each target architecture, right?

You simply add all the wheels needed to sources and specify their platform using only-arches.

sounds good! I'm trying to think this through and make it as specific as possible. can you provide an example of how you would structure the flatpak json? a separate module for each platform-specific wheel, right?

It is also easy to query PyPI for the list of wheels available for a given package. In fact, this is already implemented in the script.

yes, querying pypi is the easy part... resolving the right wheel filename to install is the hard part. for this, the flatpak-pip-generator script currently relies on the pip download command to choose the right filename.

How then would we solve the hard part, i.e. how can we choose the right filename, the best wheel to install?

real-yfprojects commented 1 year ago

sounds good! I'm trying to think this through and make it as specific as possible. can you provide an example of how you would structure the flatpak json? a separate module for each platform-specific wheel, right?

This is how I did it. Pip will choose the correct wheel when the build process runs. Some of the wheels might not be needed for any build; in fact only two of them should be needed, since the Flatpak build server only builds for two architectures. The x-checker-data field is not used since the bot for that can't check platform-dependent wheels, AFAIK.

{
  "name": "python3-secretstorage",
  "buildsystem": "simple",
  "build-commands": [
    "pip3 install --verbose --exists-action=i --no-index --find-links=\"file://${PWD}\" --prefix=${FLATPAK_DEST} \"secretstorage\" --no-build-isolation"
  ],
  "sources": [
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/79/b2/78bd6b9705296a8030c398619c9dedaa0724199be800955a7c18a1e6a3ba/scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
      "sha256": "33cf061ed0b79d647a3e4c3f6c52c412172836718a7cd4d11c1318d083300133",
      "only-arches": ["aarch64"]
    },
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/43/bc/7130ffd49a1cf72659c61eb94d8f037bc5502c94866f407c0219d929e758/scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
      "sha256": "47464c110eaa9ed9d1fe108cb403510878c3d3a40f110618d2a19b2190a3e35c",
      "only-arches": ["x86_64"]
    },
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/58/be/06987c1268a5c6beea0fea7b3c25eb52839fa23693ab2f92b80721d78554/scikit_learn-1.1.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
      "sha256": "e851f8874398dcd50d1e174e810e9331563d189356e945b3271c0e19ee6f4d6f",
      "only-arches": ["aarch64"]
    },
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/72/7d/cbcad2588a4baf1661e43005a9c35a955ab38e247a943715d90a7c96e6b3/scikit_learn-1.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
      "sha256": "b928869072366dc138762fe0929e7dc88413f8a469aebc6a64adc10a9226180c",
      "only-arches": ["x86_64"]
    },
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/21/f1/08f5e313c028bfce28abc068ba5b6633ed95b767441b6e5271249ae65601/scikit_learn-1.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
      "sha256": "8ff56d07b9507fbe07ca0f4e5c8f3e171f74a429f998da03e308166251316b34",
      "only-arches": ["aarch64"]
    },
    {
      "type": "file",
      "url": "https://files.pythonhosted.org/packages/62/cb/49d4c9d3505b0dd062f49c4f573995977876cc556c658caffcfcd9043ea8/scikit_learn-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
      "sha256": "c2dad2bfc502344b869d4a3f4aa7271b2a5f4fe41f7328f404844c51612e2c58",
      "only-arches": ["x86_64"]
    }
  ]
}

real-yfprojects commented 1 year ago

How then would we solve the hard part, i.e. how can we choose the right filename, the best wheel to install?

I think Option B is the way to go, since it only adds code for passing the options --platform, --python-version, --implementation and --abi. However, we still need to know what to pass for these options, since the script should return (almost) the same results on every machine.
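To make that concrete, a rough sketch of the kind of fixed mapping the script could carry (the values are illustrative, assuming CPython 3.9 as in the GNOME 42 runtime; they are not taken from the actual script):

# Hypothetical mapping from Flatpak architecture to the pip options named above.
# The Python version and ABI would have to match the targeted runtime.
PIP_TARGET_OPTIONS = {
    "x86_64": [
        "--platform", "manylinux_2_17_x86_64",
        "--python-version", "39",
        "--implementation", "cp",
        "--abi", "cp39",
    ],
    "aarch64": [
        "--platform", "manylinux_2_17_aarch64",
        "--python-version", "39",
        "--implementation", "cp",
        "--abi", "cp39",
    ],
}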

johannesjh commented 1 year ago

This is how I did it.

great, thank you! so basically you downloaded a list of all wheels, filtered the list for the linux platform with desired architecture, and added the wheels as sources to the manifest. later on, during the build, pip install will pick whatever wheel it considers to be best from the available sources. that's clever, clean and easy, I like this solution!

johannesjh commented 1 year ago

However we still need to know what to pass for these options

draft:

I continued thinking and researching how to filter the list of wheel filenames (from the PyPI JSON), so that we only include roughly suitable candidates as sources in the build manifest:

real-yfprojects commented 1 year ago
* quick and dirty filtering using regexes could be used to blacklist strings such as `macos` or `win`. should be enough to get started.

A regex like win|macos?

In our case, the first list of tags would come from packaging.tags.sys_tags; this is a list of tags that are compatible with the current system, in preferential order.

Why should we use the current system's version? Flatpaks are usually built on the buildbot.flathub.org server, which probably uses a different version. Or maybe the build process uses a Python version bundled in the flatpak's runtime.

The second list of tags would come from parsing a wheel's filename from the pypi .json using packaging.tags.parse_tag;

I think packaging.utils.parse_wheel_filename(filename) would be the right function for that.
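
For example, a quick illustration of what that function returns (using a wheel filename from the manifest above):

from packaging.utils import parse_wheel_filename

name, version, build, tags = parse_wheel_filename(
    "scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
# name == "scikit-learn", version == Version("1.1.1"), build == ()
# tags is a frozenset containing cp310-cp310-manylinux_2_17_x86_64
# and cp310-cp310-manylinux2014_x86_64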

johannesjh commented 1 year ago

A regex like win|macos?

yes

Why should we use the current system's version?

you are right, it cannot always be used.

...I edited my above posting to better reflect this.

I think packaging.utils.parse_wheel_filename(filename) would be the right function for that.

yes, even better, thank you! (I had overlooked this function)

johannesjh commented 1 year ago

Pseudocode draft

# in the flatpak-pip-generator-script, 
# for all packages specified in the "--use-wheels-for" commandline argument
1. the script retrieves a package's .json description by fetching it from pypi,
   by querying, e.g., https://pypi.org/pypi/scikit-learn/json
2. within the package .json, the script looks up the right release (with the right package version number)
3. within the release .json, the script retrieves the list of filenames
4. the script filters the list of filenames using these criteria:
  a. discard the filename if it is not a wheel, e.g., by matching the filename against `\.whl$`
  b. discard the filename if none of its PEP425 tags can be found in the list of tags that are acceptable for the target runtimes.
5. the filtered list of wheel filenames is included in the output of flatpak-pip-generator, 
  as source files for the package.

(Note: I wonder if the above logic, based on the PyPI .json instead of downloaded wheels, could also be used in the main logic of the flatpak-pip-generator script. The script currently downloads binary wheels, only to replace them with source packages later on... we could skip the initial download. But this would maybe take things too far and make it difficult for the maintainers to accept the pull request if the change is too big. Better to start small.)
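
For step 4, a rough sketch of the filtering, assuming the set of acceptable tags for the target runtimes is already known (the tag values below are purely illustrative):

from packaging.tags import Tag
from packaging.utils import parse_wheel_filename

# Illustrative set of tags acceptable for the target runtimes (CPython 3.10 on
# x86_64 and aarch64, plus pure-Python wheels); the real set would be generated.
ACCEPTABLE_TAGS = {
    Tag("cp310", "cp310", "manylinux_2_17_x86_64"),
    Tag("cp310", "cp310", "manylinux_2_17_aarch64"),
    Tag("py3", "none", "any"),
}

def keep_filename(filename: str) -> bool:
    if not filename.endswith(".whl"):  # 4a: discard anything that is not a wheel
        return False
    _, _, _, wheel_tags = parse_wheel_filename(filename)
    # 4b: keep the wheel only if at least one of its tags is acceptable
    return not ACCEPTABLE_TAGS.isdisjoint(wheel_tags)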

real-yfprojects commented 1 year ago

@TingPing Why does the flatpak-pip-generator script use pip to determine the packages needed? Is the sole reason resolving dependencies?

real-yfprojects commented 1 year ago
# in the flatpak-pip-generator-script, 
# for all packages specified in the "--use-wheels-for" commandline argument

Maybe --force-wheels-for since the script already uses wheels if they are platform independent.

  1. the script retrieves a package's .json description by fetching it from pypi, by querying, e.g., https://pypi.org/pypi/scikit-learn/json

Before that, it should download all the packages and dependencies using pip. That way it determines which package versions are needed (and whether the packages are required in the first place).

real-yfprojects commented 1 year ago

I think for these changes it would also make sense to move more code into functions in the script and add an if __name__ == '__main__': section.

real-yfprojects commented 1 year ago

@TingPing Why does the flatpak-pip-generator script use pip to determine the packages needed? Is the sole reason resolving dependencies?

Can you answer this @TingPing or ping anyone that was involved in coding the flatpak-pip-generator?

real-yfprojects commented 1 year ago

@johannesjh I started working on this, but I don't find the code and some of the choices made in it easy to understand. Especially the duplication of calculations bothers me.

johannesjh commented 1 year ago

Yes me too. I am torn between patching what exists vs. attempting a lean rewrite.

A) Patching what exists: I guess this would mean keeping pip download because it offers dependency resolution, and existing users probably rely on it. But we can avoid multiple calls to pip download; these are not necessary in my opinion. It would be great if one of the maintainers could confirm, but I think the second call to pip download, where the script says it is downloading source packages, can be omitted and merged with the first call. How can this be done? pip supports command-line options (and requirements.txt files support these options as well) for specifying a preference for binary vs. source packages, e.g. --prefer-binary, --no-binary and --only-binary.

B) Attempting a lean rewrite: in a rewritten script, we could omit dependency resolution because other tools like pip freeze, pip-compile or pipenv are better at it. In other words, the script should expect a complete dependency tree with frozen package versions as input. This would reduce the script to collecting download URLs for one or multiple target platforms and to formatting the JSON/YAML output.

real-yfprojects commented 1 year ago

Attempting a lean rewrite: in a rewritten script, we could omit dependency resolution because other tools like pip freeze, pip-compile or pipenv are better at it.

We can still use some of the current code as a reference, e.g. for handling VCS dependencies. AFAIK pip freeze can only create a requirements file from a Python environment; it isn't able to resolve dependencies. The same goes for pip-compile. pipenv graph can give us a nice dependency tree; however, it uses pipdeptree in the background, which we can use directly to reduce dependencies. The problem with these approaches is that one always needs to create a complete virtual Python environment whenever the script is run. Ideally the script should be as short as possible, have as few dependencies as possible and be as fast as possible.

real-yfprojects commented 1 year ago

We could use pip's dependency resolver, which is a very low-level requirement for our script. The implementation of pip download can serve as a reference:

https://github.com/pypa/pip/blob/8a51fe790ae4df0601a9e6b7db3612e858cd1637/src/pip/_internal/commands/download.py#L77

Since the pip API is well written, this is a valid approach. I also found pipgrip, which is the kind of tool we need, though there are some known caveats. It links to resolvelib, of which pip uses a fork. If resolvelib serves our needs, we should check whether we can use the resolvelib fork incorporated into pip.

johannesjh commented 1 year ago

I fully agree that the script should be kept small and simple. Hence the idea of a lean rewrite... and that's why I am saying we could leave dependency resolution out of scope if that makes the script simpler. To explain again: there are many solutions for how users may want to resolve and freeze their dependencies. Some of these solutions also install the packages or require them to be installed by the user (e.g., pip freeze and pipenv), others don't (e.g., pip-compile); we really don't care. In a lean rewrite, we could take a fully resolved dependency tree with frozen package versions as input. The flatpak-poetry-generator script shows how short and simple the script could be with this approach.

(Alternatively, if you feel that we should keep dependency resolution as a feature of the script... then yes, we could use a package for that instead of calling pip download)

real-yfprojects commented 1 year ago

(Alternatively, if you feel that we should keep dependency resolution as a feature of the script... then yes, we could use a package for that instead of calling pip download)

I get your point. The script should then expect a pinned requirements file (including the transitive dependencies). Still, you should consider adding an extra script for generating such a requirements file, since many people either maintain no requirements file at all or only one with the top-level dependencies.

Alternatively, we could document the easiest way to obtain such a pinned requirements file in the README.

johannesjh commented 1 year ago

I created some code as a prototype for replacing the usage of pip download in situations where the flatpak-pip-generator script does not actually need to download a package, but only needs to find out the right download URL and hash to write into the manifest.

The prototype allows one to write, for example:

# (PythonInterpreter, Architecture and Release are classes defined in the gist linked below)
from pprint import pprint

def main():
    # define the target platform(s)
    gnome42_x86 = PythonInterpreter(3, 9, Architecture.x86_64)

    # define requirements (could be read from requirements.txt or other formats)
    pandas = Release("pandas", "1.4.4")

    # select the best wheel from pypi, for a given target architecture:
    wheel = pandas.wheel_for(gnome42_x86)

    # alternatively, select the source package
    sdist = pandas.sdist()

    # we now have all the data needed to write a flatpak manifest 
    pprint(wheel)
    pprint(sdist)

...which will print:

Download(filename='pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl',
         url='https://files.pythonhosted.org/packages/91/2e/f2e84148e71dda670b310f1f7b9a220181e5dc1fe2f9dcf6a8632412bf4e/pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl',
         sha256='a981cfabf51c318a562deb4ae7deec594c07aee7cf18b4594a92c23718ec8275')
Download(filename='pandas-1.4.4.tar.gz',
         url='https://files.pythonhosted.org/packages/1a/3f/bba4f9e41fff332415cdb08063b78a53c813aba1ac02887944657bb30911/pandas-1.4.4.tar.gz',
         sha256='ab6c0d738617b675183e5f28db32b5148b694ad9bba0a40c3ea26d96b431db67')

...implemented in less than 200 lines of code.

using just one dependency: PyPA's packaging package, which is also used internally by pip.

Note about implementation choices: I could alternatively have used pip's internal APIs to implement the same thing, i.e., using the implementation of pip download as a reference, as suggested in the discussion above. But pip's developers strongly discourage using pip's internal APIs. I followed their recommendation and used the APIs of PyPA's packaging package instead.

next steps:

I uploaded the full code in this gist: https://gist.github.com/johannesjh/2da0ffdc5458fd46b6c32dc7e84e4d30

@real-yfprojects what do you think about it?

real-yfprojects commented 1 year ago

Very nice, although I wouldn't have used OOP for a script, since the well-defined API introduces quite a lot of unused code.

johannesjh commented 1 year ago

I extended the prototype, published as a new version of the gist at https://gist.github.com/johannesjh/2da0ffdc5458fd46b6c32dc7e84e4d30. It can now parse requirements.txt files, and it produces flatpak build manifests as output.

About the OOP programming style: The script has now grown to over 400 lines... and yes, the object-oriented programming style makes it rather verbose, but hopefully also well-structured and easier to maintain. When I started writing the script, I originally wanted a more straightforward procedural style, but things got complex, so I wanted strictly typed data structures, started using dataclasses, and continued structuring the program in a more object-oriented way. That's how it came to be.
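
To illustrate what I mean by strictly typed data structures: the Download objects printed earlier boil down to a small frozen dataclass along these lines (a sketch, not the exact code from the gist):

from dataclasses import dataclass

@dataclass(frozen=True)
class Download:
    """One downloadable artifact (wheel or sdist) of a release."""
    filename: str
    url: str
    sha256: str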

About unused code: can you be more specific? If there is unused code, we can of course remove it.

example 1, simple commandline invocation:

python3 req2flatpak.py pandas==1.4.4 -t 39-linux-x86_64 39-linux-aarch64

example 2, invocation from a Python script... this opens up many possibilities for customization if needed:

from req2flatpak import FlatpakGenerator, PythonInterpreter, Arch, RequirementsTxtParser, PyPi

if __name__ == "__main__":
    # example demonstrating how to invoke req2flatpak from a python script:
    gnome42_x86 = PythonInterpreter(major=3,minor=9,arch=Arch.x86_64)
    gnome42_aarch64 = PythonInterpreter(major=3,minor=9,arch=Arch.aarch64)
    generator = FlatpakGenerator(interpreters=[gnome42_x86, gnome42_aarch64])
    reqs = RequirementsTxtParser.parse_string("""
    pandas == 1.4.4 
    """)
    output = generator.buildmodule_as_json(reqs)
    print(output)

the examples produce the following output:

{
    "name": "python3-package-installation",
    "buildsystem": "simple",
    "build-commands": [
        "pip3 install --verbose --exists-action=i --no-index --find-links=\"file://${PWD}\" --prefix=${FLATPAK_DEST} --no-build-isolation pandas"
    ],
    "sources": [
        {
            "type": "file",
            "url": "https://files.pythonhosted.org/packages/91/2e/f2e84148e71dda670b310f1f7b9a220181e5dc1fe2f9dcf6a8632412bf4e/pandas-1.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "sha256": "a981cfabf51c318a562deb4ae7deec594c07aee7cf18b4594a92c23718ec8275",
            "only-arches": [
                "x86_64"
            ]
        },
        {
            "type": "file",
            "url": "https://files.pythonhosted.org/packages/3f/ea/c80181902a2c9c15f796a0c729ca730052c5d95bfdc3689ad477e15f75d1/pandas-1.4.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
            "sha256": "9d2a7a3c1fea668d56bd91edbd5f2732e0af8feb9d2bf8d9bfacb2dea5fa9536",
            "only-arches": [
                "aarch64"
            ]
        }
    ]
}

johannesjh commented 1 year ago

what is still missing are heuristics/filters/rules for customizing the choice of download packages. for example:

My fear is that such needs will blow up the script's complexity if we start building commandline options for each and every customization need. I think it will be better to ask users to write their own python code. Default usage would not be difficult, as demonstrated in the above python code example, and advanced users could add their own customizations, e.g. in order to filter packages, to implement additional package indices, etc. So I guess it makes sense to review the script's programming API from this perspective.

johannesjh commented 1 year ago

I found a bug. The logic in my implementation of _linux_platforms is flawed... the functions from packaging.tags that I am calling rely on code that needs to run on the target machine, e.g., code that determines glibc versions. This has rather large implications for how to get a correct list of supported platform tags:

real-yfprojects commented 1 year ago

I think it makes sense to publish this as a package on PyPI. The code could then be split up into multiple files, while one could still download the core functionality as a single script. However, this might not be the right repository for maintaining such a package.

By the way, the OOP approach still needs 20 lines fewer than the current script.

About unused code: can you be more specific? If there is unused code, we can of course remove it.

Actually there is only a little: Release.sdist and the two __eq__ methods aren't used.

or else users of the script would have to provide the list of targeted platform tags as data when running the script. that would be rather tedious.

It would be sufficient if the user provided the glibc version(s) of the target platform(s). A maximum glibc version is only needed for generating a (finite) list of compatible versions.
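
As a rough sketch of that idea (illustrative only, not taken from the prototype): given an architecture and a maximum glibc version, the finite list of compatible manylinux platform tags can be enumerated without inspecting the host system.

def manylinux_platform_tags(arch: str, glibc_max: tuple = (2, 35)) -> list:
    """Enumerate manylinux platform tags up to the given glibc ceiling."""
    # manylinux_2_17 (a.k.a. manylinux2014) is the oldest baseline considered here
    tags = [f"manylinux_{glibc_max[0]}_{minor}_{arch}"
            for minor in range(glibc_max[1], 16, -1)]
    tags.append(f"manylinux2014_{arch}")  # legacy alias for manylinux_2_17
    return tags

For "x86_64" this yields manylinux_2_35_x86_64 down to manylinux_2_17_x86_64, plus the legacy manylinux2014_x86_64 alias.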

johannesjh commented 1 year ago

Big news, I published the req2flatpak script as a new project on github.

Some notes on current progress and next steps:

Code structure... I spent some time refactoring the script. I think that the code structure is now easier to understand. The data classes are mostly pure data, with almost no functionality. The behavior is implemented in a procedural way... but I still use classes to group related methods together. Programmatic use of the script now boils down to:

# (the classes used below can be imported from the req2flatpak script)
from req2flatpak import (DownloadChooser, FlatpakGenerator, PlatformFactory,
                         PypiClient, RequirementsParser)

platforms = [PlatformFactory.from_string("310-x86_64")]
requirements = RequirementsParser.parse_file("requirements.txt")
releases = PypiClient.get_releases(requirements)
downloads = {
    DownloadChooser.wheel_or_sdist(release, platform)
    for release in releases
    for platform in platforms
}
manifest = FlatpakGenerator.manifest(requirements, downloads)

...the above code shows how easy and straightforward it is to programmatically use the req2flatpak script, with the benefit that each of these steps can be customized if needed. A basic commandline (CLI) interface is still provided in the req2flatpak script, to make it easy to get started.

About the list of platform tags that we discussed in the above two comments: This is resolved now. I wrote a method PlatformFactory.from_python_version_and_arch(...) to generate a list of manylinux platform tags. The method is independent of the current Python interpreter and system; this is its main advantage over packaging.tags.sys_tags. The tags returned by this method are an approximation, trying to match what packaging.tags.sys_tags would return if invoked on a glibc-based Linux system with CPython. The approximation worked really well: the method returns the exact same tags as running packaging.tags.sys_tags on org.gnome.Platform//43, on both x86_64 and aarch64 architectures.

An update on my first practical experience with the new req2flatpak script: I personally started using req2flatpak in a first project, in favagtk. This was successful: The req2flatpak script reads a requirements.txt file with over 60 python packages and generates a flatpak build module. Some of the packages like scikit-learn and numpy are notoriously difficult to install from sdist (e.g., when using flatpak-pip-generator), but req2flatpak chooses suitable wheels. This makes the package installation easy and much faster.

Retrospective and next steps: What started as a feature request and prototype has turned into a separate project that I named req2flatpak. So I guess this is a moment for saying thank you and goodbye. Thank you @real-yfprojects for your constructive help and feedback. And thank you to the maintainers of flatpak-builder-tools for the prior work, and for hosting this conversation up until now. As for the future, let's stay friends! I am explicitly saying this because sometimes maintainers don't like to see diverging projects or forks... should this be the case, feel free to contact me; I am open to many ways of cooperation, including the possibility of contributing req2flatpak into the popular flatpak-builder-tools project, to the benefit of many.

real-yfprojects commented 1 year ago

Great @johannesjh! I will do a complete code review soon.

Fingel commented 2 months ago

Still running into this in 2024, it seems. I had a transitive dependency on Pydantic 2, which is a very popular library now. Unfortunately, the source dist requires a Rust toolchain. Having to manually edit in the appropriate wheel is not ideal.