Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

Merge all packages in a single pypi package (for real) #815

Closed Fale closed 6 years ago

Fale commented 8 years ago

Having tens of PyPI packages that are kind of united, but not really, makes it very difficult to package this in distros. Please fix it (also because, due to how Python includes work, there is no real advantage in having a huge number of small packages).

glaubitz commented 7 years ago

That doesn't work because the azure-cli releases depend on the individual components of the SDK, not on the SDK as a whole.

Sorry, I wasn't clear enough then. I was not talking about creating a single RPM package, but about using the GitHub tarball as a single source. Not because I particularly prefer GitHub over PyPI, but rather because the packages on PyPI are either outdated or broken.

a.) Create 1 package for SDK, as we pretty much do in openSUSE right now and then have a very long list of Provides: statements where each Provides lists a component. This list is going to be a PITA to maintain and will inevitably be wrong and cause headaches

I agree and that's definitely not what I want. However, having to pull every

b.) Package each individual component of the SDK, the approach we are now taking.

That's definitely what I want to do. However, my problem currently is that I don't know for sure which set of packages I should use.

Should I:

a) Use the v2.0.0rc6.tar.gz as the source for all base packages and create RPMs from that? I have written a small script which creates the individual .zip files for all individual packages. Then complement these RPM packages with the remaining packages from our list, just using the latest available release version for each package.

or

b) Just use the latest tarball available in the github "Releases" tab, unpack that archive and generate the individual .zip files from there? For example, downloading https://github.com/Azure/azure-sdk-for-python/archive/azure-keyvault_0.3.3.tar.gz, unpacking it and creating the individual .zip files using that archive.

The reason I ask is that each of these tarballs always contains the complete SDK and not just azure-keyvault, for example. Thus, when I download and use the tarball azure-keyvault_0.3.3.tar.gz, can I still assemble a working SDK from that, or does that only work with the v2.0.0rc6.tar.gz tarball, as it has been tagged as a release of the whole SDK?

It's just confusing that the individual packages and the complete SDK show up on the same "Releases" tab. A release normally indicates something that is stable, or at least beta, that users can download and use. That's why tagging releases for the individual packages while each tarball still contains the complete SDK is confusing as hell.

Adrian

glaubitz commented 7 years ago

To elaborate a little more: I just ran my small script over the unpacked azure-keyvault_0.3.3.tar.gz and it created azure-2.0.0rc7.zip among others, so the resulting SDK I got is something between rc6 and rc7 (since rc7 has not been officially tagged yet).

Fale commented 7 years ago

@glaubitz I'm not really sure that managing tens of highly coupled packages is any easier... and that's why I opened this ticket (aka: I think it is not possible to manage this project in a sensible way, and this is why - given the IMHO unsatisfactory answers - I'm not packaging this for Fedora/EL)

glaubitz commented 7 years ago

@Fale But you can just download the tarballs from the GitHub releases page and you get all modules in a single tarball. In fact, that's what @irl is doing for Debian, and since the Debian version is currently at v2.0.0rc6, it has fewer modules than are currently visible in the GitHub repository.

I generally don't have a problem juggling with a large number of sources - it's just a matter of good packaging tools after all - I'm just confused as to which versions to use for a stable distribution.

lmazuel commented 7 years ago

Hi @glaubitz

I understand it's complicated, really, :/. Github is not really built to host several packages in one repo. Here are some answers:

Let's be pragmatic about what you want (before talking about how to do it): do you want to release one package, as Debian does, like python-azure 2.0.0rc6? Or separate packages for each component? As @rjschwei was saying, the CLI is using each component package independently, so we might have an issue with that. Let's say we can sync the azure-cli and azure-sdk bundle packages, do you want to:

Once I get what you want the user experience to be, we will figure out the "how".

glaubitz commented 7 years ago

Hi @lmazuel

I understand it's complicated, really, :/. Github is not really built to host several packages in one repo.

You could put each package into a separate git repository and then use git submodules to reference these modules in the git repository for the whole package. Lots of projects actually do this when they use third-party libraries like ffmpeg.

Tag are on purpose "_" and are made just for the specific package mentioned in the tag. I don't recommend for instance to use tag "azure-keyvault_0.3.3" to install "azure-mgmt-compute"

Ok, so this means that despite azure-keyvault_0.3.3 containing the whole SDK, I should just always assume the remaining packages are effectively git snapshots and should not be used for anything but development. Thus, when I download azure-keyvault_0.3.3, the azure-mgmt-compute package inside this tarball is probably version 1.0.0rc1 plus some extra commits and shouldn't be used for production.

Thus, anyone wanting to use the releases from github really needs to download the tarball separately for each tagged package version.

Tag like "v2.0.0rc6" are also intended to be accurate for "azure 2.0.0rc6" only, even if I'm pretty sure the state of the repo at this state was correct, according to the content of v2.0.0rc6

Isn't azure supposed to be the primary meta package which allows installing the whole SDK in one step? I'm not sure what the point of tagging a version number for the whole SDK would be if it doesn't mean the generated tarball produces something that works.

Packages on PyPI are not outdated; I'm surprised you got issues. About the issues you found, could you send me a more detailed email at @microsoft.com?

Sure. Will do that once I have finished writing this message ;).

Let's be pragmatic about what you want (before talking about how to do it): do you want to release one package, as Debian does, like python-azure 2.0.0rc6? Or separate packages for each component?

I want to release separate components. But I also want that these components work with each other, at least that's what users are going to expect. If they use the package manager to install azure, they expect to get the SDK installed ready to be used without having to replace individual components.

For me as the packager, it doesn't really matter whether the whole SDK is released in one tarball or as individual packages. I am writing some simple scripts that will help me deal with the upstream format to generate the RPM packages. What matters is that I know which versions I have to use to be able to assemble something that is going to work in the end on the users side.

For example, if you have released any of the packages in a version which breaks compatibility with most of the other packages, I will naturally not use the latest version of that particular package. I will use the version which is still compatible with the rest and only update once all the other packages have made the transition upstream.

As @rjschwei was saying, the CLI is using each component package independently, so we might have an issue with that. Let's say we can sync the azure-cli and azure-sdk bundle packages, do you want to:

Release python-azure x.y.x with a lot of packages

Yes, that's what I want. But again, creating a single package out of individual packages or vice versa is not the actual problem. The problem I have is that I don't know which versions are compatible with each other to form a complete, working SDK.

Once I get what you want the user experience to be, we will figure out the "how".

So, here's what I suggest:

If I understand correctly, all the various packages are developed separately. So, these packages should naturally end up in separate git repositories. Then use git submodules to link the packages in the main git repository of the Azure SDK. Git submodules allow linking specific commits of another repository. Thus, you are able to assemble the SDK from specific versions that are known to work together and you always have something releasable.

If users want to use individual packages, they'll download the tagged tarball from the corresponding package's repository. If they want the whole SDK, they just download the latest tagged version as a tarball.

rjschwei commented 7 years ago

@lmazuel

Ideally we'd have 1 upstream tarball for each, the SDK and the CLI such that we can create

python-azure-sdk-x.y.z and azure-cli.a.b.c packages with azure-cli.a.b.c depending on python-azure-sdk-x.y.z

That's how the other guys do it ;) aws-cli has only a few dependencies with python-botocore being the equivalent to azure-sdk as the primary dependency.

Anyway, I understand, as does probably everyone else interested in this topic, that there are tradeoffs either way, and going with a development model of individual components is just as valid a choice as going with a development model that keeps everything together. However, with the chosen model of many components, people downstream (packagers or direct users) still need to have some moment in time every now and then where all the pieces fit together. Based on the findings of @glaubitz, this point in time is incredibly difficult to determine.

So somehow a mechanism should exist that allows us to pull what would be considered a consistent SDK. If the answer to that is "whatever is on pypi" then that's OK, and maybe we just have to clean up a few things that @glaubitz ran across on pypi and then we are good to go.

lmazuel commented 7 years ago

@glaubitz @rjschwei

About SDK consistency:

Also, the source code truth is the sdist on PyPI. It's easy to get with XMLRPC, example for azure-keyvault 0.3.3:

import xmlrpc.client
client = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")
[pkg['url'] for pkg in client.release_urls('azure-keyvault', '0.3.3') if pkg['python_version']=='source'][0]

gives https://pypi.python.org/packages/82/8b/9761cf4a00d9a9bdaf58507f21fce6ea5ea13236165afc0a0c19a74ac497/azure-keyvault-0.3.3.zip
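As a complementary sketch, the same lookup can probably be done against PyPI's JSON API, which serves release metadata at https://pypi.org/pypi/NAME/VERSION/json and marks source distributions with "packagetype": "sdist". The helper names and the sample payload below are hypothetical; the sample is hand-written for illustration and is not real PyPI output.

```python
import json
import urllib.request


def sdist_url(release_json):
    """Return the URL of the first source distribution in a release payload."""
    for entry in release_json.get("urls", []):
        if entry.get("packagetype") == "sdist":
            return entry["url"]
    return None  # no source distribution was published for this release


def fetch_release_json(name, version):
    """Download release metadata from PyPI's JSON API (needs network access)."""
    url = "https://pypi.org/pypi/{}/{}/json".format(name, version)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Hand-written sample payload (assumption: only the fields used above).
SAMPLE = {
    "urls": [
        {
            "packagetype": "bdist_wheel",
            "url": "https://example.invalid/azure_keyvault-0.3.3-py2.py3-none-any.whl",
        },
        {
            "packagetype": "sdist",
            "url": "https://example.invalid/azure-keyvault-0.3.3.zip",
        },
    ]
}
```

With network access, `sdist_url(fetch_release_json("azure-keyvault", "0.3.3"))` would be the equivalent of the XML-RPC call; `sdist_url(SAMPLE)` exercises the selection logic offline.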

I'll discuss it with the CLI team today, and I'll see if we can sync our releases (for instance every 6 months). I want to release a 2.0.0, and I will try to use the exact same packages as CLI 2.0.6. This way you can package azure-python-sdk 2.0.0 as a whole, and package azure-python-cli 2.0.6 as a whole as well, depending on azure-python-sdk 2.0.0.

Thoughts?

FYI @johanste

glaubitz commented 7 years ago

On Mon, May 15, 2017 at 10:44:03AM -0700, Laurent Mazuel wrote:

About SDK consistency:

  • For packages that depend on msrestazure, it must be ">= 0.4". This is the only condition, meaning you can install azure-mgmt-resource 0.30.0rc6 and azure-mgmt-compute 1.0.0rc2 together with no issue. It's consistent in terms of installation, it's just weird in terms of features.
  • Packages that do not depend on msrestazure (I think there are only three: azure-servicebus, azure-servicemanagement-legacy and azure-storage) are independent and consistent from version 0.20.0.

Thanks. This answers my question.

Also, the source code truth is the sdist on PyPI. It's easy to get with XMLRPC, example for azure-keyvault 0.3.3:

import xmlrpc.client
client = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")
[pkg['url'] for pkg in client.release_urls('azure-keyvault', '0.3.3') if pkg['python_version']=='source'][0]

gives https://pypi.python.org/packages/82/8b/9761cf4a00d9a9bdaf58507f21fce6ea5ea13236165afc0a0c19a74ac497/azure-keyvault-0.3.3.zip

Aha, I wasn't aware of that. Thanks for the heads-up!

I'll discuss it with the CLI team today, and I'll see if we can sync our releases (for instance every 6 months). I want to release a 2.0.0, and I will try to use the exact same packages as CLI 2.0.6. This way you can package azure-python-sdk 2.0.0 as a whole, and package azure-python-cli 2.0.6 as a whole as well, depending on azure-python-sdk 2.0.0.

Thoughts?

We wanted to have separate packages in SUSE anyway, so that isn't important. I really just wanted to know whether the version dependencies are critical.

Thanks, Adrian

rjschwei commented 7 years ago

On 05/15/2017 01:44 PM, Laurent Mazuel wrote:

@glaubitz @rjschwei

I'll discuss it with the CLI team today, and I'll see if we can sync our releases (for instance every 6 months). I want to release a 2.0.0, and I will try to use the exact same packages as CLI 2.0.6. This way you can package azure-python-sdk 2.0.0 as a whole, and package azure-python-cli 2.0.6 as a whole as well, depending on azure-python-sdk 2.0.0. Thoughts?

That would be great but would require significant changes in the setup of the CLI, i.e. within the components of the CLI the dependencies in setup.py could no longer refer to the individual components of the SDK.

That's a bunch of work that will probably not fit the development model.

lmazuel commented 7 years ago

@rjschwei I'm not sure I get your issue? When you install a distro package like python-azure-sdk, I think you install the necessary "dist-info" folders, so pip is not able to tell the difference between a yum installation and a pip installation, correct? So here, if your python-azure-cli depends on python-azure-sdk, and I took care to keep them in sync, this should, on the contrary, make your life easier?

@irl what do you think about that? Because if trying to sync SDK and CLI bundles makes no sense, I have no reason to do it.

rjschwei commented 7 years ago

@lmazuel , sorry for falling off the face of the planet for a bit and creating a large time gap in the discussion.

You are correct that the installed rpm package will also leave behind the Python information to satisfy installing the CLI bits. Thus if SDK and CLI releases can be synced such that cli-a.b.c depends on sdk-x.y.z, then we could go to a one package model and we'd basically have 1 dependency in the CLI package.

However, my concern with this approach would be that, to the best of my knowledge, no tools exist today to ensure this consistency. Of course such tools can be created, but in a sense these tools would counteract the development separation that at this time has been instituted in the SDK and CLI projects.

So if you'd go through the effort to sync everything, which would really be nice for packagers, I think the development model would have to change to a certain degree.

Getting everything in sync would basically mean collecting all the components and verifying that their dependencies are consistent within each of the SDK and the CLI, and consistent across the boundary. Creating a tool that ensures such consistency should be reasonably straightforward, but it still has to be created and maintained.

However, during the "development phase" this consistency is not necessarily given, meaning CLI component A may depend on version X of SDK component H and CLI component B may depend on version Y of SDK component H. That is fine as long as, at the end of the development cycle, both CLI components A and B depend on the same version of SDK component H. This drift makes testing very difficult. Also, when there is a security issue, because continuous testing is difficult, it will not be a good idea to release the security fix from the development branch. The security fix will have to be inserted in two places: the development code, and the current consistent (synced) code, with a point release off the previous consistent set. This of course can all be managed, but the point is that developers on two teams will have to work more closely together than it appears was intended when the current development model was chosen.
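A minimal sketch of the kind of drift check described above, assuming each CLI component's SDK dependencies have already been collected into a plain dictionary of exact pins. The function name, component names and versions are all hypothetical; neither project ships such a tool as far as this thread shows.

```python
from collections import defaultdict


def find_pin_conflicts(cli_requirements):
    """Return SDK packages that CLI components pin to different versions.

    cli_requirements maps a CLI component name to a dict of
    {sdk_package: pinned_version}. The result maps each conflicting SDK
    package to {version: [CLI components that pin that version]}.
    """
    pins = defaultdict(lambda: defaultdict(list))
    for cli_component, deps in cli_requirements.items():
        for sdk_package, version in deps.items():
            pins[sdk_package][version].append(cli_component)
    return {
        sdk_package: {version: sorted(users) for version, users in by_version.items()}
        for sdk_package, by_version in pins.items()
        if len(by_version) > 1  # more than one distinct pin means drift
    }


# Hypothetical input: components A and B disagree on SDK component H.
REQUIREMENTS = {
    "cli-component-a": {"sdk-component-h": "1.0.0", "sdk-component-k": "0.3.0"},
    "cli-component-b": {"sdk-component-h": "1.1.0", "sdk-component-k": "0.3.0"},
}

conflicts = find_pin_conflicts(REQUIREMENTS)
```

An empty result would indicate that, at least for exact pins, the set is consistent and could be treated as a synced release candidate.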

If we look at the same problem using the many-packages approach we can still get into a similar situation, if SDK component H gets a security fix and the version gets advanced. Now the CLI may be broken. However, because the dependencies are not conglomerated we know exactly which CLI packages need potential updates to accommodate the version bump of SDK component H.

To make a long story short, a sync will make initial packaging easier, but individual packages will make dealing with version bumps due to security issues easier.

One thing that would help tremendously would be if you and the CLI team can commit to semantic versioning (http://semver.org/) at all levels and change the dependencies in all setup.py files accordingly, i.e. every dependency should be >= one major version and < the next major version. There should not be any exact version matches enforced. If we can get to that point, managing the plethora of packages will be reasonably straightforward.
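To illustrate the dependency range suggested above, here is a small sketch that builds a setup.py-style specifier from a minimum version and checks whether a candidate version falls inside it. It assumes plain X.Y.Z versions only (pre-releases such as 1.0.0rc2 are rejected) and treats the upper bound as exclusive, on the assumption that the next major version is the first potentially breaking one under semantic versioning. The helper names are hypothetical.

```python
import re

_PLAIN_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(version):
    """Parse a plain 'X.Y.Z' string into a tuple of ints."""
    match = _PLAIN_VERSION.fullmatch(version)
    if match is None:
        raise ValueError("expected a plain X.Y.Z version, got {!r}".format(version))
    return tuple(int(part) for part in match.groups())


def major_range(minimum):
    """Build a '>=minimum,<next-major' specifier string."""
    major = parse_version(minimum)[0]
    return ">={},<{}.0.0".format(minimum, major + 1)


def in_major_range(candidate, minimum):
    """True if candidate is at least minimum and below the next major version."""
    low = parse_version(minimum)
    return low <= parse_version(candidate) < (low[0] + 1, 0, 0)


# Example requirement string for an install_requires entry.
requirement = "azure-mgmt-compute" + major_range("1.0.0")
```

Note that semantic versioning makes no stability promise for 0.y.z versions, so a range like this is only meaningful once a package has reached 1.0.0.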

lmazuel commented 6 years ago

Closing, in favor of https://github.com/Azure/azure-sdk-for-python/issues/1295