Proposing an extra package metadata file #435

Closed daveverwer closed 3 years ago

Sven and I agreed that we should start a slightly more structured issue around the proposal of a metadata file that packages may add. So here goes.

Why?

The primary goal of this site is to help people make better decisions about the packages they are choosing. The metadata we currently use to help people make those decisions comes from the manifest, the repository, and GitHub. There's more we could do with better information though, so we're considering proposing a standard metadata file that package authors can use to inform the Swift Package Index better. It would also mean that other hosting providers (for example GitLab and Bitbucket, as well as self-hosted repositories) would be on equal footing with GitHub.

What is the file, and where is it stored?

We see this metadata file hosted in the root of a package repository. It's better for everyone if the package author is in control of the file, and of course, it means that other projects can also take advantage of the information.

What information would it include?

This thread aims to gather feedback from the community on what information would be useful in this metadata file. I will update the list here as feedback comes in. Here's what we have so far:

Package information:
- Abstract (thanks Erica)
  - A shorter, one line description.
- Description (details here)
- Categories/Tags:
  - Manually defined (thanks to Erica)
  - Some tags could potentially be derived from imported frameworks (see here). This would not be in the metadata file of course, but I think is worth mentioning here.
- Home page URL
- Documentation page URL
- Example project URL (thanks James)
- Auxiliary URLs (thanks Dave)
  - A set of any number of URLs to other resources. We’d need to capture both a URL and the text to use as the link.
- Is the package deprecated?
  - If so, is there a successor package?
- Related packages: (multiple items of) (thanks James)
  - Package URL (The name and all other metadata can be derived from this)
- License (thanks Johan) - See comments here too.
- Maintainer: (thanks Erica) (exactly one of)
  - Email address

Author information: (multiple sets of)
- Name
- Email address
- Personal URL (Home page, Twitter, GitHub page, etc…)

Funding/sponsorship/donation information: (thanks Max)
- Does the package accept funding?
- Funding URL

Other platform support status:
- Linux (thanks Max)
  - A boolean is the simplest way to declare support.
  - Explicit support for named versions of Linux is more comprehensive.
- Windows, and other platforms as they are added. (thanks Erica)

By definition, as the file will not exist in all repositories, all of this data will be optional. No package author should be required to add any data that they are not comfortable with sharing.
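To make the list above a little more concrete, here is a minimal sketch of how an indexer might model the file once it has been decoded. The field names (`abstract`, `homepage_url`, `supports_linux`, and so on) are placeholders made up for illustration, not a proposed format. The only point is that every field defaults to "absent", matching the rule that all of this data is optional.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Author:
    # Hypothetical shape for one "Author information" entry.
    name: Optional[str] = None
    email: Optional[str] = None
    url: Optional[str] = None


@dataclass
class PackageMetadata:
    # Every field defaults to "absent" because nothing in the file is required.
    abstract: Optional[str] = None
    description: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    homepage_url: Optional[str] = None
    documentation_url: Optional[str] = None
    example_project_url: Optional[str] = None
    deprecated: Optional[bool] = None
    maintainer_email: Optional[str] = None
    authors: List[Author] = field(default_factory=list)
    funding_url: Optional[str] = None
    supports_linux: Optional[bool] = None


def metadata_from_dict(raw: dict) -> PackageMetadata:
    """Build metadata from a decoded file, skipping keys this sketch does not know."""
    known = {f.name for f in fields(PackageMetadata)} - {"authors"}
    values = {key: value for key, value in raw.items() if key in known}
    authors = [
        Author(name=a.get("name"), email=a.get("email"), url=a.get("url"))
        for a in raw.get("authors", [])
    ]
    return PackageMetadata(authors=authors, **values)
```

A repository with no metadata file at all would then be `metadata_from_dict({})`, which is just a `PackageMetadata` with every field unset.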

Mattt raised a valid issue with versioning the metadata file here. For the Swift Package Index, while we would potentially store this information against the versions, we would use the latest version of the file on the default branch to build the package pages.

Structured or Unstructured data

There's an argument for making all non-technical metadata a "tag"-like structure, for example indicating categories, Linux support, and other things with string tags, as mentioned by Erica here.

We want to bring as much of the information that's needed to judge the quality of a package into one place. For example, instead of having to check how many pull requests/issues there are and when the last one was closed, we bring that in automatically, right alongside information about what versions of Swift the package supports, and whether the stable release is the right one to target, or if there's actually a beta which would better suit your needs.

All of that data so far is structured as it comes from the manifest, from GitHub, and from the repository itself. There's a place for unstructured/tag-based data, but I don't think it completely replaces the need for structure.

We also want to use some of this structured data to drive a "quality score" for a package. I don't think it's clear yet whether this quality score would be made public or just used internally for search ranking (we have a version of this already); there are pros and cons to both. But if metadata is just tag-based, it's much harder to do that, especially when tags can be typed incorrectly or interpreted in different ways (do linux and ubuntu-18.04 get points for supporting Linux, where ubuntu1804 doesn't?). It's definitely a trade-off. Just a note: I'm not saying packages would definitely get an increased score for supporting Linux, it's just an illustration.
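As a rough illustration of that trade-off, the sketch below compares guessing Linux support from free-form tags with reading an explicit field. The pattern, the function names, and the `supports_linux` key are all assumptions made up for this example. Nothing here reflects how the Swift Package Index scores packages.

```python
import re

# Heuristic pattern: "linux", or "ubuntu" followed by a version such as 18.04 or 1804.
LINUX_TAG_PATTERN = re.compile(r"^(linux|ubuntu[-_ ]?\d{2}\.?\d{2})$", re.IGNORECASE)


def linux_from_tags(tags):
    """Guess Linux support from free-form tags. Misspelled tags are silently missed."""
    return any(LINUX_TAG_PATTERN.match(tag.strip()) for tag in tags)


def linux_from_field(metadata):
    """Read an explicit boolean. None means the author did not say (or said it wrongly)."""
    value = metadata.get("supports_linux")
    return value if isinstance(value, bool) else None
```

With this pattern, `linux`, `ubuntu-18.04`, and `ubuntu1804` are all recognised, but a typo such as `ubunutu-18.04` or an unlisted distribution is not, and the author gets no signal that anything went wrong. The structured field has exactly three states, which is much easier to score against.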

Scope of this thread

I think it's worth keeping the discussion to the information at this point, rather than the specifics of the format that we’ll use to represent it. That's a separate discussion, and the data we decide to include will influence the format.

At this point, we should include ALL suggestions for metadata. Of course, it's fine to put forward views on why you feel a piece of metadata shouldn't be considered, but I won't remove any until we've got a comprehensive collection of everything under consideration. I’ll keep the list above up to date as more suggestions are added in the comments below.

Eventually moving metadata into Package.swift

If this process is successful, it's worth considering whether this metadata should be merged into the Package.swift manifest through the evolution process. It's an idea, but probably only for some of the metadata.

The package manifest holds information about the technical details of the package, and I think we should be careful about mixing descriptive metadata in with that. So, if we see that Linux support is something people use this metadata file for, we think that would make a great addition to the official manifest. However, for things like description, tags, author information, etc… a separate file feels better.

A note about the package description:

We currently get the package description from the GitHub repository as all packages are currently hosted there. We’d rather that not be a hard requirement in the future, and while other hosting services will surely have a repository description, it seems like a sensible thing to do to have the package description be in the metadata file as well.

A few thoughts:

Although tags are manually defined, I think it highly beneficial to establish a tag registry for high quality standard tagging.
I would like to see a one-line abstract in addition to a discussion. I believe there's value in having short and long discussions as a fundamental part of the metadata.
Maintainer is a different category from author or authors or contributors. A maintainer email is a single point of contact, which may be to a single person or to an organization.
Linux support status I think may be better defined as platforms, as Linux, Windows, Amazon, as well as phone, watch, mac, tv can be target platforms.
Agreed that package descriptions should not be specific to a hosting platform.

Scattered thoughts

Maybe a way to have other URLs? For example "a URL to a blog post about this" or "a URL to a Swift Forum thread where this was discussed"
I'm not sure the linux support is necessary, since that should be in Package.swift as one of the supported platforms.
Importing tags should not be transitive (i.e., don't union together the tags of dependencies), since there's no way to know whether a dependency is part of the package's public interface or internal implementation.

The license under which the package is available. (Or is the current loosely defined license (case insensitive) file with or without txt or md or other extension sufficient?)

We are currently reading license information from GitHub's API (and that is tied to the license file). This would be a case where having it in the metadata file would be redundant but also beneficial to index projects from other hosting providers.

Perhaps out of scope for something like this but something to indicate where an example project is could be helpful? This might be relative to the root of the repository or an entirely separate repository. This could, in the future, lead to something similar to pod try.

Also maybe some way of doing related packages? Take a package like Alamofire which then has AlamofireImage and AlamofireNetworkActivityIndicator (owned by the same organisation) but also more community-led extensions like Nimble has extensions for SnapshotTesting.

edit://

In theory this second part could be automated with some sort of reverse dependency detection: in essence, looking through every SPM module in your database which has a dependency on the subject module. Though this would likely be better as a standalone page (especially for modules like Alamofire), with the metadata file being more controlled/maintainer-led.
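A minimal sketch of that reverse lookup, assuming the index can already produce a mapping from each package to its direct dependencies. The mapping below is hypothetical sample data keyed by package name for brevity. A real index would more likely key on package URLs.

```python
from collections import defaultdict


def reverse_dependencies(dependencies):
    """Invert a package -> direct dependencies mapping into package -> dependents."""
    dependents = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(package)
    # Sort so the result is stable regardless of insertion order.
    return {name: sorted(users) for name, users in dependents.items()}


# Hypothetical index contents, for illustration only.
index = {
    "AlamofireImage": ["Alamofire"],
    "AlamofireNetworkActivityIndicator": ["Alamofire"],
    "Alamofire": [],
}
related = reverse_dependencies(index)
```

Here `related["Alamofire"]` lists both extension packages. Note that this only finds packages that depend on the subject package, so a curated "related packages" list in the metadata file could still add entries the dependency graph would never surface.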

Just a clarification on the endorsements idea: Although I think self-proclaimed endorsement is a fine idea, I actually imagined the Swift Package Index maintaining a list used for whatever small number of endorsements were deemed worth exposing in the index. SSWG endorsements, for example, grow slowly enough it would actually be reasonable to track them manually, although a website scrape of the official webpage would also probably be reasonable.

It’s still metadata so no reason not to have it in this wonderful list of possibilities.

GitHub Action build statuses? Metadata could include Action names so badges can be grabbed easily to be displayed.

I say this without knowing how the GitHub API is for looking up build information, other than looking in the repo manually.

Thanks Max!

GitHub Action build statuses? Metadata could include Action names so badges can be grabbed easily to be displayed.

I think the best way to include this data is via the README file. Most people put this kind of badge at the top of the README and we're planning to implement that with #410. Do you think that'd fill the need? I'm happy to add it to the list, but I have a feeling the README may be a more flexible way to achieve the same goal.

I say this without knowing how the GitHub API is for looking up build information, other than looking in the repo manually.

It's not about how to do it at this stage 👍

Just a clarification on the endorsements idea: Although I think self-proclaimed endorsement is a fine idea, I actually imagined the Swift Package Index maintaining a list used for whatever small number of endorsements were deemed worth exposing in the index.

Thanks for clarifying this Max, and I agree. Removing this from the metadata above and re-opening your existing issue so we can track that separately.

Perhaps out of scope for something like this but something to indicate where an example project is could be helpful?

Thanks James! Added above. We also have some pretty neat ideas about a pod try kind of situation, you might want to listen to this to find out a bit more. 🚀 Especially Sven's story at the start.

Also maybe some way of doing related packages? Take a package like Alamofire which then has AlamofireImage and AlamofireNetworkActivityIndicator (owned by the same organisation) but also more community-led extensions like Nimble has extensions for SnapshotTesting.

Interesting idea, added above. There's an argument for these to be manual, since the package author might want them to be carefully curated.

The license under which the package is available. (Or is the current loosely defined license (case insensitive) file with or without txt or md or other extension sufficient?)

We are currently reading license information from GitHub's API (and that is tied to the license file). This would be a case where having it in the metadata file would be redundant but also beneficial to index projects from other hosting providers.

Yea I think we need to be very careful letting packages self-describe their license. It could lead to us indicating that packages are licensed with a certain license when they are not. Happy to continue the discussion here, but I'll not add this above yet.

Yea I think we need to be very careful letting packages self-describe their license. It could lead to us indicating that packages are licensed with a certain license when they are not. Happy to continue the discussion here, but I'll not add this above yet.

Best I can think of here actually would be to let packages self-describe their license, but to use GitHub's license as an override if it's available.
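Roughly, that precedence could look like the sketch below. The function name and the "source" labels are invented for illustration, and the two inputs stand in for whatever the hosting provider reports and whatever the metadata file claims.

```python
def resolve_license(self_described, provider_detected):
    """Prefer the hosting provider's detected license; fall back to the author's claim."""
    if provider_detected:
        return provider_detected, "provider"
    if self_described:
        return self_described, "self-described"
    return None, "unknown"
```

Returning the source alongside the license would let the package page say where the value came from, which goes some way towards the concern about presenting a self-described license as fact.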

I've added it above

I think the best way to include this data is via the README file. Most people put this kind of badge at the top of the README and we're planning to implement that with #410. Do you think that'd fill the need? I'm happy to add it to the list, but I have a feeling the README may be a more flexible way to achieve the same goal.

Yes absolutely that’ll do it! Didn’t realise there was a potential plan to include the README 👍 I’ll watch that issue.

Thanks Dave!

Maybe a way to have other URLs? For example "a URL to a blog post about this" or "a URL to a Swift Forum thread where this was discussed"

Interesting, almost like a collection of auxiliary URLs. There's an argument to say these should be in the README (which we're planning to bring into the package page with #410) but also added above.

I'm not sure the linux support is necessary, since that should be in Package.swift as one of the supported platforms.

I am 100% on board with this, but as I understand it this was considered when platforms were added to Package.swift and didn't make it in. I'd rather have some way to specify it now, and then deprecate it in favour of the Package.swift version if it ever gets reconsidered and added there.

Importing tags should not be transitive (i.e., don't union together the tags of dependencies), since there's no way to know whether a dependency is part of the package's public interface or internal implementation.

I agree completely, that'd be chaotic 🙈

Thank you Erica!

Although tags are manually defined, I think it highly beneficial to establish a tag registry for high quality standard tagging.

That's a really interesting idea, and I like it.

My guess is that we'd need a tool to validate these files at some point, would you see this tag validation being part of that tool?

I would like to see a one-line abstract in addition to a discussion. I believe there's value in having short and long discussions as a fundamental part of the metadata.

Great idea. Added above.

Maintainer is a different category from author or authors or contributors. A maintainer email is a single point of contact, which may be to a single person or to an organization.

Also a great point, added above.

Linux support status I think may be better defined as platforms, as Linux, Windows, Amazon, as well as phone, watch, mac, tv can be target platforms.

Added above.

Although tags are manually defined, I think it highly beneficial to establish a tag registry for high quality standard tagging.

On this I'd just like to leave PyPI's list of classifiers here, which are used for very structured tagging of Python libraries on PyPI.

There's going to be a fine balance between making this rich and structured and having package authors be able to complete it correctly.

Unlike with the PackageList, where we can validate data as it's imported, this is just a file in someone else's repository. Even if we build a validator tool it's going to be up to package maintainers to run it.

I think we should be quite strict on the SPI side of things and ignore data that is not formatted correctly, but we should also be careful not to make it too hard for people to fill in.
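One way to read "strict but forgiving" is to validate field by field and drop only the values that fail, rather than rejecting the whole file. The sketch below assumes hypothetical field names and rules. It is not a proposed schema.

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Accept only absolute http(s) URLs with a host."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Hypothetical per-field rules; unknown fields have no rule and are ignored.
VALIDATORS = {
    "abstract": lambda v: isinstance(v, str) and "\n" not in v,
    "homepage_url": is_http_url,
    "documentation_url": is_http_url,
    "tags": lambda v: isinstance(v, list) and all(isinstance(t, str) for t in v),
    "supports_linux": lambda v: isinstance(v, bool),
}


def sanitize(raw):
    """Keep only known, well-formed fields and report the names of everything dropped."""
    accepted, ignored = {}, []
    for key, value in raw.items():
        check = VALIDATORS.get(key)
        if check is not None and check(value):
            accepted[key] = value
        else:
            ignored.append(key)
    return accepted, sorted(ignored)
```

The list of ignored fields is the part a validator tool could surface to maintainers, so a single typo costs them one field instead of the whole file.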

That's not to say a structured list of tags is a bad thing, or that we shouldn't do that, just that there's a trade-off with it.

Shouldn't deprecation be able to result in multiple successor packages? If a big package like Vapor falls, you can bet that there are many (potential) successors. Also, wouldn't there be a set of maintainers for big packages?

Shouldn't deprecation be able to result in multiple successor packages? If a big package like Vapor falls, you can bet that there are many (potential) successors.

Maybe, but the more I think about this deprecation URL the less I feel it should be a package URL and would work better as just a URL. If the deprecation is a simple "Please use this new package instead of this one" message, maybe it'd be a package URL, but if it's a more complicated situation or a larger package being deprecated, it's unlikely a list of package URLs would accurately express the reasons and deprecation plan.

Instead, I think this should be a single URL where people should look for more information. Whether that be a new package, a blog post, a page on a site, anything.

Also, wouldn't there be a set of maintainers for big packages?

That's what I originally thought the Authors would be. @erica's point was that the maintainer would be a single point of contact for the project. I'm not sure it's beneficial to have multiple authors and multiple maintainers listed.

I do think this bit of the idea (authors and maintainers) needs more work. I'll think on it.

I do think this bit of the idea (authors and maintainers) needs more work.

The idea of separating these concepts might end up being more about vanity than utility to the person looking for a new package or trying to contact someone about a new package.

Perhaps the index need only show a combined list of “Authors & Maintainers” and perhaps more radically the index should not offer up any form of contact information (instead, only names and profile URLs) — not that there shouldn’t be ways to find out more but that perhaps it should not be too easy to ask a question directly of a maintainer in a private message when the question could be readily answered by visiting the README and/or the community benefits from the question/answer being in a public forum like a GitHub issue.

FWIW, I also don't see much benefit in splitting authors and a maintainer (even if it is an org). The approach I personally like the most from other package managers is just using a list of collaborators which are usually defined as Some name <somemail@example.com> (I forget the name of the format, but basically what E-Mail does as well). This has the added bonus of not caring if a value is for a specific person, a team or an entire org.
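For reference, the `Some name <somemail@example.com>` form is the display-name-plus-address mailbox syntax from the email standards (RFC 5322), and most languages can already parse it. A small sketch using Python's standard library, with `parse_collaborator` being a helper name made up for this example:

```python
from email.utils import parseaddr


def parse_collaborator(entry):
    """Split 'Display Name <address>' into its parts; works for people, teams, or orgs."""
    name, address = parseaddr(entry)
    if not name and not address:
        return None
    return {"name": name or None, "email": address or None}
```

For the example above this gives a name of `Some name` and an email of `somemail@example.com`. A bare address such as `team@example.com` parses with no name, which fits the idea of not caring whether an entry is a person or an organisation.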

I'd like to discuss two more things now this has been open for a while, but on separate threads to keep things organised.

What format should the extra metadata file be? #462
What should the extra metadata file be named? #463

It’s also worth looking at what other package managers/dependency managers do in terms of this.

Generally, they keep it much simpler than we are proposing here. For example:

CocoaPods

Pod::Spec.new do |spec|
  spec.name          = 'Reachability'
  spec.version       = '3.1.0'
  spec.license       = { :type => 'BSD' }
  spec.homepage      = 'https://github.com/tonymillion/Reachability'
  spec.authors       = { 'Tony Million' => 'tonymillion@gmail.com' }
  spec.summary       = 'ARC and GCD Compatible Reachability Class for iOS and OS X.'
  spec.source        = { :git => 'https://github.com/tonymillion/Reachability.git', :tag => 'v3.1.0' }
  spec.module_name   = 'Rich'
  spec.swift_version = '4.0'

  spec.ios.deployment_target  = '9.0'
  spec.osx.deployment_target  = '10.10'

  spec.source_files       = 'Reachability/common/*.swift'
  spec.ios.source_files   = 'Reachability/ios/*.swift', 'Reachability/extensions/*.swift'
  spec.osx.source_files   = 'Reachability/osx/*.swift'

  spec.framework      = 'SystemConfiguration'
  spec.ios.framework  = 'UIKit'
  spec.osx.framework  = 'AppKit'

  spec.dependency 'SomeOtherPod'
end

Ruby

Gem::Specification.new do |s|
  s.name        = 'example'
  s.version     = '0.1.0'
  s.licenses    = ['MIT']
  s.summary     = "This is an example!"
  s.description = "Much longer explanation of the example!"
  s.authors     = ["Ruby Coder"]
  s.email       = 'rubycoder@example.com'
  s.files       = ["lib/example.rb"]
  s.homepage    = 'https://rubygems.org/gems/example'
  s.metadata    = { "source_code_uri" => "https://github.com/example/example" }
end

Python

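import setuptools

# Load the long description used below from the README (assumed to sit next to this file).
with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(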
    name="example-pkg-YOUR-USERNAME-HERE", # Replace with your own username
    version="0.0.1",
    author="Example Author",
    author_email="author@example.com",
    description="A small example package",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/pypa/sampleproject",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
)

JavaScript

Full specification here, but here’s an example file:

{
  "name" : "underscore",
  "description" : "JavaScript's functional programming helper library.",
  "homepage" : "http://documentcloud.github.com/underscore/",
  "keywords" : ["util", "functional", "server", "client", "browser"],
  "author" : "Jeremy Ashkenas <jeremy@documentcloud.org>",
  "contributors" : [],
  "dependencies" : [],
  "repository" : {"type": "git", "url": "git://github.com/documentcloud/underscore.git"},
  "main" : "underscore.js",
  "version" : "1.1.6"
}

There's more interesting discussion happening over at the Swift forums on this subject. Tagging here too just to keep everyone aware of the ongoing conversation.

Closing this as we are no longer going to do a large-scale version of this. We will add fields to .spi.yml as they are needed.

SwiftPackageIndex / SwiftPackageIndex-Server