ecosyste-ms / roadmap

Planning and roadmap for future Ecosyste.ms development
GNU Affero General Public License v3.0

Map repos to projects, communities, and foundations #19

Open ShaneCurcuru opened 2 days ago

ShaneCurcuru commented 2 days ago

Individual repos are often part of larger projects, but that relationship is not always obvious (nor necessarily discoverable automatically). Even a rough, incomplete mapping of the form repo1, repo2 -> part of one recognizable software project could be useful.

Similarly, how projects/repos change over time is related to the project's status: is it a couple of JS developers hacking? Is a project part of a consistent and governed community or GH Organization? Is a project part of a FOSS Foundation with documented governance, IP policies, and the like?

Adding metadata structures that can be used to create these mappings is the first step. Some projects are then easily associated with GH Organizations, or with a handful of FOSS Foundation URLs that the foundations themselves declare as housing their projects.

I'd love to provide organizational input to this kind of mapping from the Foundations metadata directory. For example, what field(s) should be listed under a FOSS Foundation Schema entry in the directory, to provide attestations of affiliations with repos, projects, or root paths?

Additionally, is there enough research interest - and implementation interest - to map both repo <-> project relationships, and also map repo|project <-> foundation relationships? For example, the ASF as a foundation might have data like this, explicitly attesting that these items are governed at the ASF:

ghorgs:
- https://github.com/apache/

repoRoots:
- https://github.com/apache/
- https://svn.apache.org/repos/asf/

projects:
- https://whimsy.apache.org/public/committee-info.json # ['committees']

# Individual projects may also have repo attestations that could be scraped

For reference: https://github.com/Punderthings/fossfoundation/blob/main/_data/foundations-schema.json
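
As a rough illustration of how attestations like the ones above might be consumed, here is a minimal Python sketch that matches a repo URL against a foundation's declared `repoRoots` by prefix. The `ATTESTATIONS` dictionary, the `asf` key, and the function names are assumptions made for this example; they are not part of the foundations schema or any Ecosyste.ms API.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory copy of self-attested data, shaped like the example above.
ATTESTATIONS: Dict[str, Dict[str, List[str]]] = {
    "asf": {
        "ghorgs": ["https://github.com/apache/"],
        "repoRoots": [
            "https://github.com/apache/",
            "https://svn.apache.org/repos/asf/",
        ],
    },
}


def normalize(url: str) -> str:
    """Lowercase and add a trailing slash so prefix checks respect path boundaries.

    Lowercasing is a simplification: some hosts (e.g. SVN paths) are case-sensitive.
    """
    url = url.strip().lower()
    return url if url.endswith("/") else url + "/"


def foundation_for_repo(
    repo_url: str,
    attestations: Dict[str, Dict[str, List[str]]] = ATTESTATIONS,
) -> Optional[str]:
    """Return the first foundation whose declared repoRoots contain repo_url, else None."""
    repo = normalize(repo_url)
    for foundation, record in attestations.items():
        for root in record.get("repoRoots", []):
            if repo.startswith(normalize(root)):
                return foundation
    return None
```

Calling `foundation_for_repo("https://github.com/apache/kafka")` would return `"asf"` with the sample data, while a look-alike such as `https://github.com/apache-fake/tool` would not match because of the trailing-slash normalization.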

andrew commented 2 days ago

Definitely interested in this, I've been pondering a "projects" service that acts as a higher level than just repos and packages.

With lots of potentially different ways of mapping and connecting projects, one key element is that it should be automatable, pulling from multiple sources, so that the work of mapping the landscape doesn't fall on the shoulders of a few people doing lots of manual work to keep on top of things.

The fossfoundations dataset is an excellent source for that. Another area I've been working closely with recently is grouping projects based on where they say their funding sources are, such as Open Collective or GitHub Sponsors, which could be another source of grouping data. Wikidata may be another interesting source we could pull from?
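
To make the funding-based grouping idea concrete, here is a small Python sketch. It assumes each repo's funding declaration has already been parsed into a dictionary using the `github` and `open_collective` keys from GitHub's `FUNDING.yml` format; the function name and the sample repo names are hypothetical, and this is not necessarily how the grouping is done in practice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

Funding = Dict[str, Union[str, List[str], None]]


def group_by_funding(
    funding_by_repo: Dict[str, Funding],
) -> Dict[Tuple[str, str], List[str]]:
    """Group repos that declare the same (platform, account) funding target.

    Only groups with more than one repo are returned, since a single repo
    tells us nothing about a shared project.
    """
    groups = defaultdict(set)
    for repo, funding in funding_by_repo.items():
        for platform, accounts in funding.items():
            # FUNDING.yml allows a single string or a list for some keys.
            if isinstance(accounts, str):
                accounts = [accounts]
            for account in accounts or []:
                groups[(platform, account)].add(repo)
    return {key: sorted(repos) for key, repos in groups.items() if len(repos) > 1}


# Hypothetical sample data: two repos share one Open Collective account.
sample = {
    "org/a": {"github": ["alice"], "open_collective": "proj"},
    "org/b": {"open_collective": "proj"},
    "org/c": {"github": "bob"},
}
shared = group_by_funding(sample)
```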

ShaneCurcuru commented 2 days ago

Agreed all around. Thinking about all the researchers I've been talking to lately, it feels like we need a good-enough model for the mapping to get started, and then we can start 1) quantifying the ways entities can self-declare relationships or their modeling terms, and 2) looking for all the easily discoverable implicit relationships to see how far an initial scan can get. I think we'd cover a reasonable percentage of cases by asking FOSS Foundations to do some really simple self-attestations of ownership. This could also be a really easy thing to ask GitHub to add to its default files/repo settings, if it's clear it will be used (i.e. for things above the Organization concept).

One key point: while data will never be magically complete, it feels like we'd need to annotate any mappings we auto-discover that aren't explicitly stated. For example:

How do we annotate those mappings differently? In some cases they will absolutely have different meanings (for research/comparison purposes).
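
One possible way to keep self-declared and auto-discovered mappings distinguishable is to attach a provenance field to every mapping record. The following Python sketch is only an illustration; the `Provenance` values, the `Mapping` fields, and the `prefer_declared` helper are all assumptions, not an agreed model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Provenance(Enum):
    SELF_DECLARED = "self_declared"  # stated explicitly by the entity itself
    INFERRED = "inferred"            # auto-discovered, not explicitly stated


@dataclass(frozen=True)
class Mapping:
    subject: str            # e.g. a repo URL
    relation: str           # e.g. "part_of" or "governed_by"
    target: str             # e.g. a project or foundation identifier
    provenance: Provenance
    source: str             # where the claim came from


def prefer_declared(mappings: Iterable[Mapping]) -> List[Mapping]:
    """Keep one mapping per (subject, relation, target), preferring self-declared ones."""
    best = {}
    for m in mappings:
        key = (m.subject, m.relation, m.target)
        current = best.get(key)
        if current is None or (
            current.provenance is Provenance.INFERRED
            and m.provenance is Provenance.SELF_DECLARED
        ):
            best[key] = m
    return list(best.values())
```

Researchers could then filter on `provenance` when the two kinds of mapping need to be compared or treated differently.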

Separately: what are the right objects to map above the repo scale? In particular, what's the right set of objects that are easy enough for practitioners to care about, but that will also give the kinds of data that large-scale researchers are expecting?

Product or Project is a technical term for recognizable software products that directly include one or more repos. This is important (if imprecise) because it's how practitioners think, and it is also useful for research about how software is built and used, and about its lifecycle.

Organization should be an obvious entity; FOSS Foundations and software corporations cover a lot of the scope and have some simple auto-discovery available.

Funder is not the same as organization. They might be the same in some cases (NumFOCUS may pay for development of some of their projects directly, that are also governed there), but often will be different.

Community is hard to define, but I feel it is important. This is for all the projects that have a clear set of maintainers, maybe a website, and a clear product being built, but no legal entity behind the project. Governance and other policies are thus less likely to be consistent than in organization-owned projects.
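
The four object types above could be sketched as a small data model. This Python example is a hypothetical starting point only: the class and field names are assumptions, and the sample names are placeholders rather than real data. It keeps governance and funding as separate links, since a funder may or may not be the governing organization.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class GovernorKind(Enum):
    ORGANIZATION = "organization"  # legal entity, e.g. a FOSS foundation or company
    COMMUNITY = "community"        # clear maintainers and product, no legal entity


@dataclass
class Governor:
    name: str
    kind: GovernorKind


@dataclass
class Project:
    name: str
    repos: List[str] = field(default_factory=list)
    governed_by: Optional[Governor] = None
    funded_by: List[str] = field(default_factory=list)  # funder names


def funder_is_governor(project: Project) -> bool:
    """True when the governing body is also listed as a funder."""
    return project.governed_by is not None and project.governed_by.name in project.funded_by


# Placeholder data: one project spanning two repos, governed and funded by the same body.
example = Project(
    name="example-project",
    repos=["https://example.org/repo1", "https://example.org/repo2"],
    governed_by=Governor("Example Foundation", GovernorKind.ORGANIZATION),
    funded_by=["Example Foundation", "Example Sponsor"],
)
```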

Thanks! What do you see as the next steps for sketching this out, and where's best to collaborate? I ask because there are a couple of research work efforts that may be related here that might want to help.

andrew commented 2 days ago

Another couple of thoughts off the top of my head: copyright notices could potentially help connect repos to orgs.
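
As a rough sketch of the copyright-notice idea, the following Python snippet pulls holder names out of license or header text with a regular expression. The pattern and the `extract_holders` name are assumptions for illustration; real notices vary widely, so this would only be a first pass and its results should be treated as inferred rather than self-declared.

```python
import re
from typing import List

# Matches lines such as "Copyright (c) 2020 Example Foundation" or
# "Copyright 2015-2023, Example Corp". Notices without a year are skipped.
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)|©)?\s*(?:\d{4}(?:\s*-\s*\d{4})?,?\s*)+(?P<holder>[^\n]+)",
    re.IGNORECASE,
)


def extract_holders(text: str) -> List[str]:
    """Return the copyright holder strings found in text, in order of appearance."""
    holders = []
    for match in COPYRIGHT_RE.finditer(text):
        holder = match.group("holder").strip().rstrip(".")
        if holder:
            holders.append(holder)
    return holders
```

The extracted holder strings could then be compared against known organization names to suggest repo-to-org links.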

For communities, one source I've been using a lot recently is the featured topics on GitHub. Whilst not great for really small projects, it does have some nice metadata to go along with it. I've mapped them all out here: https://awesome.ecosyste.ms/topics, and there's raw JSON from GitHub over here: https://explore-feed.github.com/feed.json

Another question is scope and size: how far down the long tail of projects do we go? Do you have a feel for what size or popularity of projects we should consider, or should we start with just the biggest and work our way down? For context, currently in my repos database there are 1,356,751 organizations and 21,382,958 users with public repositories.

Perhaps we could arrange a call to sync up and plan next steps? My email is andrew@ecosyste.ms if you want to send an invite.

bzg commented 1 day ago

I'm very interested in this too, for https://code.gouv.fr/sources and, hopefully, other similar efforts.

See comptes-organismes-publics.yml where we list "hosts" and "owners" from the French public sector.

We'd like to group owners and repos together in a more project-based way.