chadwhitacre / openpath

https://openpath.quest/

Better define success #20

Open chadwhitacre opened 9 months ago

chadwhitacre commented 9 months ago

Coming here from #9 and #16. I did some napkin math years ago that put a company's fair share at $2,000/dev/yr. Last month The Value of Open Source Software came out. It feels so, so fluffy to me. It did get me thinking though, that if there are only a few thousand developers producing all of Open Source, then we should be able to get pretty specific about what their needs are to achieve sustainability:

Open Source sustainability is when any smart, motivated person can produce widely adopted Open Source software and get paid fairly without jumping through hoops.

Let's figure out how to fund the few thousand developers currently out there today. That will be the basis for growing the maintainer community beyond its current state.

The question here is: what does success look like? How much money per year do we need to fund the few thousand devs that are currently maintaining Open Source? What is each company's fair share?

chadwhitacre commented 9 months ago

From internal comms strategy doc:

Proof points that showcase there is an OSS sustainability crisis, as that is the crux of our story. Without showing that there is a tangible problem (not just theories), we won't have a story. E.g.:

  • Decrease in donations?
  • Shrinking OSS projects?
  • Increased security vulnerabilities (the next Log4j)?
  • Impact to businesses bottom line?
  • More complications with building software?

Address w/ this.

chadwhitacre commented 9 months ago

One in four Open Source maintainers burn out.

44% of 58% = 25%

stackedsax commented 9 months ago

One in four Open Source maintainers burn out.

44% of 58% = 25%

So, OSS is a good measure of who can suffer the best? 🗡️

From your last post, I'm not sure I can fully agree with one of your conclusions:

Based on my reading, companies would spend almost nothing more if OSS didn't exist.

Unfortunately, there is no specific data I can point to to back up my feelings :D. However, if we were in a buy-only world, I do think that:

a. prices for software would be much, much higher than they are today
b. companies would spend a lot more on training for specific, proprietary platforms

I bet there would be other ramifications on the price of "Buy/Build" if "Borrow" were taken off the table entirely but I haven't thought through all of them yet. I felt like your conclusions were based on what the cost of software is right now in a world where "Borrow" keeps the price of software lower than it might otherwise be. Just as you point out (correctly) that the authors of that HBS paper weren't considering the "Buy" option in their comparison, I think some more consideration needs to be given to the effect on the price of software when "Borrow" is taken off the table.

Sorry I didn't get to stay for the full discussion earlier today. Thanks for this, though!

chadwhitacre commented 9 months ago

Thanks for weighing in @stackedsax!

I bet there would be other ramifications on the price of "Buy/Build" if "Borrow" were taken off the table entirely

Fair enough, and the considerations you point out are good ones (prices being passed on, training costs). Perhaps a more nuanced approach would increase the number above the $177 million reported in the appendix for a naive goods market approach, though I'm not sure it would approach the $8.8 trillion for the naive labor market approach.

In any case, the thing in the working paper I found most suggestive was the part about how few developers there are who produce the bulk of Open Source software. Rather than engage in speculative thought experiments (which, to be clear, I've also certainly done), I'm intrigued at the idea of determining who the developers actually are and figuring out which of them need how much money to be sustainable. The suggestion in "The Value of Open Source" is that there are only a few thousand of them(!). This seems quite tractable and much more productive.

andrew commented 9 months ago

I'm intrigued at the idea of determining who the developers actually are and figuring out which of them need how much money to be sustainable.

I think I've got a lot of data to help with that question, would be happy to help investigate.

I also did some similar research back in 2018 and found that just 262 people maintained the majority of the whole of the rubygems community! https://youtu.be/hW4wUpoBHr8?feature=shared&t=708

chadwhitacre commented 9 months ago

Yay @andrew! 😄 99.95 to 0.05 is quite a dramatic figure. On the one hand, it's alarming, to think that so many depend on so few. In another light, it can almost be seen as encouraging, because it's a much more tractable problem to fund 262 people (in the few-years-ago Ruby case) vs. an amorphous unknown group that we think would need "trillions of dollars."

How do you think we might best go about investigating this?

The sort of thing I think would be valuable to be able to say is:

  1. There are N,000 developers responsible for producing 99% of Open Source software.
  2. MM% of their labor is voluntary and sponsorable.
  3. Fair wage is $X00,000.
  4. Total needed is $Y00,000,000.
  5. Industry-wide fair share per FTE dev is $Z,000/yr.

Thoughts on how to fill in those blanks? 🤔

(3) and (5) might want geo variation.
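The five blanks above chain together arithmetically. Here is a minimal sketch in Ruby, assuming the total is simply headcount times sponsorable fraction times fair wage, and that the bill is split evenly across paid developers in industry. The method name `fair_share_per_dev` and every input value are placeholders for illustration, not findings:

```ruby
# Hypothetical helper: chains blanks (1)-(5) into a per-developer figure.
# All inputs are placeholders; none of these numbers come from real data.
def fair_share_per_dev(maintainers:, sponsorable_fraction:, fair_wage:, industry_devs:)
  raise ArgumentError, "industry_devs must be positive" unless industry_devs.positive?

  # (4) total needed = (1) maintainers x (2) sponsorable share x (3) fair wage
  total_needed = maintainers * sponsorable_fraction * fair_wage
  # (5) fair share = total needed spread evenly over every FTE developer
  { total_needed: total_needed, per_dev: total_needed / industry_devs.to_f }
end

example = fair_share_per_dev(
  maintainers: 1_000,         # placeholder for N,000
  sponsorable_fraction: 0.5,  # placeholder for MM%
  fair_wage: 100_000,         # placeholder for $X00,000
  industry_devs: 1_000_000    # placeholder count of paid developers
)
puts format("total needed: $%d, per dev: $%.2f/yr", example[:total_needed], example[:per_dev])
```

Geo variation for (3) and (5) could be handled by calling the helper once per region with that region's wage and headcount, then summing the totals.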

andrew commented 9 months ago

Focusing on 1. for now:

What I did with rubygems was total up the downloads across all gems, then find the most-downloaded gems that together made up the majority of those downloads. The long tail of projects that don't get any downloads is very long.

If we take this approach, we can do a breakdown by software ecosystem, because things like download counts aren't really comparable between, say, NPM and CRAN (the R package manager). The downloads of the top 1% of R packages will look negligible compared directly to NPM, but those packages are still very impactful within their community (scientific and statistical software).
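One way to make that concrete is to rank each package by its share of downloads within its own ecosystem rather than by raw count. A minimal sketch, assuming a flat list of hashes with `ecosystem`, `name`, and `downloads` keys; the package names and counts are made up for illustration:

```ruby
# Made-up sample rows; real data would come from a registry index.
sample_packages = [
  { ecosystem: "npm",  name: "big-js-lib",   downloads: 9_000_000 },
  { ecosystem: "npm",  name: "small-js-lib", downloads: 1_000_000 },
  { ecosystem: "cran", name: "stats-pkg",    downloads: 9_000 },
  { ecosystem: "cran", name: "plot-pkg",     downloads: 1_000 }
]

# Returns each package with its share of downloads inside its own ecosystem.
def within_ecosystem_share(packages)
  totals = packages.group_by { |pkg| pkg[:ecosystem] }
                   .transform_values { |rows| rows.sum { |row| row[:downloads] } }
  packages.map do |pkg|
    pkg.merge(share: pkg[:downloads].to_f / totals[pkg[:ecosystem]])
  end
end

shares = within_ecosystem_share(sample_packages)
shares.each do |pkg|
  puts format("%-5s %-13s %5.1f%%", pkg[:ecosystem], pkg[:name], pkg[:share] * 100)
end
```

In this invented sample the CRAN package has a thousandth of the NPM package's raw downloads but the same share of its own ecosystem, which is the point being made above.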

Steps for each ecosystem:

Notes:

I suspect we'd see some relation between the number of top maintainers and the total size of the ecosystem (more for NPM than for CRAN for example).

Not all ecosystems have download counts, but there are other comparable metrics that can be used for most large ecosystems.

I'm focusing primarily on software projects delivered via package managers as that's what I have the most data for now.

Questions/options/further investigations:

Once we have that list and move on to the next points, do we want to try to guesstimate the amount of maintenance work a given project takes given its current activity levels, or just treat them all the same? For example, https://www.npmjs.com/package/left-pad probably doesn't need as much ongoing maintenance work as something like https://github.com/pytorch/pytorch

stackedsax commented 9 months ago

vs. an amorphous unknown group that we think would need "trillions of dollars."

I like the thought that a known quantity of 262 developers will get to slice up a trillion dollars for themselves 😄

chadwhitacre commented 9 months ago

/me considering how to make it 263. 🤔

chadwhitacre commented 9 months ago

It seems like we could use this for "2. MM% of their labor is voluntary and sponsorable." Do the professional and semi-professional maintainers count as voluntary and sponsorable, though? 🤔

[Image: State of FOSS Funding slide, FOSDEM 2024]

(slide source, numbers source)

cc: @karasowles

chadwhitacre commented 9 months ago

Folks let's do this. Let's come up with a methodology and refresh it annually. "Fair share." Definitive, credible, simple to apply, "Here's what your company should be paying and here's how you can pay it." I've updated the issue description along these lines.

@andrew Your notes for (1) seem like a reasonable place to start.

What is the cut-off point for the definition of "most popular"? Is 99% too high?

I think we might need to look at the distribution curve for each ecosystem. Might be closer to 80/20 for some, closer to 99/1 for others. Can we get little sparklines for each eco?
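A sparkline per ecosystem can be drawn straight from download counts. This is a minimal sketch, assuming plain arrays of counts (the sample numbers are invented) and Unicode block characters for the bars; `cumulative_shares` and `sparkline` are hypothetical helpers, not part of ecosyste.ms:

```ruby
# Cumulative share of downloads as packages are added, most downloaded first.
def cumulative_shares(downloads)
  total = downloads.sum.to_f
  return [] if total.zero?

  running = 0
  downloads.sort.reverse.map { |count| (running += count) / total }
end

# Renders values as a one-line bar chart using Unicode block characters.
def sparkline(values)
  bars = %w[▁ ▂ ▃ ▄ ▅ ▆ ▇ █]
  max = values.max.to_f
  return "" if values.empty? || max.zero?

  values.map { |v| bars[((v / max) * (bars.size - 1)).round] }.join
end

# Invented download counts: one heavily skewed ecosystem, one flat one.
skewed = [900, 50, 25, 15, 10]
flat   = [20, 20, 20, 20, 20]

puts "skewed: #{sparkline(cumulative_shares(skewed))}"
puts "flat:   #{sparkline(cumulative_shares(flat))}"
```

A curve that jumps to the top almost immediately is the 99/1 shape; a steady ramp is closer to an even spread, so the cut-off could be chosen per ecosystem by looking at where the curve bends.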

Also what is the definitive list of ecos we are working with? The answer is not jumping out at me on https://ecosyste.ms/.

What do we count as a developer being "responsible for" a project: did they make a majority of the commits, are they the only person who has committed or published in the past couple of years, or some other measure?

What is everyone's thinking on person vs. project? I tend to think funding flow should focus on company to project, and each project should be responsible for flow from there to people. It's the projects that companies need to exist, and people need to be free to move in and out of projects. I know a guy who was making $500+/mo on GHS from the Clojure community even though he had moved on from Clojure. He ended up turning off his Sponsors account. From the POV of any companies depending on those libraries, it would've been better for funding to have been kept in place to incentivize a new maintainer to step up.

Publicly publishing the aggregation of all that personal data may be a bit of a privacy nightmare. Should we anonymize it, and how?

Good call. Hash the email addresses and use that as an ID?
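A minimal sketch of what that could look like in Ruby. Note one swap: instead of a bare hash it uses a keyed HMAC-SHA256, on the assumption that email addresses are guessable and an unkeyed hash could be reversed by hashing candidate addresses. The method name `maintainer_id` and the key value are placeholders:

```ruby
require "openssl"

# Derives a stable pseudonymous ID from an email address.
# Normalizing first means "Dev@Example.com " and "dev@example.com" map to one ID.
def maintainer_id(email, secret_key)
  normalized = email.strip.downcase
  OpenSSL::HMAC.hexdigest("SHA256", secret_key, normalized)
end

# Placeholder key; a real key would be random and kept out of the published data.
key = "replace-with-a-private-random-key"

id_a = maintainer_id("Dev@Example.com ", key)
id_b = maintainer_id("dev@example.com", key)
puts id_a == id_b
```

The same person still gets the same ID in every ecosystem as long as they use the same address, so the IDs stay usable for counting unique maintainers.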

andrew commented 9 months ago

Also what is the definitive list of ecos we are working with? The answer is not jumping out at me on https://ecosyste.ms/.

We can start with the package ecosystems listed here: https://packages.ecosyste.ms/

chadwhitacre commented 9 months ago

There it is! Knew it had to be somewhere. :-)

This seems like quite a comprehensive set of ecosystems to address.

andrew commented 9 months ago

First off, I'll see if I can recreate for rubygems data similar to what I produced back then. In theory, a script for one ecosystem should then work for all of them.

chadwhitacre commented 9 months ago

I published a CTA for people to join us here. Will promote tomorrow, lmk if you have early feedback. 🙏

andrew commented 9 months ago

Quick first pass at the key rubygems maintainers:

Sum downloads for a registry:

r = Registry.find_by_ecosystem 'rubygems'
total_downloads = r.packages.active.sum(:downloads)
=> 155795984192

find 80% of total downloads:

target = (total_downloads * 0.8).round

Start from most downloaded package, sum downloads, keep fetching packages until target is reached:

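running_total = 0
packages = []
# each_instance streams rows one at a time (presumably via the postgresql_cursor gem)
r.packages.active.order('downloads desc nulls last').each_instance do |pkg|
  puts "#{pkg.name} - #{pkg.downloads}"
  running_total += pkg.downloads.to_i # to_i guards against a nil download count
  packages << pkg
  break if running_total > target
end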

For rubygems.org:

952 packages account for 80% of all downloads (0.5%)

24825 packages account for 99% of all downloads (13%)
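The same walk can be written without a database, which makes it easy to rerun for any cut-off. A minimal sketch, assuming a plain array of download counts; `head_size` is a hypothetical helper and the sample counts are invented:

```ruby
# Returns how many of the most-downloaded packages are needed to reach
# `share` (between 0.0 and 1.0) of all downloads.
def head_size(downloads, share)
  target = downloads.sum * share
  running = 0
  downloads.sort.reverse.each_with_index do |count, index|
    running += count
    return index + 1 if running >= target
  end
  downloads.size
end

sample = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1] # invented counts, summing to 1000

[0.8, 0.9, 0.95, 0.99].each do |share|
  puts "#{(share * 100).round}% of downloads: top #{head_size(sample, share)} of #{sample.size} packages"
end
```

Dividing the result by the total package count gives the percentage in brackets reported above.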

unique maintainers:

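maintainers = Set.new
packages.each do |pkg|
  puts "#{pkg.name} - #{pkg.maintainers.length}"
  # Set#merge adds every maintainer and ignores ones already seen
  maintainers.merge(pkg.maintainers)
end; nil # trailing nil stops the console echoing the whole package list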

For 80% of downloads:

920 unique maintainers of 952 packages (1.4% of all maintainers)

For 99% of downloads:

16098 unique maintainers of 24825 packages (25% of all maintainers)


Generate a CSV of downloads and maintainers for the 10,000 most-downloaded Ruby gems:

require 'csv'

csv = CSV.generate do |rows|
  rows << %w[name downloads maintainers]

  # 'nulls last' keeps packages without a download count out of the top 10,000
  r.packages.active.order('downloads desc nulls last').limit(10000).each_instance do |package|
    rows << [package.name, package.downloads, package.maintainers.map(&:uuid).join(',')]
  end
end

csv: https://gist.github.com/andrew/815014222ccf1825b37defc004454446

Downloads chart of csv:

[Image: screenshot of the downloads chart, 2024-02-06]

Notes:

andrew commented 9 months ago

Other ecosystems in https://packages.ecosyste.ms for which I've got both download and maintainer data, and so can run the same analysis:

For other ecosystems we can get some form of maintainer data from the source repositories and look at using other popularity metrics (dependents, stars, forks, docker downloads etc.).

andrew commented 9 months ago

Very back-of-the-napkin maths: extrapolating the rubygems numbers to all the other ecosystems I have maintainer data for gives around 7,100 maintainers of the top 1% of open source packages.

jonathan-s commented 9 months ago

Good analysis @andrew. To get a fairer picture I think it would be important to also look at commit data. Another approach, albeit frozen in time, would be to look at the repositories that were archived by GitHub.

https://archiveprogram.github.com/

The full list of those repos can be found here > https://archiveprogram.github.com/assets/img/archive-repos.txt

andrew commented 9 months ago

I've also set up a daily cron to sum up the total downloads for each registry, visible on the homepage: https://packages.ecosyste.ms and will be available in the API shortly too: https://packages.ecosyste.ms/api/v1/registries/

With this it should be possible to generate most of this analysis just using the packages api in future.

[Image: screenshot of per-registry download totals, 2024-02-06]

andrew commented 9 months ago

Good analysis @andrew. To get a fairer picture I think it would be important to also look at commit data. Another approach, albeit frozen in time, would be to look at the repositories that were archived by GitHub.

I'm also indexing commit data from github, gitlab, codeberg etc over here when we're ready for it: https://commits.ecosyste.ms

Not all packages have a strong reference to a source repository, so we'll need to fill in the gaps in certain places.

andrew commented 9 months ago

The full list of those repos can be found here > https://archiveprogram.github.com/assets/img/archive-repos.txt

I'd not seen this before, somehow 4 projects I created are in there?!

jonathan-s commented 9 months ago

I'd not seen this before, somehow 4 projects I created are in there?!

I guess a congrats is in order ;). You are forever memorialized into the future (or at least for 1,000 years).

ljharb commented 9 months ago

lol any chance being on that list might be enough evidence to get your own wikipedia page? :-p

andrew commented 8 months ago

A quick summary of ecosystems that I have quick access to download and maintainer data for:

| Ecosystem | Total packages | Packages in top 80% of downloads (% of total) | Maintainers | Downloads |
|---|---:|---:|---:|---:|
| pypi | 497837 | 441 (0.09%) | 757 | 23264637101 |
| npm | 2330098 | 2093 (0.09%) | 1216 | 177289183672 |
| rubygems | 177926 | 951 (0.53%) | 920 | 124688544637 |
| clojars | 20097 | 277 (1.38%) | 108 | 1561286976 |
| nuget | 610550 | 366 (0.06%) | 56 | 411233896188 |
| packagist | 384111 | 538 (0.14%) | 279 | 80747931494 |
| cran | 19643 | 332 (1.69%) | 159 | 120761763 |
| hex | 15474 | 143 (0.92%) | 116 | 8871087397 |
| cargo | 136246 | 743 (0.55%) | 380 | 42636949569 |
| Total | 4191982 | 5884 | 3991 | 870414278797 |

Note: excluding hackage and bioconductor for now as they are much smaller and don't have the same kind of 80/20 skew.

So around 4,000 developers are responsible for packages that make up 80% of downloads for those 9 registries of 4.2m packages, totalling 870 billion downloads, the 0.1% if you will!
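For the record, "the 0.1%" is a fair rounding. A quick check in Ruby using only the totals from the 80% table; the variable names are just for illustration:

```ruby
# Totals copied from the 80% table in this comment.
total_packages = 4_191_982
head_packages  = 5_884
maintainers    = 3_991

package_share = head_packages.to_f / total_packages * 100
packages_per_maintainer = head_packages.to_f / maintainers

puts format("%.2f%% of packages, %.2f packages per maintainer", package_share, packages_per_maintainer)
```

That works out to roughly 0.14% of packages, with each of those maintainers looking after about one and a half of them on average.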

jonathan-s commented 8 months ago

I'm curious how things change for the 90th percentile and for the 95th percentile.

andrew commented 8 months ago

@jonathan-s 90%:

| Ecosystem | Total packages | Packages in top 90% of downloads (% of total) | Maintainers | Downloads |
|---|---:|---:|---:|---:|
| pypi | 497889 | 919 (0.18%) | 1225 | 26166150952 |
| npm | 2336437 | 3877 (0.17%) | 2133 | 199433809146 |
| rubygems | 177927 | 1764 (0.99%) | 1710 | 140248934988 |
| clojars | 20098 | 498 (2.48%) | 181 | 1755824250 |
| nuget | 610582 | 794 (0.13%) | 175 | 462438386704 |
| packagist | 384127 | 1446 (0.38%) | 753 | 90830030532 |
| cran | 19647 | 860 (4.38%) | 435 | 135821394 |
| hex | 15474 | 213 (1.38%) | 161 | 9975933693 |
| cargo | 136274 | 1497 (1.1%) | 659 | 47959387259 |
| Total | 4198455 | 11868 | 7432 | 978944278918 |

andrew commented 8 months ago

and 95%:

| Ecosystem | Total packages | Packages in top 95% of downloads (% of total) | Maintainers | Downloads |
|---|---:|---:|---:|---:|
| pypi | 497889 | 1633 (0.33%) | 1994 | 27620451752 |
| npm | 2336497 | 6821 (0.29%) | 4369 | 210513282945 |
| rubygems | 177927 | 3231 (1.82%) | 2855 | 148033484902 |
| clojars | 20098 | 843 (4.19%) | 303 | 1853084472 |
| nuget | 610582 | 1792 (0.29%) | 492 | 488117527128 |
| packagist | 384127 | 3587 (0.93%) | 1623 | 95875170802 |
| cran | 19647 | 2352 (11.97%) | 1211 | 143358753 |
| hex | 15474 | 354 (2.29%) | 272 | 10530060633 |
| cargo | 136274 | 2698 (1.98%) | 1114 | 50620295454 |
| Total | 4198515 | 23311 | 14233 | 1033306716841 |

Note: Over 1 trillion downloads!

chadwhitacre commented 8 months ago

Gotta ask ... 99th percentile, for completeness?

andrew commented 8 months ago

Gotta ask ... 99th percentile, for completeness?

Only because it's you @chadwhitacre

| Ecosystem | Total packages | Packages in top 99% of downloads (% of total) | Maintainers | Downloads |
|---|---:|---:|---:|---:|
| pypi | 497890 | 4348 (0.87%) | 4650 | 28782331165 |
| npm | 2337014 | 22047 (0.94%) | 12702 | 219376938349 |
| rubygems | 177927 | 22995 (12.92%) | 15217 | 154266510490 |
| clojars | 20098 | 2412 (12.0%) | 803 | 1931011124 |
| nuget | 610585 | 15556 (2.55%) | 4369 | 508657978459 |
| packagist | 384127 | 14290 (3.72%) | 5960 | 99911408288 |
| hex | 15474 | 1255 (8.11%) | 824 | 10972912941 |
| cargo | 136276 | 7379 (5.41%) | 2658 | 52751601635 |
| Total | 4179391 | 90282 | 47183 | 1076650692451 |

Note: something went wrong with cran, a stray null that I didn't handle, so just left that row out for now (it's late in the UK)

One extra note: There may be some maintainers that are in the top 1% in multiple ecosystems, but they are currently treated as different people in these calculations (if there are, they are incredibly productive people!)
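If the registries shared an identifier for people, the double counting could be measured directly. A minimal sketch, assuming such an identifier exists (for example a hashed email); the ecosystems' contents and IDs below are invented:

```ruby
require "set"

# Invented sample: ecosystem => identifiers of its top maintainers.
top_maintainers = {
  "npm"      => %w[id-1 id-2 id-3],
  "rubygems" => %w[id-3 id-4],
  "pypi"     => %w[id-2 id-5]
}

# Counts each person once, no matter how many ecosystems they appear in.
def unique_people(by_ecosystem)
  by_ecosystem.values.flatten.to_set
end

# Finds identifiers that appear in more than one ecosystem.
def cross_ecosystem(by_ecosystem)
  by_ecosystem.values.flat_map(&:uniq).tally.select { |_id, n| n > 1 }.keys
end

naive_total = top_maintainers.values.sum(&:size)
puts "naive: #{naive_total}, unique: #{unique_people(top_maintainers).size}, " \
     "in several ecosystems: #{cross_ecosystem(top_maintainers).join(', ')}"
```

The gap between the naive and unique counts is how much the per-ecosystem totals overstate the number of distinct people.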

jonathan-s commented 8 months ago

If you were then to also include contributors who each contributed at least 1% of all commits, the ballpark is that the number of people would increase by about an order of magnitude. (Checking sentry and django as two examples: for both projects, ~20 people have each contributed at least 1% of all commits.)
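The 1% rule is simple to apply once per-author commit counts are available. A minimal sketch, assuming a hash of author to commit count for one project; `significant_contributors` is a hypothetical helper and the counts are invented, not taken from sentry or django:

```ruby
# Returns authors whose commits make up at least `min_share` of all commits.
def significant_contributors(commit_counts, min_share: 0.01)
  total = commit_counts.values.sum
  return [] if total.zero?

  commit_counts.select { |_author, count| count.to_f / total >= min_share }.keys
end

# Invented commit counts for a single project (1000 commits in total).
commits = { "alice" => 600, "bob" => 300, "carol" => 95, "dave" => 4, "erin" => 1 }

puts significant_contributors(commits).inspect
puts significant_contributors(commits, min_share: 0.25).inspect
```

Raising or lowering `min_share` is one way to test how sensitive the headcount is to the definition of "responsible for".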

gordonbrander commented 8 months ago

(Cross-posting thoughts from DMs at @chadwhitacre's request).

Funding 5000 is a really great and approachable problem definition. A few follow-on thoughts, offered in a yes-and spirit.

From a narrative perspective, I suspect "sustainability/fair share" frame might encourage zero-sum thinking. "Our slice of the pie". From my perspective, the pie isn't fixed. Rather, we're looking at a broken feedback loop and a resulting ecological desert.

There's probably some upper limit to this positive ecological feedback, defined by the rate at which a market can absorb new products, but I doubt we've reached that yet. Rather, we seem to be limited at the level of fundamental research / new low-cost enablers due to lack of non-speculative funding. Low-cost enablers are what open source is all about. Let's terraform the ecological desert :)

One more thought: the graph of active maintainers clearly follows a power law distribution. This is no surprise! Approximate power laws are inevitable in all evolving networks. So, there will be a very few startups that make exponentially more money, a very few maintainers that make exponentially more open source impact, etc.

However, the exponent matters a lot! Even small changes in exponent make enormous differences in qualitative behaviour.

Another way we might frame the problem is to "fatten the curve". Change the exponent so that the number of high-performing open source contributors is not 5k but 50k, 500k...

This is what expanding the carrying capacity of the open source ecosystem could do. Change the exponent.
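To get a feel for how much the exponent matters, here is a toy model in Ruby. It assumes a pure Zipf-like distribution where the contributor at rank k has weight k to the power of minus the exponent, which is a simplification for illustration and not a fit to any real ecosystem data; `head_count` is a hypothetical helper:

```ruby
# Number of top-ranked contributors needed to cover `share` of total impact
# when the contributor at rank k has weight k**(-exponent).
def head_count(population, exponent, share: 0.8)
  weights = (1..population).map { |rank| rank.to_f**(-exponent) }
  target = weights.sum * share
  running = 0.0
  weights.each_with_index do |weight, index|
    running += weight
    return index + 1 if running >= target
  end
  population
end

[2.0, 1.0, 0.5].each do |exponent|
  puts "exponent #{exponent}: top #{head_count(10_000, exponent)} of 10000 cover 80%"
end
```

In this model, lowering the exponent moves the 80% line from a handful of people to a large fraction of the population, which is the "fatten the curve" effect described above.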

anehzat commented 8 months ago

/me considering how to make it 263. 🤔

It's already happening & picking up momentum :)

chadwhitacre commented 8 months ago

Thanks for jumping in, @gordonbrander! :-)

"sustainability/fair share" frame might encourage zero-sum thinking.

I'm thinking of this in terms of designing institution(s) to manage Open Source as a common pool resource, lots more to say about that under #14.

As to the exponent: if I'm reading you right, I would call this a problem of income inequality (cf.), and bucket it with:

Accomplishing this will accelerate adjacent efforts such as improving security and diversity.

andrew commented 8 months ago

and bucket it with:

@chadwhitacre you've got a localhost link there btw ;)

chadwhitacre commented 8 months ago

lolsob, fixed

coni2k commented 8 months ago

@gordonbrander 👇💯

From a narrative perspective, I suspect "sustainability/fair share" frame might encourage zero-sum thinking. "Our slice of the pie". From my perspective, the pie isn't fixed. Rather, we're looking at a broken feedback loop and a resulting ecological desert.

  • Open source lowers the floor for startups to get to product-market fit, generating money
  • However, very little of that money makes it back to the funding of open source
  • Closing the loop means more open source funding means more open source developers surviving means more open source means lower floor means more startups means more funding...

There's probably some upper limit to this positive ecological feedback, defined by the rate at which a market can absorb new products, but I doubt we've reached that yet.

In other words, asking "How much money do we need to cover the expenses of existing open source maintainers?" is a little misleading. As Gordon says, if we start allocating more resources to open technologies, more companies and individuals will begin producing open technologies (I'm one of the people waiting in the queue; today, I can't become an open source maintainer because I already know there is no money for me).

For example, here is an alternative question: How much of our technologies/software could become open source and still provide some utility? What can be the maximum share of open source in the pie? My answer: (Almost) Every technology/software ~ the entire pie.

Since I learned about open source software, my dream has always been to live in a free and open source world where anyone can use, modify, and contribute to any existing technology or software. That world would be an exciting world to live in, boosting tech innovation, closing the gap between the nations, and potentially addressing crucial social issues.

We are expected to spend 1 trillion dollars on software in 2024, while 99% of the money will go to companies producing closed tech.

So, instead of focusing on existing open source initiatives, here is a challenging question: what must we do to align the incentives for companies to produce their tech as open source? In other words, how can we keep that 1 trillion dollars on the table but get open tech in return for this deal?

There are a few reasons why it is crucial to ask this question:

Overall, I love having these conversations. At some point, should we consider having a dedicated online call to get into details?

Edit: Visualization always helps 📊

[Image]

chadwhitacre commented 8 months ago

what must we do to align the incentives for companies to produce their tech as open source?

I see that as what we are working with FSL / DOSP / Software Commons. This aligns with OSI's bylaws:

persuade organizations and software authors to distribute source software freely they otherwise would not distribute

In other words, asking "How much money do we need to cover the expenses of existing open source maintainers?" is a little misleading. As Gordon says, if we start allocating more resources to open technologies, more companies and individuals will begin producing open technologies

I think what we want to come up with is a methodology for calculating a "fair share" (or whatever framing we land on) that is repeatable, both so that it can be independently reproduced, but also so that we can repeat it annually to adjust the corporate fair share over time. If we craft the methodology properly, it should account for changes to the balance between open and proprietary over time.

(I'm one of the people waiting in the queue; today, I can't become an open source maintainer because I already know there is no money for me).

We will get you there some day, @coni2k! :-)

coni2k commented 8 months ago

We will get you there some day, @coni2k! :-)

Indeed! 💯 And having these conversations is a crucial part of the process.

we want to come up with is a methodology for calculating a "fair share"

I will follow the "fair share" part of the conversation. On that topic, I only want to point out that we should consider the possibility of the "OSS production can/should be profitable" scenario and not fix ourselves to only cover the base cost/salaries. We might see "profitability" as the next step in the conversation, but it may also change how we calculate the initial "fair share."

I see that as what we are working with FSL / DOSP / Software Commons.

Allow me to improve my "success" definition then: can we turn the entire pie open source (as previously mentioned, not overnight but over a period):

Put differently, can we have a state that allows tech companies to maximize their profits and software/tech freedom simultaneously? Today, these two aspects contradict each other, not allowing us to reach "maximum utility."

From this perspective:

It might be handy to have a separate discussion on the importance of maximizing software/tech freedom at a macro level, on the startup ecosystem, the pace of innovation, competition, technological independence, etc. Briefly, every friction we add to these transactions has a ripple effect on all these aspects. Hence, having a permissionless digital economy would be my ultimate state.

And, no doubt, the inconvenience/limitation the FSL adds on the consumer side is insignificant compared to a fully proprietary product. Still, FSL is more of an answer for existing open source initiatives that want to limit free-riding than a solid incentive for existing proprietary software companies.


I will derail the conversation from defining "success" and expand on how I see the problem and the solution in creating incentives for the existing tech companies to start producing open tech.

Scenario 1: Here is a private goods scenario with proprietary software:

Scenario 2: Here is the open source version without a social contract:

Both scenarios cannot maximize the utility:

The question is, can we find an alternative scenario to address these two aspects and maximize the utility (maximizing revenue and software freedom simultaneously)?

In other words, since open source software is categorically a public good rather than a private good, can we have a new social contract around it and establish a "public goods transaction" by imitating the "private goods transaction"?

Scenario 3: Here is an alternative scenario with "public goods transaction":

There are significant challenges with this new scenario. I will only mention some, but I will not get into details for the sake of simplicity:

However, with this scenario:

In short, the way I see it, if we want to see a transition towards it, we need to start treating open tech production like a business activity, establish an economic model (public goods transaction) around these new goods, and ensure companies won't lose revenue by producing open tech. Other solutions will be either some form of charity or introduce some form of limitation.

My post became longer than expected, but I hope not an exhaustive one. Thank you for the discussion ✌