FreshPorts / freshports

The website part of FreshPorts
http://www.freshports.org/
BSD 2-Clause "Simplified" License

Feature: "Port-Impact" #581

Open tcberner opened 4 months ago

tcberner commented 4 months ago

Moin moin

portmgr had the idea some time ago to try to measure how much impact a given port has.

This could be used to gauge whether a port should be maintained by a group (bus-factor), or give some idea of when to do exp-runs.

A first stab at this could be to count the reverse-dependencies of a given port. That would give a higher importance to devel/cmake than to, say, www/firefox. However, this example also shows that the metric is not enough on its own, as it won't give any importance to leaf-ports like www/firefox.

A suggestion by dvl was to also consider the "watchers" of a given port on freshports -- which could help give some weight to important leaf-ports.

mfg Tobias

dlangille commented 4 months ago

Let's take gmake as the first example:

freshports.dvl=# select id, name, category, element_pathname(element_id) from ports_active where name = 'gmake' and category = 'devel';
 id  | name  | category |    element_pathname     
-----+-------+----------+-------------------------
 239 | gmake | devel    | /ports/head/devel/gmake

OK, that's the right port, let's get totals:

freshports.dvl=# select dependency_type, count(*) from port_dependencies where port_id_dependent_upon = (select id from ports_active where name = 'gmake' and category = 'devel')  group by dependency_type order by dependency_type;
 dependency_type | count 
-----------------+-------
 B               |  8286
 P               |     4
 R               |    50
 T               |    15
(4 rows)
dlangille commented 4 months ago

Based on the above, gmake is:

Overall, it is used by 8355 ports

freshports.dvl=# select count(*) from port_dependencies where port_id_dependent_upon = (select id from ports_active where name = 'gmake' and category = 'devel');
 count 
-------
  8355
(1 row)
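
As a quick cross-check, the per-type counts from the earlier query should add up to the overall figure. This is a minimal sketch in Python; the numbers are copied from the output above, and the helper name `type_shares` is made up for illustration, not part of FreshPorts.

```python
# Per-type dependency counts for devel/gmake, copied from the query output above.
gmake_by_type = {"B": 8286, "P": 4, "R": 50, "T": 15}


def type_shares(counts):
    """Return (total, {type: percentage of total}) for a per-type count mapping."""
    total = sum(counts.values())
    if total == 0:
        return 0, {t: 0.0 for t in counts}
    shares = {t: 100.0 * n / total for t, n in counts.items()}
    return total, shares


total, shares = type_shares(gmake_by_type)
print(total)  # matches the count(*) result above: 8355
for dep_type, pct in sorted(shares.items()):
    print(f"{dep_type}: {pct:.2f}%")
```

Almost all of the total comes from type B, so a metric that weights dependency types differently would mostly be weighting that one type for this port.
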
dlangille commented 4 months ago

The top 20 ports:

freshports.dvl=# select getport(port_id_dependent_upon), count(*) from port_dependencies group by port_id_dependent_upon order by count(*) desc limit 20;
              getport              | count 
-----------------------------------+-------
 /ports/head/lang/python39         | 13758
 /ports/head/devel/ruby-gems       | 11363
 /ports/head/lang/perl5.32         | 11223
 /ports/head/devel/py-setuptools   | 10132
 /ports/head/devel/gmake           |  8355
 /ports/head/devel/pkgconf         |  8086
 /ports/head/devel/gettext-runtime |  6594
 /ports/head/lang/ruby30           |  5676
 /ports/head/x11/libX11            |  5515
 /ports/head/lang/ruby32           |  5179
 /ports/head/devel/ninja           |  4358
 /ports/head/lang/python27         |  4267
 /ports/head/lang/python311        |  4214
 /ports/head/devel/glib20          |  3418
 /ports/head/devel/cmake-core      |  3289
 /ports/head/devel/gettext-tools   |  3242
 /ports/head/devel/autoconf        |  2911
 /ports/head/lang/perl5.36         |  2642
 /ports/head/x11/libXext           |  2530
 /ports/head/x11-toolkits/pango    |  2421
(20 rows)

dlangille commented 4 months ago
[20:07 pg03 dvl ~] % echo 'select getport(port_id_dependent_upon), count(*) from port_dependencies group by port_id_dependent_upon order by count(*) desc' | psql freshports.dvl > popular
[20:07 pg03 dvl ~] % wc -l popular
   26837 popular

The full output is at https://gist.github.com/dlangille/9f95843f5d49d44b670497ee0a0fd81d

WARNING: 3.43M

dlangille commented 4 months ago

Issues:

dlangille commented 4 months ago

This output features active ports only

[20:14 pg03 dvl ~] % echo 'select getport(PD.port_id_dependent_upon), count(*) from port_dependencies PD join ports_active PA on PA.id = PD.port_id_dependent_upon  group by port_id_dependent_upon order by count(*) desc' | psql freshports.dvl > popular 
[20:14 pg03 dvl ~] % wc -l popular
   15733 popular

Output at:

https://gist.github.com/dlangille/a22b87bcb44126e118c4304d185fe1c4

dlangille commented 4 months ago

We can consider the top-20 most watched ports:

freshports.dvl=# select element_pathname(WLE.element_id), count(*) from watch_list_element WLE join ports_active PA on WLE.element_id = PA.element_id group by WLE.element_id order by count(*) desc limit 20;
         element_pathname         | count 
----------------------------------+-------
 /ports/head/devel/gmake          |   738
 /ports/head/converters/libiconv  |   737
 /ports/head/devel/gettext        |   721
 /ports/head/textproc/expat2      |   715
 /ports/head/print/freetype2      |   676
 /ports/head/graphics/png         |   676
 /ports/head/devel/m4             |   674
 /ports/head/archivers/unzip      |   633
 /ports/head/textproc/libxml2     |   576
 /ports/head/devel/pcre           |   573
 /ports/head/graphics/tiff        |   545
 /ports/head/lang/python          |   543
 /ports/head/misc/help2man        |   518
 /ports/head/ftp/wget             |   514
 /ports/head/devel/bison          |   502
 /ports/head/security/nmap        |   494
 /ports/head/security/sudo        |   492
 /ports/head/devel/popt           |   491
 /ports/head/x11-fonts/fontconfig |   487
 /ports/head/mail/postfix         |   466
(20 rows)
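
One way the two lists could be combined is sketched below in Python. This is only an illustration: the formula (each metric divided by its maximum, then a weighted sum), the equal weights, and the name `impact_scores` are all assumptions, not something agreed in this issue. The sample numbers are copied from the two top-20 outputs above. A port that does not appear in one of those excerpts is treated as 0 for that metric, which is an artifact of only having the top 20 here, not a real count.

```python
def impact_scores(dependents, watchers, w_dep=0.5, w_watch=0.5):
    """Rank ports by a weighted sum of normalised dependents and watchers.

    Hypothetical scoring: each metric is scaled to 0..1 by its maximum.
    """
    max_dep = max(dependents.values(), default=0) or 1
    max_watch = max(watchers.values(), default=0) or 1
    scores = {}
    for port in set(dependents) | set(watchers):
        dep_part = dependents.get(port, 0) / max_dep
        watch_part = watchers.get(port, 0) / max_watch
        scores[port] = w_dep * dep_part + w_watch * watch_part
    # Highest score first; ties broken by port name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Reverse-dependency counts, taken from the top-20 dependency output.
dependents = {"lang/python39": 13758, "devel/gmake": 8355, "x11/libX11": 5515}
# Watcher counts, taken from the top-20 watched output.
watchers = {"devel/gmake": 738, "ftp/wget": 514, "security/sudo": 492,
            "mail/postfix": 466}

ranked = impact_scores(dependents, watchers)
for port, score in ranked:
    print(f"{port:20s} {score:.3f}")
```

With these inputs, leaf ports such as ftp/wget and security/sudo get a non-zero score from watchers alone, which is the effect the watcher suggestion was after. The weights would need tuning against real data.
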
dlangille commented 3 months ago

Things to do:

dlangille commented 3 months ago

Here is the new approach. This query takes < 20ms to run.

[11:53 pg03 dvl ~] % echo " with PDC as (     
select PD.port_id_dependent_upon as port_id, count(*) as count
from port_dependencies PD
group by PD.port_id_dependent_upon )

select split_part(EP.pathname, '/ports/head/', 2) as name, P.maintainer, count
  FROM ports P join PDC on P.id = PDC.port_id
               JOIN element_pathname EP ON P.element_id = EP.element_id
                     and EP.pathname like '/ports/head/%'
group by name, maintainer, count
having count > 500
ORDER BY count desc; " | psql freshports.dvl > ports.txt
dlangille commented 3 months ago

What it looks like:

              name              |       maintainer       | count 
--------------------------------+------------------------+-------
 lang/python39                  | python@FreeBSD.org     | 13427
 devel/ruby-gems                | ruby@FreeBSD.org       | 11384
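
For the bus-factor angle, the psql output could be post-processed per maintainer. Below is a rough Python sketch that parses the aligned table format shown above and sums the counts per maintainer. The function names are invented for illustration, and the sample is just the two rows shown here. Summing across ports can count the same dependent port more than once, so the totals are only a rough indicator.

```python
from collections import defaultdict

# The two rows shown above, in psql's default aligned output format.
SAMPLE = """\
              name              |       maintainer       | count
--------------------------------+------------------------+-------
 lang/python39                  | python@FreeBSD.org     | 13427
 devel/ruby-gems                | ruby@FreeBSD.org       | 11384
"""


def parse_psql_table(text):
    """Return (name, maintainer, count) tuples from aligned psql output."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header, the separator line, and any "(N rows)" footer.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        rows.append((parts[0], parts[1], int(parts[2])))
    return rows


def totals_by_maintainer(rows):
    """Sum the dependency counts of all listed ports for each maintainer."""
    totals = defaultdict(int)
    for _name, maintainer, count in rows:
        totals[maintainer] += count
    return dict(totals)


rows = parse_psql_table(SAMPLE)
for maintainer, total in sorted(totals_by_maintainer(rows).items()):
    print(maintainer, total)
```
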
dlangille commented 3 months ago

Uploaded here: https://people.freebsd.org/~dvl/ports-impact.txt

dlangille commented 3 months ago

If this proves useful, it can be automated to update on a regular basis.

grahamperrin commented 3 months ago

… try to measure how much impact

For what it's worth (not to complicate this issue), it might be useful to treat:

– for as long as version 30 will be required to build Signal net-im/signal-desktop, which has some passionate users.


There might be any number of other cases that are not easily measurable in a way that corresponds with reported effects on end users. Signal comes to mind only because I'm aware of things being relatively noisy in and around package infrastructure bug 270565, where (understandably) no more than one version of Electron is built, at this time.

dlangille commented 3 months ago

How do we code that without special casing it? Is there something in the port we can detect?

grahamperrin commented 3 months ago

I'm no expert, but I can't think of anything detectable.

∑ (watch list counts) are low: 2 for electron30, 11 for signal-desktop.

mirror176 commented 2 months ago

If the pkg servers keep tallies of how many downloads each package gets, that would be more useful, but a 'proper' list needs other data or exceptions, as not all ports permit distribution by public pkg repos. If any ports have public download counts for their master_sites, those could be turned into a metric, but that sounds about as difficult as automatically checking which program versions are available from the original source. Total count and relative count would give different but relevant values. They are separate because some downloads are hosted through 3rd parties, unofficial mirrors/sites, p2p, etc., where download counts are not maintained, so each may have relevance.

Another freshports metric that could be gathered would be page views per port. Its value is debatable, as I'm sure I've used ports whose freshports entry I never visited, and sometimes I view a page but never install the port.

Watchlist still seems like a better count, as it means someone took the time to say 'i care about this port'; my main fear is too small a logged-in freshports userbase. Having an automated way to transfer a list of non-automatic packages from a system to freshports would help make that more complete and up to date, but it would further benefit from dependencies getting counted too, to try to follow flavors and build option differences. Similarly, I've thought there could be value in tracking build options, just to get a user count of different port options and the user-bases that form around specific divisions. Obviously, trying to gather any of these metrics has difficulties, as users' security and privacy are impacted when program installation records are tracked and shared.

Originally I was expecting this to be more about build time and resources than userbase. Poudriere build runs log time, but that can vary depending on the hardware it runs on and whether compiler caching helped, and for local builders there are issues with how much RAM there is, what is in RAM vs on disk, what the disk speed is, etc. It would be great if such metrics also came into play, but they would be even more valuable in the ports tree itself, so that a builder like poudriere could set how many of which jobs to run, in what order, and when (preemptively) for downloading, extracting, compiling, packaging, etc., to better utilize resources. These are considerations I thought about when I considered writing a competitor/replacement before poudriere existed, but I never got anywhere with useful code. A 'basic' runtime of the port and a summary of its dependencies' build times may be interesting, but I don't know how useful it will be once differences are considered, specifically CPU+threads, RAM, and cache; it could be attached or laid out similarly to the latest version table.