cranhaven / cranhaven.r-universe.dev

WARNING: This is a proof-of-concept idea - it might be removed again
https://cranhaven.r-universe.dev
MIT License

STATS: ~36% of archived packages are unarchived later (2022 study) #5

Open HenrikBengtsson opened 4 months ago

HenrikBengtsson commented 4 months ago

In the 'Reasons why packages are archived on CRAN' blog post on 2022-05-10, @llrs shows how to get metadata on CRAN package events, including archiving and unarchiving, directly from CRAN. Specifically, this data is available in https://cran.r-project.org/src/contrib/PACKAGES.in.

One of the results of this 2022 study was:

"This suggests that once a package is archived maintainers do not make the effort to put it back on CRAN except on very few cases were there are multiple attempts. To check we can see the current available packages and see how many of those are still present on CRAN:

CRAN  Packages  Proportion
no        3869         64%
yes       2183         36%

Many packages are currently on CRAN despite their past archivation but close to 64% are currently not on CRAN." Put the other way around, 36% of archived packages had returned to CRAN at the time of the study.

In a Bioconductor Slack thread on 2024-03-05 (https://community-bioc.slack.com/archives/CLF37V6C8/p1709643793615939?thread_ts=1709600683.154139&cid=CLF37V6C8), @llrs added:

"Yes, 36% of all packages archived returned to CRAN (when I created the post). As time goes this % will lower, and also it could mean that a package was archived, then returned and then was archived for good. The time they were archived could be calculated comparing the archive and current dates and the date when they were archived. This is relatively trivial to do and could provide some estimation for CRANhaven."

It would be interesting to get the raw data for how long "returning" packages are archived. This information should be possible to retrieve from https://cran.r-project.org/src/contrib/PACKAGES.in because its entries carry information on the type of event and when it took place. Two examples are:

Package: jlmerclusterperm
X-CRAN-History: Archived on 2024-02-29 for policy violation.
  .
  Does not clean up use of cache.
  Unarchived on 2024-03-04.

and

Package: BFS
X-CRAN-History: Archived on 2022-06-14 as check problems were not corrected in time.
  Unarchived on 2022-09-07.
  Archived on 2024-01-24 as requires archived package 'pxweb'.
  Unarchived on 2024-02-02.
  Archived on 2024-02-17 for policy violation.
  .
  On Internet access (429 error).
  Unarchived on 2024-02-24.

With this raw data, we can estimate the distribution of how long packages stay off CRAN before returning.
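A minimal sketch of how those durations could be pulled out of a single X-CRAN-History value, using the BFS entry above as input. The function names `parse_events` and `days_off_cran` are made up for illustration, and the pairing rule (each "Archived" is matched with the next "Unarchived"; an unmatched "Archived" is dropped) is an assumption, not something the file format guarantees.

```python
import re
from datetime import date

# Matches event lines such as "Archived on 2024-02-17 for policy violation."
EVENT_RE = re.compile(r"^(Archived|Unarchived) on (\d{4}-\d{2}-\d{2})")


def parse_events(history):
    """Return (event, date) pairs from one X-CRAN-History field value."""
    events = []
    for line in history.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:  # continuation lines like "." or free-text details are skipped
            events.append((m.group(1), date.fromisoformat(m.group(2))))
    return events


def days_off_cran(events):
    """Pair each 'Archived' with the next 'Unarchived'; return gaps in days."""
    gaps = []
    archived_on = None
    for kind, when in events:
        if kind == "Archived":
            archived_on = when  # assumption: a repeated 'Archived' overrides the earlier one
        elif archived_on is not None:
            gaps.append((when - archived_on).days)
            archived_on = None
    return gaps


# The BFS entry quoted above, without the "X-CRAN-History:" field name.
BFS_HISTORY = """Archived on 2022-06-14 as check problems were not corrected in time.
  Unarchived on 2022-09-07.
  Archived on 2024-01-24 as requires archived package 'pxweb'.
  Unarchived on 2024-02-02.
  Archived on 2024-02-17 for policy violation.
  .
  On Internet access (429 error).
  Unarchived on 2024-02-24."""

print(days_off_cran(parse_events(BFS_HISTORY)))  # [85, 9, 7]
```

Applying the same two functions to every entry in PACKAGES.in would give the raw durations; a package that is still archived simply contributes no gap for its last event.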

We could also annotate each archived package with information on why it was archived. For instance, CRANhaven could also serve as a dashboard that gives an overview of why packages are no longer available, as an alternative to visiting each CRAN package page.

llrs commented 3 months ago

I think we could show some comments next to the package name, something like "Archived on 2024-02-17 for policy violation." Getting the continuation text, such as "On Internet access (429 error).", onto the same line is trickier, so I would avoid it.
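As a rough sketch of that simpler option, the following keeps only the text on the "Archived on ..." line itself and ignores continuation lines. The function name `archive_reasons` is hypothetical, and the assumption that the reason always follows the date on the same line is based only on the two examples quoted in this issue.

```python
import re

# Date, then whatever follows on the same line (e.g. "for policy violation.")
ARCHIVED_RE = re.compile(r"^Archived on (\d{4}-\d{2}-\d{2})\s*(.*)$")


def archive_reasons(history):
    """Return (date, first-line reason) pairs for each archive event."""
    reasons = []
    for line in history.splitlines():
        m = ARCHIVED_RE.match(line.strip())
        if m:
            # Strip the trailing period; continuation lines are deliberately ignored.
            reasons.append((m.group(1), m.group(2).rstrip(".").strip()))
    return reasons


history = """Archived on 2024-02-17 for policy violation.
  .
  On Internet access (429 error).
  Unarchived on 2024-02-24."""

for when, reason in archive_reasons(history):
    print(when, reason)  # 2024-02-17 for policy violation
```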

I have some numbers from the raw data. There are some inconsistencies, but there are at least 4704 cases where a package was archived and later returned to CRAN. These come from 3649 unique packages, out of 8837 with at least one event registered in the file. The number of archived packages seems to have been rising (in line with what other ongoing research would suggest). It would be nice to cross-reference this with the number of attempts it takes to be accepted. Overall, the median time off CRAN was 30 days (all values in days):

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     8.0    30.0   113.2   109.0  3292.0 

Split by the number of times the package had been archived:

 attempt packages     min      q1     mean  median       q3       max
       1     3649  1 days  9 days 121 days 33 days 120 days 3292 days
       2      769  1 days  8 days  91 days 27 days  87 days 1949 days
       3      203  1 days  6 days  80 days 22 days  76 days  882 days
       4       65  1 days  9 days  66 days 23 days  54 days  652 days
       5       16  1 days  3 days  24 days 13 days  31 days   93 days
       6        2 17 days 20 days  22 days 22 days  24 days   27 days
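For reference, a per-attempt breakdown like the one above could be computed along these lines. This is only a sketch and not the code used to produce the table: the function name `summarize_by_attempt` is hypothetical, and the input covers just the two example packages quoted earlier in this issue, with gaps in days derived from their archive and unarchive dates.

```python
from collections import defaultdict
from statistics import median


def summarize_by_attempt(gaps_by_package):
    """Group each package's n-th archive-to-return gap and summarize per n."""
    by_attempt = defaultdict(list)
    for gaps in gaps_by_package.values():
        for attempt, days in enumerate(gaps, start=1):
            by_attempt[attempt].append(days)
    return {
        attempt: {
            "packages": len(values),
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for attempt, values in sorted(by_attempt.items())
    }


# Days off CRAN per archive event, derived from the two quoted examples.
gaps = {"jlmerclusterperm": [4], "BFS": [85, 9, 7]}

for attempt, stats in summarize_by_attempt(gaps).items():
    print(attempt, stats)
```

Because every package with an n-th return also has all earlier ones, the "packages" count can only shrink as the attempt number grows, which matches the shape of the table.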

I haven't taken #6 into account, but we could deduct 2 weeks when a package was archived close to 20XX/12/31.