bioconda / bioconda-paper

Data analysis related to the bioconda paper
MIT License

Bioconda delay analysis #13

Closed luispedro closed 6 years ago

luispedro commented 7 years ago

As discussed on authorea, this looks at the delay between the upstream release and the bioconda package.

  1. Look for the first commit that introduced the current version of the package. This is the bioconda date.
  2. Attempt to infer the upstream date: if it's a GitHub/PyPI release, use the metadata there. Otherwise, download the file and look at the modification times (mtimes) of its contents.

This does not work for all packages, but it works for >90% of them. Packages that refer to a particular git commit are ignored (I can also change the code to use the date of that commit).
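
A minimal sketch of the mtime fallback in step 2 and of the delay computation, not the actual script in this PR. The function names `newest_mtime` and `delay_in_days` are hypothetical, and taking the newest member mtime as the upstream date is an assumption (the description above only says to "look at mtimes").

```python
import io
import tarfile
from datetime import datetime, timezone


def newest_mtime(archive_bytes):
    """Return the most recent mtime among regular files in a tar archive.

    Assumption: the newest file in the tarball approximates the upstream
    release date. Returns a timezone-aware UTC datetime, or None if the
    archive contains no regular files.
    """
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        mtimes = [member.mtime for member in tar.getmembers() if member.isfile()]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)


def delay_in_days(upstream_date, bioconda_date):
    """Whole days between the upstream release and the bioconda commit."""
    return (bioconda_date - upstream_date).days
```

Here `bioconda_date` would be the date of the first commit that introduced the current version (step 1), obtained separately from the recipe history.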

I then summarized the results (while ignoring upstream packages from before 2016). These are the results I get:

Mean (+/- std.dev.) number of days between upstream release and bioconda package: 102.122157245 +/- 101.787918729
Median number of days between upstream release and bioconda package: 98.0
Based on 1539 packages.

Of the packages where this could be determined heuristically, 228 of 358 are current with their upstream release.
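
For reference, a summary of this kind can be computed as in the sketch below. The function name `summarize` and the example input are hypothetical, and the use of the population standard deviation is an assumption: the thread does not say which variant was used.

```python
import statistics


def summarize(delays):
    """Summarize a list of per-package delays (in days).

    Assumption: population standard deviation (pstdev); swap in
    statistics.stdev for the sample standard deviation.
    """
    return {
        "n": len(delays),
        "mean": statistics.mean(delays),
        "std": statistics.pstdev(delays),
        "median": statistics.median(delays),
    }


# Made-up delays for illustration only, not data from this analysis.
example = summarize([10, 20, 30, 40])
print(
    "Mean (+/- std.dev.): {mean:.1f} +/- {std:.1f}; "
    "median: {median:.1f}; based on {n} packages.".format(**example)
)
```

A mean well above the median indicates a long right tail: a few packages lag by many months while most are updated faster.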

(If this gets incorporated into the paper, then I would like to be an author [Luis Pedro Coelho, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany]. I have also added a recent package to bioconda, so I guess that would also qualify me, except I might have missed that deadline).

As discussed, I also got a (very partial) list of which packages are outdated wrt their upstream releases.

BTW, I wasn't sure at all where to put these scripts in this repo. This seemed the most logical place, but let me know if you prefer me to move them.

johanneskoester commented 7 years ago

This is very promising, thanks! To get unbiased results, it would be important to consider only packages where the current version is not the first one. The reason is that we only want to measure the delay for a new version to be picked up when the package is already in the repo. Otherwise, the results are biased arbitrarily by the delay between the last upstream release and the package's initial inclusion.

luispedro commented 7 years ago

You are right that we need to be careful with older packages. I had set an arbitrary cut-off for packages earlier than 2016, but you are right that only considering packages with >1 bioconda version is a better system. I will reimplement it like that.
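
A sketch of that filter, assuming the recipe history has already been reduced to a mapping from package name to the list of versions seen on bioconda. The name `packages_with_updates` and the example data are hypothetical.

```python
def packages_with_updates(versions_by_package):
    """Return the names of packages with more than one distinct bioconda version.

    Packages whose current version is also their first are excluded, since
    their delay reflects initial inclusion rather than an update.
    """
    return {
        name
        for name, versions in versions_by_package.items()
        if len(set(versions)) > 1
    }


# Made-up history for illustration only.
history = {
    "toolA": ["1.0", "1.1", "1.2"],  # updated at least once: kept
    "toolB": ["0.3"],                # only its first version: dropped
    "toolC": ["2.0", "2.0"],         # same version rebuilt: dropped
}
kept = packages_with_updates(history)
```

Counting distinct versions rather than commits keeps rebuilds of the same version from qualifying a package.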

(I will also fix the other issues).

luispedro commented 7 years ago

Updated results (now only considering packages with >1 version on bioconda):

Mean (+/- std.dev.) number of days between upstream release and bioconda package: 83.1293103448 +/- 99.1218653899
Median number of days between upstream release and bioconda package: 37.0
Based on 696 packages.

Of the packages where this could be determined heuristically, 160 of 217 are current with their upstream release.

luispedro commented 6 years ago

Following up on this (and on this recent piece of information: https://github.com/bioconda/bioconda-recipes/issues/6323#issuecomment-347994955):

Is this OK now? The big issue of which packages to count has been solved.

johanneskoester commented 6 years ago

Yes, I think it is fine. I will merge it and we will keep it in mind for the first revision.