datalad / datalad-usage-dashboard

Dashboard of detected usages of DataLad
MIT License

Populate from github search results #1

Closed yarikoptic closed 3 years ago

yarikoptic commented 3 years ago

It is pretty much a regular drill that, either out of curiosity or for reporting purposes, we run search queries on GitHub to detect possible usages of DataLad, and of datalad run in particular. Many of our own projects "pollute" the results, making it difficult to filter for new hits.

Here is e.g. a list from a fresh discussion on Riot:

Datalad code in python-neo: https://github.com/NeuralEnsemble/python-neo/search?q=datalad

Cool:
  • https://github.com/cryo-data/QGreenland
  • https://github.com/neuropoly/data-management
  • https://github.com/htwangtw/publiccrawler
  • https://github.com/Raj-Lab-UCSF/Human_Brain_Atlases

The Donders has started using datalad: https://github.com/Donders-Institute-Data/dcc.DSC_2018.00127_973_v1 (there are more) -> https://github.com/topics/datalad

a whole slew of individual datasets, e.g.
  • https://github.com/sappelhoff/eeg_matchingpennies
  • https://github.com/SIMEXP/fmriprep-reproducibility
  • https://github.com/connectomicslab/CMTKLIB-data

and some use datalad run if you search for RUNCMD and filter out "ours", e.g.
  • https://github.com/neurostuff/simulate-cbma
  • https://github.com/lnnrtwttkhn/tools
  • https://github.com/PennSIVE/agespan7t

It would be neat to somehow establish a "news feed" for new hits!?

I think, in a fashion similar to https://github.com/datalad/datalad-extensions/ and https://github.com/dandi/dandi-api-webshots/, we should have a cron job that updates a structured (JSON or YAML) record file and renders it as a table in README.md, something like:

In the wild

| Repository | Stars | Dataset | run | containers-run |
| --- | --- | --- | --- | --- |
| https://github.com/cryo-data/QGreenland | 1 | :heavy_check_mark: | :heavy_check_mark: | |

Inner-circle

.... similar for the ones we identify as "ours" -- from datalad and some other organizations or explicitly annotated as such
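The record-to-table step proposed above could be sketched roughly as follows. This is only an illustration, not an agreed design: the record keys (`url`, `stars`, `dataset`, `run`, `containers_run`), the function name `render_table`, and the sort order are all assumptions, and the sample row treats the containers-run cell as empty.

```python
from typing import Dict, List

CHECK = ":heavy_check_mark:"


def render_table(records: List[Dict]) -> str:
    """Render repository records as a Markdown table for README.md."""
    lines = [
        "| Repository | Stars | Dataset | run | containers-run |",
        "| --- | --- | --- | --- | --- |",
    ]
    # Assumed ordering: most-starred first, then by URL for stable diffs.
    for rec in sorted(records, key=lambda r: (-r["stars"], r["url"])):
        flags = [
            CHECK if rec.get(key) else ""
            for key in ("dataset", "run", "containers_run")
        ]
        cells = [rec["url"], str(rec["stars"]), *flags]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical record mirroring the sample row from the issue.
records = [
    {
        "url": "https://github.com/cryo-data/QGreenland",
        "stars": 1,
        "dataset": True,
        "run": True,
        "containers_run": False,
    },
]
print(render_table(records))
```

Keeping the records in a JSON or YAML file and regenerating the table from them on every cron run means the README never has to be edited by hand.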

"news feed" could be simply commits (with some descriptive commit message) to this repo ("added X new hits for Y datasets and Z usages of run") which we could then channel to Riot room ;)

To keep in mind

jwodder commented 3 years ago

@yarikoptic

If search results include actual commit maybe we could even tell if it was using a container if "singularity" or "docker" is in the record?

Could you give an example of a container-using commit?

yarikoptic commented 3 years ago
  • Other than the searches you've given, should there be any other searches or methods for discovering DataLad users?

I don't know of any other way/search yet; maybe other @datalad/developers and @datalad/contributors have an idea?

  • What exactly do the "Dataset", "run", and "containers-run" columns of the sample table mean?

I updated the description to associate each column with its query.

  • What do you mean by "news feed"?

Originally (before I thought about a "dashboard" presentation) I thought it could be an old-fashioned news feed announcing newly discovered entries. But if we have a git repo, individual new commits will serve that purpose just fine IMHO.

If search results include actual commit maybe we could even tell if it was using a container if "singularity" or "docker" is in the record?

Could you give an example of a container-using commit?

From https://github.com/search?q=DATALAD+RUNCMD+singularity&type=commits, a hit to https://github.com/neurostuff/simulate-cbma/commit/bfae73d4fe1f27eeba3561e2867816d0aa57602a . Having extra_inputs in the JSON record is actually a good indicator that it was a containers-run, since it passes the image. I have verified that this is also the case for docker containers with commands like

datalad create /tmp/test-docker-run
cd /tmp/test-docker-run
datalad containers-add -u dhub://neurodebian:nd110 neurodebian
datalad containers-run touch 1
last commit

```shell
commit 9aed7811b301ff0c15a23ea134ad1d26b6cc41b0
Author: Yaroslav Halchenko
Date:   Thu Jul 1 10:50:58 2021 -0400

    [DATALAD RUNCMD] python -m datalad_container.adapters.doc...

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python -m datalad_container.adapters.docker run .datalad/environments/neurodebian/image touch 1",
     "dsid": "67c93506-2336-49c9-b884-8e585b56daa5",
     "exit": 0,
     "extra_inputs": [
      ".datalad/environments/neurodebian/image"
     ],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/1 b/1
new file mode 120000
index 0000000..ce38032
--- /dev/null
+++ b/1
@@ -0,0 +1 @@
+.git/annex/objects/2W/kW/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e
\ No newline at end of file
```

jwodder commented 3 years ago

@yarikoptic

jwodder commented 3 years ago

@yarikoptic Also, in the desired commit message, what exactly is "Z usages of run" counting? The number of datalad run-using repositories that were not found in a previous run of the script? The number of datalad run-using repositories that either were not found in a previous run or else were not using datalad run previously? The number of new datalad run commits made over all of GitHub?

Also, should any effort be made to remove repositories that were found in a previous run of the script but not found by the current run?

yarikoptic commented 3 years ago

yes, doesn't need to be a datalad dataset to run datalad run.

  • So is a commit a container-run commit if & only if the commit metadata has an "extra_inputs" field?

AFAIK yes ATM, but that might change in the future. IIRC extra_inputs is not exposed in the user interface and is used by containers-run.
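For illustration, the extra_inputs heuristic could be checked against a commit message roughly like this. It is a sketch under assumptions: the function names are made up for this example, the record is located purely by the "Do not change lines" markers seen in the commit above, and a non-empty `extra_inputs` list is taken as the signal (which may stop holding if the record format changes).

```python
import json
import re
from typing import Optional

# Markers copied from the run-record commit message shown in this thread.
RECORD_RE = re.compile(
    r"=== Do not change lines below ===\s*(\{.*\})\s*"
    r"\^\^\^ Do not change lines above \^\^\^",
    re.DOTALL,
)


def parse_run_record(message: str) -> Optional[dict]:
    """Return the JSON run record embedded in a commit message, if any."""
    if "[DATALAD RUNCMD]" not in message:
        return None
    match = RECORD_RE.search(message)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def is_containers_run(message: str) -> bool:
    """Heuristic: a run record with non-empty extra_inputs (assumption)."""
    record = parse_run_record(message)
    return bool(record and record.get("extra_inputs"))


message = """[DATALAD RUNCMD] python -m datalad_container.adapters.doc...

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python -m datalad_container.adapters.docker run .datalad/environments/neurodebian/image touch 1",
 "dsid": "67c93506-2336-49c9-b884-8e585b56daa5",
 "exit": 0,
 "extra_inputs": [
  ".datalad/environments/neurodebian/image"
 ],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""
print(is_containers_run(message))  # True
```

A commit with a run record but no extra_inputs would then count toward "run" only, and a commit without the `[DATALAD RUNCMD]` marker toward neither column.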

Also, should any effort be made to remove repositories that were found in a previous run of the script but not found by the current run?

Good question. I guess to be totally kosher we could add one more category in addition to "In the wild" and "Inner circle", and call it "Gone" or the like? So we would just annotate them and render them separately, while still keeping them "on record".
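One possible way to sort repositories into the three groups, sketched under assumptions: repositories are identified as "owner/name" strings, "ours" is decided only by a set of organization names (explicit per-repository annotation is not handled here), and `example/removed-repo` is a made-up name.

```python
from typing import Dict, List, Set

# Assumed list of "our" organizations; to be extended as needed.
INNER_CIRCLE_ORGS = {"datalad"}


def categorize(
    known: Set[str], found: Set[str], inner_orgs: Set[str] = INNER_CIRCLE_ORGS
) -> Dict[str, List[str]]:
    """Split repositories into wild / inner_circle / gone."""
    result: Dict[str, List[str]] = {"wild": [], "inner_circle": [], "gone": []}
    for repo in sorted(known | found):
        owner = repo.split("/", 1)[0].lower()
        if repo not in found:
            # Seen in an earlier run but not now: keep it on record as "gone".
            result["gone"].append(repo)
        elif owner in inner_orgs:
            result["inner_circle"].append(repo)
        else:
            result["wild"].append(repo)
    return result


known = {"datalad/datalad-extensions", "cryo-data/QGreenland", "example/removed-repo"}
found = {"datalad/datalad-extensions", "cryo-data/QGreenland", "neurostuff/simulate-cbma"}
print(categorize(known, found))
```

In this sketch a repository that reappears in a later search simply moves back out of "gone", since the category is recomputed on every run.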

@yarikoptic Also, in the desired commit message, what exactly is "Z usages of run" counting?

since we probably shouldn't blow it into a full tracking of all the commits, let's go for

The number of datalad run-using repositories that either were not found in a previous run or else were not using datalad run previously?
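A rough sketch of how such a commit message might be computed from two successive record snapshots. The counting of Z follows the definition chosen above; the readings of X (repositories not seen before) and Y (repositories newly flagged as datasets) are assumptions, as are the record layout and the `example/...` names.

```python
from typing import Dict


def summarize(previous: Dict[str, dict], current: Dict[str, dict]) -> str:
    """Build the news-feed commit message from old and new records."""

    def newly_flagged(key: str) -> int:
        # Count repos with the flag set now that were either absent
        # before or present without the flag.
        return sum(
            1
            for repo, rec in current.items()
            if rec.get(key) and not previous.get(repo, {}).get(key)
        )

    new_hits = sum(1 for repo in current if repo not in previous)
    return (
        f"added {new_hits} new hits for {newly_flagged('dataset')} datasets "
        f"and {newly_flagged('run')} usages of run"
    )


# Hypothetical snapshots: example/a starts using run, example/b is new.
previous = {"example/a": {"dataset": True, "run": False}}
current = {
    "example/a": {"dataset": True, "run": True},
    "example/b": {"dataset": False, "run": True},
}
print(summarize(previous, current))
```

Because only per-repository flags are compared, the count stays independent of how many individual datalad run commits each repository accumulates.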