dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)
0 stars 1 forks source link

Skip embargoed datasets somehow? #73

Closed yarikoptic closed 6 months ago

yarikoptic commented 7 months ago

after interrupt and restart of drogon ran into

dandi@drogon:~/cronlib/dandisets-healthstatus$ tools/run_loop.sh
Username for 'https://github.com': Username for 'https://github.com': Username for 'https://github.com': Username for 'https://github.com': Username for 'https://github.com':

which is due to attempt to access/clone private repos through one big

dandi     332698  0.9  0.0 1482080 36972 pts/5   Sl+  12:20   0:16                         /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/datalad get -d /mnt/backup/dandi/dandisets-healthstatus/dandisets -r -R1 -J5 -n

we do not annotate for a dandiset being private or not within dandisets/.gitmodules so can't really programmatically do filter them out...

Cheap trick to prevent git from asking anything is GIT_TERMINAL_PROMPT=0 but then I am not sure what would be our reaction from fuse layer.

@jwodder any ideas on what we should do here?

jwodder commented 7 months ago

@yarikoptic What sort of filtering does datalad get support? If embargo-ness were stored in .gitmodules, how would you use it for filtering?

yarikoptic commented 7 months ago

not much if any (only that depth -R which we use alredy) . Filed

I guess we would just need to do smth like some adhoc get-list-of-submodules-without | xargs datalad get -d PATH

yarikoptic commented 7 months ago

for now I will try a run with following patch

diff --git a/code/src/healthstatus/__main__.py b/code/src/healthstatus/__main__.py
index 2bc4b79..1d59451 100644
--- a/code/src/healthstatus/__main__.py
+++ b/code/src/healthstatus/__main__.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 from collections import Counter, defaultdict
 import logging
+import os
 from operator import attrgetter
 from pathlib import Path
 import re
@@ -64,6 +65,14 @@ def check(
     mode: str,
 ) -> None:
     log.info("Updating Dandisets dataset ...")
+    # We have embargoed dandisets which should not be cloned
+    # so we want to prevent git asking password and have to avoid non-0 exit
+    # from datalad
+    # Ref: https://github.com/dandi/dandisets-healthstatus/issues/73
+    kw = dict(
+        env=dict(GIT_TERMINAL_PROMPT='0', **os.environ),
+        check=False,
+    )
     subprocess.run(
         [
             "datalad",
@@ -76,11 +85,11 @@ def check(
             "-r",
             "-R1",
         ],
-        check=True,
+        **kw
     )
     subprocess.run(
         ["datalad", "get", "-d", str(dataset_path), "-r", "-R1", "-J5", "-n"],
-        check=True,
+        **kw
     )
     installer = MatNWBInstaller(MATNWB_INSTALL_DIR)
     installer.install(update=True)
yarikoptic commented 6 months ago

@jwodder Let's code explicit looping through subdatasets here to get stuff updated - get a full list of datasets for the organization, exclude private, get (new) or update (existing) ones.

jwodder commented 6 months ago

@yarikoptic

get (new) or update (existing) ones.

Is this doable by just adding all public datasets as arguments to the update and get commands, or is something more involved needed?

yarikoptic commented 6 months ago

thinking about it , although I can be wrong, I think we probably didn't have issue with datalad upgrade since it would not install new subdatasets... so it must have been the get. For get, since we are coding in python anyways already, let's indeed just pass as arguments BUT only those which are not already installed, so take difference from output of datalad subdatasets --state present (in python equivalent) and if any -- give to get. If none -- no need to run get at all