datalad / datalad-metalad

Next generation metadata handling

Lots of identical git processes, --batch and not, operating in the same repository -- expected? #261


yarikoptic commented 2 years ago

Some time during OHBM 2022, @jsheunis and I started metadata extraction on the datasets.datalad.org collection (AKA ///) to populate a data catalog. Nearly a week later, when I came home, I found smaug still sweating (at load ~20) from that drill. I decided to look into those processes and discovered that they all run in the same dataset:

$> ps auxw | grep yoh.*git | awk '{print $2;}' | while read d ; do readlink /proc/$d/cwd; done| uniq -c
     96 /mnt/datasets/datalad/crawl/adhd200/surfaces

$> ps auxw | grep yoh.*git | sed -e 's,.*exe/git,git,g' | sort | uniq -c                            
     16 git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex findref --copies 0 HEAD --json --json-error-messages -c annex.dotfiles=true
      1 git,git,g
     16 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c diff.ignoreSubmodules=none annex findref --copies 0 HEAD --json --json-error-messages -c annex.dotfiles=true
     32 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true cat-file --batch
     16 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
     16 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true ls-tree --full-tree -z -r -- HEAD
      1 yoh       381817  0.0  0.0   6316  2468 pts/16   S+   08:55   0:00 grep --color=auto -d skip yoh.*git

So all those git processes are working in the same /mnt/datasets/datalad/crawl/adhd200/surfaces. I wonder if that is expected, e.g. due to multiprocessing or something like that? I would have expected that with multiprocessing, parallelization would happen across datasets, but I could be wrong.

FWIW -- that dataset has a considerable number of files -- almost 300k:

```shell
$> git -C /mnt/datasets/datalad/crawl/adhd200/surfaces annex info
trusted repositories: 0
semitrusted repositories: 4
	00000000-0000-0000-0000-000000000001 -- web
	00000000-0000-0000-0000-000000000002 -- bittorrent
	54630774-db20-48b2-b5c6-d11340f83105 -- yoh@falkor:/srv/datasets.datalad.org/www/adhd200/surfaces [datalad-public]
	7e184ea3-7255-44a6-bb69-32e68c5ea990 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/indi/adhd200/surfaces [here]
untrusted repositories: 0
transfers in progress: none
available local disk space: 29.98 terabytes (+1 megabyte reserved)
local annex keys: 0
local annex size: 0 bytes
annexed files in working tree: 285752
size of annexed files in working tree: 340.79 gigabytes
bloom filter size: 32 mebibytes (0% full)
backend usage:
	MD5E: 285752
git -C /mnt/datasets/datalad/crawl/adhd200/surfaces annex info  2.14s user 3.95s system 20% cpu 29.111 total
```

edit: FTR, I am running

(git)smaug:/mnt/datasets/datalad/crawl-catalog[master]git
$> tools/extract_filelevel ../crawl/ extracts/filelevel-r.json
christian-monch commented 2 years ago

Thx for the issue, looking into it

christian-monch commented 2 years ago

@yarikoptic: this is a side effect of the function annex_status() (see below), which is called for every file on which a file-level extractor is executed.

def annex_status(annex_repo, paths=None):
    # Three nested queries: plain status, annex info at HEAD, then annex
    # info for the worktree, each seeding the next via `init`.
    info = annex_repo.get_content_annexinfo(
        paths=paths,
        eval_availability=False,
        init=annex_repo.get_content_annexinfo(
            paths=paths,
            ref="HEAD",
            eval_availability=False,
            init=annex_repo.status(
                paths=paths,
                untracked="no",
                eval_submodule_state="full")
        )
    )
    annex_repo._mark_content_availability(info)
    return info

Execution of this function leads to the execution of the following commands:

1> git -c diff.ignoreSubmodules=none rev-parse --quiet --verify HEAD^{commit}
2> git -c diff.ignoreSubmodules=none ls-files --stage -z -- <file-path>
3> git -c diff.ignoreSubmodules=none ls-files -z -m -d -- <file-path>
4> git -c diff.ignoreSubmodules=none ls-tree HEAD -z -r --full-tree -l -- <file-path>
5> git annex version --raw
6> git -c diff.ignoreSubmodules=none annex findref --copies 0 HEAD --json --json-error-messages -c annex.dotfiles=true
7> git -c diff.ignoreSubmodules=none annex find --copies 0 --json --json-error-messages -c annex.dotfiles=true -- <file-path>

In words: seven subprocesses for every file-level extraction. The first, the fifth, and the sixth could obviously be cached. The other four contain the path of the file that is operated on and therefore have to be executed for every file path. Still, there might be an opportunity to run them on multiple files at once and cache the results, assuming that an extraction is rarely limited to a single file.
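For the first, fifth, and sixth commands, a minimal sketch of the per-repository caching idea (names here are hypothetical, not metalad API; safe only while the repository state does not change):

```python
from functools import lru_cache
import subprocess

# Hypothetical helper: memoize path-independent commands per repository,
# so they run once per dataset instead of once per extracted file.
@lru_cache(maxsize=None)
def cached_repo_command(repo_path: str, command: tuple) -> str:
    return subprocess.run(
        list(command), cwd=repo_path,
        capture_output=True, text=True, check=True,
    ).stdout

# First call in a dataset pays the cost; subsequent calls hit the cache.
head = cached_repo_command(
    "/mnt/datasets/datalad/crawl/adhd200/surfaces",
    ("git", "rev-parse", "--quiet", "--verify", "HEAD^{commit}"),
)
```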

WDYT?

christian-monch commented 2 years ago

BTW: I don't know where the following two processes originate:

     32 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true cat-file --batch
     16 git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
yarikoptic commented 2 years ago

> WDYT?

Long term -- for the extractor I am thinking of https://github.com/datalad/datalad-metalad/issues/257, or RFing extraction to operate on an entire tree/list of files via git log --stat or alike, going through that history to assign the most recent commit to each file. Running git log per file is way too expensive IMHO.
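A rough sketch of that single-pass idea (hypothetical helper, not existing metalad code; it relies on git log's default newest-first order):

```python
import subprocess

def latest_commit_per_file(repo: str) -> dict[str, str]:
    # Walk history once and record, for each file, the first (i.e. most
    # recent) commit that touched it -- instead of one `git log` per file.
    out = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--format=%x00%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    latest: dict[str, str] = {}
    for block in out.split("\x00")[1:]:
        lines = block.strip().splitlines()
        commit, files = lines[0], [l for l in lines[1:] if l]
        for path in files:
            latest.setdefault(path, commit)  # first seen == most recent
    return latest
```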

In both the long & short term -- what metadata do we aim to extract? Maybe we could just minimize the number of calls to git somehow.

PS: smaug is still running metadata extraction at load 20... it has been 2 weeks IIRC

christian-monch commented 2 years ago

> PS: smaug is still running metadata extraction at load 20... it has been 2 weeks IIRC

Thx. Will check the numbers and see whether this is expected (probably not).

christian-monch commented 2 years ago

I cannot properly analyze the problem because I have no access to /mnt/datasets/datalad/crawl-catalog and the tools and extracts directories.

yarikoptic commented 2 years ago

Now you (and others in the datalad group) should have read-only access to everything there.

christian-monch commented 2 years ago

> Now you (and others in the datalad group) should have read-only access to everything there.

Thx

christian-monch commented 2 years ago

I still cannot read /mnt/datasets/datalad/crawl/abide/.git/index, which leads to a failing traversal.

christian-monch commented 2 years ago

I am looking at /mnt/btrfs/datasets-metalad-cm/datalad/crawl instead.

It contains 28.5 million files and directories.

time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl|wc -l
28530451

real    62m22.862s
user    0m46.760s
sys     2m10.747s

Filtering out .git content (and with a drastically reduced runtime -- hurray for caches, I guess ;-)), there are still 20 million files:

> time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl |grep -v \\\.git|wc -l
20162909

real    2m10.793s
user    0m47.851s
sys     1m10.549s

The dataset traverser emits 700 entities per minute. It should therefore take about 20 days to traverse 20,000,000 files (20,000,000 / 700 ≈ 28,600 minutes ≈ 19.8 days). :-(

That is not good! I will look into the traverser performance.

yarikoptic commented 2 years ago

> /mnt/datasets/datalad/crawl/abide/.git/index

Fixing that... check again tomorrow or so -- I am running this across the entire archive:

(git)smaug:/mnt/datasets/datalad/crawl[master]git
$> echo * | xargs -n 1 -P 4 chmod g+rX -R
christian-monch commented 2 years ago

I have started an attempt to solve this issue with a caching command server, i.e. a process that executes commands for its clients, caches the results, and returns the cached result if the "same" command is executed twice (see https://github.com/datalad/datalad-metalad/issues/268#issuecomment-1192369249).
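For illustration only -- a toy sketch of such a server's core (the actual work in #268 may differ); keying the cache on HEAD is one possible way to address the staleness concern raised below:

```python
import subprocess

class CachingCommandServer:
    # Toy sketch: execute a command once per (cwd, argv, repo-state) key
    # and serve the cached result to later clients.
    def __init__(self):
        self._cache: dict[tuple, subprocess.CompletedProcess] = {}

    def run(self, cwd: str, argv: tuple[str, ...]) -> subprocess.CompletedProcess:
        key = (cwd, argv, self._repo_state(cwd))
        if key not in self._cache:
            self._cache[key] = subprocess.run(
                list(argv), cwd=cwd, capture_output=True, text=True)
        return self._cache[key]

    @staticmethod
    def _repo_state(cwd: str) -> str:
        # Key on HEAD so cached results are not reused after the
        # repository's state changes.
        return subprocess.run(
            ["git", "rev-parse", "HEAD"], cwd=cwd,
            capture_output=True, text=True).stdout.strip()
```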

yarikoptic commented 2 years ago

Hmm, not sure why such caching is needed (instead of fixing the code so that there are no duplicate calls), and when it would be safe to use -- many calls depend on the external state of the repository, etc.