datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

RF: Adjust for add-archive-content refactoring in datalad core #112

Closed adswa closed 2 years ago

adswa commented 2 years ago

https://github.com/datalad/datalad/pull/6105 refactors add-archive-content to be a dataset method. This requires changes in datalad-crawler's use/adaptation of the function. First, we need to pass a dataset instance. Second, I had to disable the integrity check of 'annex', which used to be returned by add-archive-content but isn't anymore.

Locally, this makes the changes in https://github.com/datalad/datalad/pull/6105 not break any crawler tests anymore. It would be great to have your opinion on this, @yarikoptic.

codecov[bot] commented 2 years ago

Codecov Report

Merging #112 (a7321c8) into master (e59722a) will decrease coverage by 0.00%. The diff coverage is 83.33%.


@@            Coverage Diff             @@
##           master     #112      +/-   ##
==========================================
- Coverage   81.79%   81.78%   -0.01%     
==========================================
  Files          59       59              
  Lines        4768     4771       +3     
==========================================
+ Hits         3900     3902       +2     
- Misses        868      869       +1     
Impacted Files                  | Coverage Δ
datalad_crawler/nodes/annex.py  | 80.99% <83.33%> (-0.07%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e59722a...a7321c8.

adswa commented 2 years ago

@yarikoptic I do need your help with one other thing. I think it is related to the Annexificator. After https://github.com/datalad/datalad/pull/6105/commits/ab852c43ea591618191ce15fe7b8906bcbe65801 and https://github.com/datalad/datalad/pull/6105/commits/ee83851fb7be69f36e16dcec8ed0a69583604c8f, which refactor the check for preexisting datalad-archives remotes to use ensure_datalad_remote (which calls repo.get_special_remotes() internally), one crawler test fails:


======================================================================
ERROR: datalad_crawler.nodes.tests.test_annex.test_add_archive_content_tar
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/adina/env/handbook2/lib/python3.9/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 1163, in _wrap_assert_cwd_unchanged
    raise exc_info[1]
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 1135, in _wrap_assert_cwd_unchanged
    ret = func(*args, **kwargs)
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 558, in _wrap_with_tree
    return t(*(arg + (d,)), **kw)
  File "/home/adina/repos/datalad-crawler/datalad_crawler/nodes/tests/test_annex.py", line 227, in test_add_archive_content_tar
    output_addarchive = list(
  File "/home/adina/repos/datalad-crawler/datalad_crawler/nodes/annex.py", line 1275, in _add_archive_content
    add_archive_content(
  File "/home/adina/repos/datalad/datalad/interface/utils.py", line 484, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/adina/repos/datalad/datalad/interface/utils.py", line 476, in return_func
    results = list(results)
  File "/home/adina/repos/datalad/datalad/interface/utils.py", line 396, in generator_func
    for r in _process_results(
  File "/home/adina/repos/datalad/datalad/interface/utils.py", line 579, in _process_results
    for res in results:
  File "/home/adina/repos/datalad/datalad/interface/add_archive_content.py", line 401, in __call__
    ensure_datalad_remote(ds.repo, remote=ARCHIVES_SPECIAL_REMOTE,
  File "/home/adina/repos/datalad/datalad/customremotes/base.py", line 590, in ensure_datalad_remote
    init_datalad_remote(repo, remote,
  File "/home/adina/repos/datalad/datalad/customremotes/base.py", line 560, in init_datalad_remote
    return repo.init_remote(remote, remote_opts + opts)
  File "/home/adina/repos/datalad/datalad/support/annexrepo.py", line 1878, in init_remote
    self.call_annex(['initremote'] + [name] + options)
  File "/home/adina/repos/datalad/datalad/support/annexrepo.py", line 1170, in call_annex
    return self._call_annex(
  File "/home/adina/repos/datalad/datalad/support/annexrepo.py", line 924, in _call_annex
    return runner.run(
  File "/home/adina/repos/datalad/datalad/runner/runner.py", line 145, in run
    raise CommandError(
datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c annex.alwayscommit=false annex initremote datalad-archives encryption=none type=external autoenable=true externaltype=datalad-archives uuid=c04eb54b-4b4e-5755-8436-866b043170fa -c annex.dotfiles=true' failed with exitcode 1 under /tmp/datalad_temp_tree_test_add_archive_content_tariz6chk26 [err: 'git-annex: There is already a special remote named "datalad-archives". (Use enableremote to enable an existing special remote.)']

----------------------------------------------------------------------
Ran 10 tests in 11.709s

That is, the check for already existing special remotes failed to detect the one in the test repo. Digging into why this may be, I found something weird: repo.get_special_remotes returns all known special remotes, enabled and unenabled, by querying the remote.log of the git-annex branch. In the created test repo, this query fails. Here is the relevant debug output:

datalad.runner.runner: DEBUG  : Finished ['git', '-c', 'diff.ignoreSubmodules=none', 'cat-file', 'blob', 'git-annex:remote.log'] with status 128
datalad.dataset.gitrepo: Level 11: CommandError: 'git -c diff.ignoreSubmodules=none cat-file blob git-annex:remote.log' failed with exitcode 128 under /tmp/datalad_temp_tree_test_add_archive_content_tarutj6f3bz [err: 'fatal: Not a valid object name git-annex:remote.log']
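To make the failure mode concrete, here is a minimal sketch of what a remote.log lookup amounts to. It is not datalad's implementation of get_special_remotes. It assumes the documented git-annex layout of remote.log (one line per remote: a UUID followed by whitespace-separated key=value pairs and a timestamp) and ignores details such as values containing spaces or several lines for the same UUID. The helper names and the timestamp in the sample line are hypothetical; the UUID and settings are taken from the initremote call in the traceback above.

```python
from typing import Dict, Set


def parse_remote_log(text: str) -> Dict[str, Dict[str, str]]:
    """Map special-remote UUID -> settings parsed from remote.log content.

    Simplified: assumes values contain no whitespace, and lets a later
    line for the same UUID overwrite an earlier one.
    """
    remotes: Dict[str, Dict[str, str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        uuid, settings = fields[0], {}
        for field in fields[1:]:
            key, sep, value = field.partition("=")
            if sep:  # keep only well-formed key=value pairs
                settings[key] = value
        remotes[uuid] = settings
    return remotes


def special_remote_names(text: str) -> Set[str]:
    """Names of all special remotes recorded in the given remote.log content."""
    return {cfg["name"] for cfg in parse_remote_log(text).values() if "name" in cfg}


# Hypothetical sample line; the timestamp value is made up for illustration.
SAMPLE = (
    "c04eb54b-4b4e-5755-8436-866b043170fa autoenable=true encryption=none "
    "externaltype=datalad-archives name=datalad-archives type=external "
    "timestamp=1634000000.0s\n"
)

known = special_remote_names(SAMPLE)
missing = special_remote_names("")  # what a lookup sees when remote.log cannot be read
```

With a readable remote.log the name is found; when `git cat-file blob git-annex:remote.log` fails as in the debug output, there is nothing to parse and the lookup comes back empty, so the caller proceeds to initremote.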

Looking further into the test repo, there appears to be something funky about it. It appears to have a master branch:

adina@muninn in /tmp/datalad_temp_tree_test_add_archive_content_tarr7fknqkb on git:master+
❱ git st
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   1.tar

adina@muninn in /tmp/datalad_temp_tree_test_add_archive_content_tarr7fknqkb on git:master+

... but yet it doesn't?

adina@muninn in /tmp/datalad_temp_tree_test_add_archive_content_tarr7fknqkb on git:master+
❱ git branch
  git-annex
❱ ls .git/refs/heads
git-annex

And the git-annex branch isn't actually a git-annex branch:

adina@muninn in /tmp/datalad_temp_tree_test_add_archive_content_tarr7fknqkb on git:master+
❱ git co git-annex
A   1.tar
Switched to branch 'git-annex'
adina@muninn in /tmp/datalad_temp_tree_test_add_archive_content_tarr7fknqkb on git:git-annex+
❱ ls
1.tar

This is because the repository does not have a single commit yet. The 1.tar archive is only staged, there is no initial commit, and no real git-annex branch has been established, so the relevant remote.log does not exist either.

Previously, add-archive-content relied on annex.get_remotes() to check for pre-existing remotes. That check apparently missed initialized but unenabled special remotes (datalad#1693), but it succeeded for this particular kind of repo. get_remotes queries .git/config for remotes instead of checking git-annex's remote.log. How can I ensure that the special remote is found without reverting https://github.com/datalad/datalad/pull/6105/commits/ab852c43ea591618191ce15fe7b8906bcbe65801 and https://github.com/datalad/datalad/pull/6105/commits/ee83851fb7be69f36e16dcec8ed0a69583604c8f? I think I'm missing something about what this test setup does or is supposed to do.
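To illustrate the difference between the two lookups, here is a rough sketch of a check that consults both sources before deciding which git-annex command to run. This is only an illustration under assumptions, not what ensure_datalad_remote does and not a proposed patch: the function names are hypothetical, the .git/config text is matched with a simple regular expression rather than git's own config parser, and the remote.log names are passed in as an already computed set.

```python
import re
from typing import Iterable, List, Set


def remotes_in_git_config(config_text: str) -> Set[str]:
    """Remote names declared as [remote "<name>"] sections in .git/config text."""
    return set(re.findall(r'^\s*\[remote "([^"]+)"\]', config_text, flags=re.MULTILINE))


def annex_remote_command(
    name: str,
    config_text: str,
    remote_log_names: Set[str],
    init_options: Iterable[str],
) -> List[str]:
    """Pick the git-annex call needed to make special remote `name` usable.

    Returns an empty list when the remote is already configured locally.
    """
    if name in remotes_in_git_config(config_text):
        return []  # already present in .git/config: nothing to do
    if name in remote_log_names:
        # known to git-annex but not enabled here
        return ["git", "annex", "enableremote", name]
    # unknown everywhere: initialize it from scratch
    return ["git", "annex", "initremote", name, *init_options]


# Hypothetical config content resembling the test repo: the remote exists
# locally although no remote.log could be read.
CONFIG = '[core]\n\tbare = false\n[remote "datalad-archives"]\n\tannex-externaltype = datalad-archives\n'
OPTIONS = ["encryption=none", "type=external", "externaltype=datalad-archives"]

already_configured = annex_remote_command("datalad-archives", CONFIG, set(), OPTIONS)
needs_enable = annex_remote_command("datalad-archives", "", {"datalad-archives"}, OPTIONS)
needs_init = annex_remote_command("datalad-archives", "", set(), OPTIONS)
```

In this sketch, a remote that is visible only in .git/config (the test repo's situation) results in no call at all, a remote visible only in remote.log (the datalad#1693 situation) results in enableremote, and only a remote unknown to both results in initremote.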

adswa commented 2 years ago

With datalad/datalad#6135 merged, I think this one is ready to go.