dandi / dandisets

737 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

.gitmodules contains duplicate content #223

Closed · jwodder closed this 2 years ago

jwodder commented 2 years ago

Over the course of several commits starting on July 11, the .gitmodules file in this repository somehow ended up with its content appended to itself (in a jumbled order) multiple times, leading to warnings from Git. The file needs to be cleaned up and the cause dealt with. My best guess is that something went wrong with this check for whether a submodule was already registered.

yarikoptic commented 2 years ago

yikes, nice catch. and wow -- we have them repeated up to 48 times!

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep submodule .gitmodules | sort | uniq -c | sort -n | tail
     48 [submodule "000245"]
     48 [submodule "000246"]
     48 [submodule "000249"]
     48 [submodule "000251"]
     48 [submodule "000252"]
     48 [submodule "000255"]
     48 [submodule "000288"]
     48 [submodule "000290"]
     48 [submodule "000292"]
     48 [submodule "000293"]
```
did you figure out how that could have happened? `git submodule` itself seems not to allow that:

```shell
(dandisets) dandi@drogon:/tmp/dandisets$ git submodule add https://github.com/dandisets/000255.git 000255
fatal: '000255' already exists in the index
```

yarikoptic commented 2 years ago

if we can't figure out what causes it, let's proliferate the code with calls to some naive `assert_no_duplicates_in_gitmodules` function before/after every location which could potentially modify that file, like that `submodule add`. Eventually it should raise an exception, I guess.
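A minimal sketch of what such a guard could look like, assuming a simple regex scan of the `[submodule "..."]` section headers (the function name comes from the comment above; the implementation is only an illustration, not existing backups2datalad code):

```python
import re
from collections import Counter
from pathlib import Path


def assert_no_duplicates_in_gitmodules(repo_path) -> None:
    """Raise if any [submodule "..."] section appears more than once in .gitmodules."""
    gitmodules = Path(repo_path, ".gitmodules")
    if not gitmodules.exists():
        return
    names = re.findall(
        r'^\[submodule "(.+?)"\]\s*$', gitmodules.read_text(), flags=re.MULTILINE
    )
    dups = sorted(name for name, count in Counter(names).items() if count > 1)
    if dups:
        raise RuntimeError(f"duplicate submodule sections in {gitmodules}: {dups}")
```

Calling something like this before and after every save / submodule registration step would at least pinpoint which operation introduces the duplicates.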

yarikoptic commented 2 years ago
looking at this git log

```shell
commit c3274c9cd9c85deedeb866d2594be96bdab7ba8b
Author: DANDI Team
Date:   Mon Jul 11 16:37:02 2022 -0400

    CRON update

 .gitmodules | 696 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 696 insertions(+)

commit cc5b320d9e9dc3f8e7a7fd0edc957519302c5157
Author: DANDI Team
Date:   Mon Jul 11 15:41:20 2022 -0400

    CRON update

 .gitmodules | 696 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 000288      |   2 +-
 2 files changed, 697 insertions(+), 1 deletion(-)

commit ca45e044261596ea6c791fc0bb85dc2853692e4b
Merge: b249a39 041d245
Author: Yaroslav Halchenko
Date:   Mon Jul 11 13:21:29 2022 -0400

    Merge pull request #221 from dandi/gh-151

    Set descriptions for Zarr GitHub repos as soon as they're synced

commit 041d245ec978b34b1efc5f130ac34450a220968f (github/gh-151)
Author: John T. Wodder II
Date:   Mon Jul 11 11:44:00 2022 -0400

    [skip ci] Add docstring

 tools/backups2datalad/manager.py | 4 ++++
 1 file changed, 4 insertions(+)

commit cedc5c26462b7cfa1b40a21bfe2f2b574d32b4ec
Author: John T. Wodder II
Date:   Mon Jul 11 10:59:17 2022 -0400

    Set descriptions for Zarr GitHub repos as soon as they're synced

 tools/backups2datalad.req.txt               |   2 +-
 tools/backups2datalad/adataset.py           |  39 +++++++-
 tools/backups2datalad/asyncer.py            | 124 ++++++++++----------
 tools/backups2datalad/datasetter.py         | 207 ++++++++++-----------------
 tools/backups2datalad/manager.py            |  96 +++++++++++++++
 tools/backups2datalad/syncer.py             |  18 +++++---
 tools/backups2datalad/util.py               |  51 ---------
 tools/backups2datalad/zarr.py               |  54 ++++++----
 tools/test_backups2datalad/conftest.py      |   2 +-
 tools/test_backups2datalad/test_commands.py |   2 +-
 tools/test_backups2datalad/test_core.py     |   2 +-
 tools/test_backups2datalad/test_zarr.py     |  15 +++++---
 12 files changed, 346 insertions(+), 266 deletions(-)

commit b249a398645256071acd5c925e2ea9e90906b3ff
Author: DANDI Team
Date:   Fri Jul 8 13:46:41 2022 -0400

    CRON update

 000292 | 2 +-
 000293 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
```

it suggests that #221 was the culprit, since right after it the commit cc5b320d9e9dc3f8e7a7fd0edc957519302c5157, which added the new submodule 000288, also added that huge copy for the first time.

note that around the same point in time we might have upgraded git-annex from conda-forge, and thus maybe git as well...

yarikoptic commented 2 years ago

for now I have disabled that cronjob, so that whenever the current run completes we can safely bring .gitmodules back to a sensible state.

jwodder commented 2 years ago

@yarikoptic It appears that the duplication is due to DataLad. When adding a subdataset, backups2datalad ensures that the necessary section is present in .gitmodules, and then it calls `datalad save path/to/subdataset .gitmodules`, and at this point DataLad adds the duplicate section. This is interesting, as the whole reason I had to make backups2datalad add the section to .gitmodules itself was because DataLad wasn't doing it.
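For context, the sequence described above could be sketched roughly via DataLad's Python API as follows (the paths, the Dandiset ID, and the unconditional append are illustrative only and omit the "already registered" check; this is not the actual backups2datalad code):

```python
from pathlib import Path
from datalad.api import Dataset

superds = Dataset("/mnt/backup/dandi/dandisets")  # the super-dataset
sub = "000255"                                    # a newly backed-up Dandiset

# 1. backups2datalad appends the [submodule] section to .gitmodules itself ...
gitmodules = Path(superds.path) / ".gitmodules"
with gitmodules.open("a") as f:
    f.write(
        f'[submodule "{sub}"]\n'
        f"\tpath = {sub}\n"
        f"\turl = https://github.com/dandisets/{sub}.git\n"
    )

# 2. ... and then saves the new subdataset together with .gitmodules.
#    Per the comment above, this is the step where DataLad appended a
#    second, duplicate [submodule] section.
superds.save(path=[sub, ".gitmodules"], message=f"Add subdataset {sub}")
```

The key point is step 2: the save of the subdataset path together with .gitmodules is where the duplication appeared.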

yarikoptic commented 2 years ago

oh - nice find! could you please file an issue against datalad for that with some reproducer (could just use some existing superdataset from anywhere, even dandisets from github)?

But also, why don't we use the `datalad save -d path/to/dandisets path/to/subdataset` form (i.e. specifying the dandisets superdataset)?

jwodder commented 2 years ago

Issue filed: https://github.com/datalad/datalad/issues/6843

> But also, why don't we use the `datalad save -d path/to/dandisets path/to/subdataset` form (i.e. specifying the dandisets superdataset)?

Because I never considered that `-d path/to/current/dataset` would make a difference when running in that dataset. Why does it?

yarikoptic commented 2 years ago

because that is how some commands work (extended more on that in the issue you filed: https://github.com/datalad/datalad/issues/6775#issuecomment-1185791134): they do not modify the current dataset with changes about (possible) subdatasets unless `-d` is provided, since they "scope" into operating within the datasets given as target PATHs. Specifying `-d` scopes the operation to be performed within that dataset, thus saving the changed state of the subdataset into it. So I guess the situation could be addressed (before we address/release a fix for that new issue you filed) simply by providing our top-level dataset and avoiding messing with .gitmodules "manually".
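A rough illustration of that distinction via DataLad's Python API, where the CLI `-d` option corresponds to the `dataset` argument (paths are placeholders, and the behavior summarized in the comments follows the explanation above rather than a tested run):

```python
from datalad.api import save

SUPER = "/mnt/backup/dandi/dandisets"  # placeholder path for the super-dataset
SUB = f"{SUPER}/000255"                # placeholder path for one subdataset

# Without an explicit dataset, the operation scopes to the dataset containing
# the given path, i.e. it saves inside the subdataset and does not necessarily
# record the new subdataset state in the super-dataset.
save(path=SUB)

# With the super-dataset given explicitly (CLI: datalad save -d SUPER SUB),
# the changed state of the subdataset is saved into the super-dataset,
# and .gitmodules is left for DataLad to manage.
save(dataset=SUPER, path=SUB)
```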