Closed: jwodder closed this issue 2 years ago
yikes, nice catch. and wow -- we have them repeated up to 48 times!
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep submodule .gitmodules | sort | uniq -c | sort -n | tail
48 [submodule "000245"]
48 [submodule "000246"]
48 [submodule "000249"]
48 [submodule "000251"]
48 [submodule "000252"]
48 [submodule "000255"]
48 [submodule "000288"]
48 [submodule "000290"]
48 [submodule "000292"]
48 [submodule "000293"]
did you figure out how that could have happened? git submodule itself doesn't seem to allow that:
(dandisets) dandi@drogon:/tmp/dandisets$ git submodule add https://github.com/dandisets/000255.git 000255
fatal: '000255' already exists in the index
if we can't figure out what causes it, let's proliferate the code with calls to some naive assert_no_duplicates_in_gitmodules function, invoked before/after every location which could potentially modify that file, like that `submodule add`. Eventually it should raise an exception, I guess.
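A minimal sketch of such a guard, assuming the function name suggested above (assert_no_duplicates_in_gitmodules is only proposed in this thread, not existing backups2datalad code); it just counts [submodule "..."] section headers and raises on repeats:

```python
import re
from collections import Counter
from pathlib import Path

def assert_no_duplicates_in_gitmodules(path=".gitmodules"):
    """Raise if any [submodule "..."] section header appears more than once."""
    text = Path(path).read_text()
    # Section headers are unindented lines like: [submodule "000255"]
    names = re.findall(r'^\[submodule "(.+?)"\]', text, flags=re.M)
    dups = sorted(name for name, count in Counter(names).items() if count > 1)
    if dups:
        raise RuntimeError(f"{path}: duplicate submodule sections: {', '.join(dups)}")
```

Calling it before and after any code path that touches .gitmodules would at least pinpoint which operation introduces the duplicates.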
It suggests that #221 was the culprit, since right after that change the addition of a new submodule 000288 in cc5b320d9e9dc3f8e7a7fd0edc957519302c5157 added that huge duplicated copy for the first time.
Note that around the same point in time we might have upgraded git-annex from conda-forge, and thus possibly git as well...
For now I have disabled that cronjob, so that whenever it completes we can safely bring .gitmodules back to a sensible state.
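One way to bring the file back to a sensible state could be a one-off cleanup that keeps only the first copy of each section. A sketch (my own illustration, not code from this repo), with the caveat that if later copies contain keys missing from the first copy, those would be lost and should be checked by hand:

```python
import re
from pathlib import Path

def dedup_gitmodules(path=".gitmodules"):
    """Rewrite the file, keeping only the first occurrence of each
    [submodule "..."] section (the header plus its indented key lines)."""
    seen = set()
    keep = True  # whether lines of the current section are retained
    out = []
    for line in Path(path).read_text().splitlines(keepends=True):
        m = re.match(r'\[submodule "(.+?)"\]', line)
        if m:
            name = m.group(1)
            keep = name not in seen
            seen.add(name)
        if keep:
            out.append(line)
    Path(path).write_text("".join(out))
```

After rewriting, a plain `git diff` of .gitmodules against the last known-good commit would confirm nothing legitimate was dropped.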
@yarikoptic It appears that the duplication is due to DataLad. When adding a subdataset, backups2datalad ensures that the necessary section is present in .gitmodules, and then it calls datalad save path/to/subdataset .gitmodules, and at this point DataLad adds the duplicate section. This is interesting, as the whole reason I had to make backups2datalad add the section to .gitmodules itself was because DataLad wasn't doing it.
oh - nice find! Could you please file an issue against DataLad for that with some reproducer (it could just use some existing superdataset from anywhere, even dandisets from GitHub)?
But also, why don't we use the datalad save -d path/to/dandisets path/to/subdataset form (i.e. specifying the dandisets superdataset)?
Issue filed: https://github.com/datalad/datalad/issues/6843
But also, why don't we use the datalad save -d path/to/dandisets path/to/subdataset form (i.e. specifying the dandisets superdataset)?
Because I never considered that -d path/to/current/dataset would make a difference when running in that dataset. Why does it?
Because that is how some commands behave (I elaborated more in the issue you filed: https://github.com/datalad/datalad/issues/6775#issuecomment-1185791134): they do not modify the current dataset with changes about (possible) subdatasets unless -d is provided, since they "scope" into operating within the datasets given in the target PATH. Specifying -d scopes the operation to be performed within that dataset, thus saving the changed state of the subdataset into it. So I guess the situation could be addressed (before we address/release a fix for that new issue you filed) simply by providing our top-level dataset and avoiding messing with .gitmodules "manually".
Over the course of several commits starting on the eleventh, the .gitmodules file in this repository somehow ended up with its content appended to itself (in a jumbled order) multiple times, leading to warnings from Git. The file needs to be cleaned up and the cause dealt with. My best guess is that something went wrong with this check for whether a submodule was already registered.
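For reference, the kind of "is this submodule already registered?" check in question could look roughly like the sketch below (the function name and signature are hypothetical illustrations, not the actual backups2datalad code). As the thread later shows, even a correct check does not help if DataLad itself re-adds the section afterwards:

```python
import re
from pathlib import Path

def submodule_registered(gitmodules, name):
    """True if a [submodule "<name>"] section header already exists.

    Hypothetical helper for illustration only; the real check lives in
    backups2datalad.
    """
    try:
        text = Path(gitmodules).read_text()
    except FileNotFoundError:
        return False
    pattern = r'^\[submodule "%s"\]' % re.escape(name)
    return re.search(pattern, text, flags=re.M) is not None
```

The check-then-write pattern is inherently racy against another writer of the same file, which is consistent with the observed double sections when both backups2datalad and datalad save modify .gitmodules.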