InsightSoftwareConsortium / ITK

Insight Toolkit (ITK) -- Official Repository. ITK builds on a proven, spatially-oriented architecture for processing, segmentation, and registration of scientific images in two, three, or more dimensions.
https://itk.org
Apache License 2.0
1.42k stars 664 forks source link

VNL and other third-party updates touch a lot of files #2314

Open phcerdan opened 3 years ago

phcerdan commented 3 years ago

For example the last commit updating VNL: https://github.com/InsightSoftwareConsortium/ITK/commit/df92e582d82

image

12 commits in VNL provoke almost 3 million deletions everywhere in ITK It's the update from upstream script having some hiccups? Is it expected?

I have noticed it when grabbing the history of a file, it always has these VNL and sometimes MetaIO massive patches on it, that do not seem related to the file.

For example. history ofitkEigenAnalysis2DImageFilter.h: image

Before 2016-03-02, it wasn't happening.

@blowekamp @hjmjohnson @thewtex

thewtex commented 3 years ago

@phcerdan is your ITK version up-to-date before running the script?

phcerdan commented 3 years ago

@thewtex I am not updating VNL or running the update script, I am just browsing the current git history of ITK. https://github.com/InsightSoftwareConsortium/ITK/commit/df92e582d82

thewtex commented 3 years ago

@hjmjohnson are you aware of style updates, etc., that may have caused this?

hjmjohnson commented 3 years ago

Yes, and a lot of clang-tidy changes as well to VNL core libraries. Yes, there are a lot of minuscule changes to VNL over the past year to fortify the codebase and start the process of modernizing.

To be honest, I don't quite understand how to resolve this issue.

phcerdan commented 3 years ago

I don't think it's a VNL issue. It must be about options/usage of UpdateFromUpstream.sh. Same happened with MetaIO here: https://github.com/InsightSoftwareConsortium/ITK/commit/17e744

mathstuf commented 3 years ago

Is the UpdateFromUpstream.sh in sync with the upstream script? I don't know how ITK uses it.

dzenanz commented 3 years ago

I think @bradking updated it a week ago via 41f4951f6b84cf8e24fe7e4fc7e8ec05f65d9f9d.

bradking commented 3 years ago

Modules/ThirdParty/VNL/UpdateFromUpstream.sh uses the older Utilities/Maintenance/UpdateThirdPartyFromUpstream.sh script. My update was to Utilities/Maintenance/update-third-party.bash.

bradking commented 3 years ago

I'll look at converting Modules/ThirdParty/VNL/UpdateFromUpstream.sh to use the newer script.

bradking commented 3 years ago

I've updated #2324 to switch to the new script.

bradking commented 3 years ago

2324 demonstrated an update with only a few files from VNL changed. Can this be closed?

phcerdan commented 3 years ago

Thanks @bradking for the improvement of the update script in #2324. Before your changes, the old UpdateThirdPartyFromUpstream.sh run by @thewtex didn't either show the pollution reported in this issue. So maybe it was a local issue caused by running the script with outdated sources.

If at some point in the future ITK rewrites its git history, we could solve the pollution generated from older updates of VNL and MetaIO.

I recommend to keep an eye when updating third parties still using the old UpdateThirdPartyFromUpstream.sh method to check that they don't touch 16.000 files.

phcerdan commented 1 year ago

A couple of years have passed, and history is still getting polluted when using UpdateFromStream scripts.

Updated history of the same example file shown in my original post: itkEigenAnalysis2DImageFilter.h. The MetaIO commit touches 16K files (from this PR: https://github.com/InsightSoftwareConsortium/ITK/pull/4116, in particular this commit: https://github.com/InsightSoftwareConsortium/ITK/pull/4116/commits/c42090c681114ddd1e2f090adf14d022ca41787f, note that the changes are not shown in the PR diff).

image

I am able to reproduce this when trying to update Eigen to another branch (from branch for/itk-3.4 to for/itk-master).

diff --git a/Modules/ThirdParty/Eigen3/UpdateFromUpstream.sh b/Modules/ThirdParty/Eigen3/UpdateFromUpstream.sh
index e6c37c161bc..8e72f68af75 100755
--- a/Modules/ThirdParty/Eigen3/UpdateFromUpstream.sh
+++ b/Modules/ThirdParty/Eigen3/UpdateFromUpstream.sh
@@ -10,7 +10,7 @@ readonly subtree="Modules/ThirdParty/Eigen3/src/itkeigen"
 # readonly repo="https://gitlab.com/libeigen/eigen"
 # readonly repo="https://gitlab.kitware.com/third-party/eigen.git"
 readonly repo="https://gitlab.kitware.com/phcerdan/eigen.git"
-readonly tag="for/itk-3.4"
+readonly tag="for/itk-master"
 readonly paths="
 Eigen/Cholesky
 Eigen/CholmodSupport

Here are the logs after running the UpdateFromStream.sh for Eigen3 after that change.

update_from_upstream_touch16k_files.txt

The update seems successful, but it introduces a commit touching 16K files.

commit bec339ee6f2e809752844a8489761ddfe18271b7 (HEAD -> update_eigen_to_master)
Merge: 217a54a40b7 522968224c9
Author: Pablo Hernandez-Cerdan <pablo.hernandez.cerdan@outlook.com>
Date:   Mon Oct 23 00:35:46 2023 +0200

    Merge branch 'upstream-Eigen3' into update_eigen_to_master

    # By Eigen Upstream
    * upstream-Eigen3:
      Eigen3 2023-10-22 (9cd4aec6)

commit 522968224c93dfd0f2ead1eb24d882d7948df40c
Author: Eigen Upstream <kwrobot@kitware.com>
Date:   Sun Oct 22 01:42:43 2023 +0200

    Eigen3 2023-10-22 (9cd4aec6)

    Code extracted from:

        https://gitlab.kitware.com/phcerdan/eigen.git

    at commit 9cd4aec6122b29a37e8de3fbfff47de3af0fc371 (for/itk-master).

See PR #4265 as an example when updating Eigen.

mathstuf commented 1 year ago

Was this branch rebased at some point? Third party import merge commits cannot just be rebased; I typically remove the commit from the rebase -i and replace it with exec ThirdParty/import/update.sh to rerun the import during the rebase.

mathstuf commented 1 year ago

Note that the robot does have a check to enforce that the merge happens properly and will detect and block any rebases like this. It does not seem the robot is enforcing this for ITK at the moment. I'll caution that it means that all changes to third parties must go through the import script.

The git-check-third-party script from this repo will check the state of an import. For the Eigen import it reports:

% git check-third-party Modules/ThirdParty/Eigen3/src/itkeigen
fatal: not imported via a merge commit

This import likely needs redone. Once performed, we can start enforcing all such imports.

State of imports:

phcerdan commented 1 year ago

Was this branch rebased at some point? Third party import merge commits cannot just be rebased;

Yes, it was rebased.

Before:

- Commit2
- Commit1
- HEAD external eigen (release 3.4)

I updated the HEAD of external eigen to master:

After:

- NewCommit
- HEAD external eigen (master)

I typically remove the commit from the rebase -i and replace it with exec ThirdParty/import/update.sh to rerun the import during the rebase.

I would like to try this. However, how do you interact with both ITK (or VTK), and the third party module in that way? I see that the update script creates (and then removes) a work/extract and work/upstream. Is in that work/upstream where you perform that rebase?

The git-check-third-party script from this repo will check the state of an import. For the Eigen import it reports: % git check-third-party Modules/ThirdParty/Eigen3/src/itkeigen fatal: not imported via a merge commit This import likely needs redone. Once performed, we can start enforcing all such imports.

It seems that we should enforce that check from now on in ITK.

mathstuf commented 1 year ago

Is in that work/upstream where you perform that rebase?

The import is not (and should not be) rebased. That is what git rebase does not understand. Using exec to rerun the import just redoes the merge using the proper -Xsubtree option (also not transcribed in the history for things like verification or reuse when reapplying the commit via either cherry-pick or rebase).

It seems that we should enforce that check from now on in ITK.

I agree. The first thing to do would be to get the "local diff" instances' diffs upstreamed into the imported repository. These can then be trivially reimported and should be "fine" according the the script.

I would also recommend using (unannotated) tags in the import repos to preserve imported contents "forever". VTK started using for/vtk-YYYYMMDD[.N]-BASE where BASE is either the tag or something like master-gXXXXXXX with XXXXXXX the shorthash of the base commit and master the branch name. This allows the for/vtk branch to be rebased without worrying about "losing" the source for the import due to repository pruning over time.

phcerdan commented 1 year ago

I have been trying to remove current subtree (itkeigen), and try a fresh update-from-upstream, but getting conflicts in the git read-tree -u --prefix="$subtree-tmp/" "upstream-$name" with the files I have modified upstream (.gitattributes, CMakeLists.txt).

I typically remove the commit from the rebase -i and replace it with exec ThirdParty/import/update.sh

I still don't know how to perform the operation you suggested:

What commit from the ITK history do you remove and replace with the exec update?

Pinging @thewtex

mathstuf commented 1 year ago

What commit from the ITK history do you remove and replace with the exec update?

What is in master is there. I'm talking about when you use git rebase -i and see a bunch of pick commands. I remove the commit related to the import and replace it with exec path/to/update.sh.

I have been trying to remove current subtree (itkeigen), and try a fresh update-from-upstream, but getting conflicts in the git read-tree -u --prefix="$subtree-tmp/" "upstream-$name" with the files I have modified upstream (.gitattributes, CMakeLists.txt).

Chuck added a mode where it detects an empty directory and redoes the import.

thewtex commented 1 year ago

The update seems successful, but it introduces a commit touching 16K files.

We are working with branches that have different histories and subtrees -- the logs may be misleading on how many files are actually being touched.

See PR #4265 as an example when updating Eigen.

The update here looks good at a high level.

I have been trying to remove current subtree (itkeigen), and try a fresh update-from-upstream, but getting conflicts in the git read-tree -u --prefix="$subtree-tmp/" "upstream-$name" with the files I have modified upstream (.gitattributes, CMakeLists.txt).

If both upstream and ITK edited the files, then a conflict resolution may be necessary.

mathstuf commented 12 months ago

the logs may be misleading on how many files are actually being touched.

A rebase put the imported tree into the main ITK history without an -Xsubtree option. The diff of the commit itself is bogus.

If both upstream and ITK edited the files, then a conflict resolution may be necessary.

This ended up being an endless source of extra work when rebasing the import repository. This is why it is preferred that all changes to the import come through update.sh and never through manual patching.

phcerdan commented 12 months ago

Was this branch rebased at some point? Third party import merge commits cannot just be rebased; I typically remove the commit from the rebase -i and replace it with exec ThirdParty/import/update.sh to rerun the import during the rebase.

I understand now. I thought you were referring about the third-party project being rebased. So, no, I am not rebasing inside the ITK master branch. But the third-party project was rebased. Good trick to know though! I tried it for this current situation and get the same result (as it should be, because I am not rebasing ITK history).

The update here looks good at a high level.

It looks fine in the PR, but the first of the two commits is touching all the ITK files.

The diff of the commit itself is bogus.

It might be, but it pollutes the git history of all ITK files, which I think is extremely annoying.


So, at the end I got it solved (I think!).

As background, Eigen used to be stored in bitbucket, and we were using an non-official mirror for git. However, Eigen officially moved to gitlab,, the mirror became obsolete, and all the git hashes changed, like a history rewrite.

That's why I was getting these merge problems, the for/itk-3.4 and for/itk-master have unrelated histories.

So, I locally removed rm -r ./Modules/ThirdParty/Eigen3/src/itkeigen the subtree. Modified https://github.com/phcerdan/ITK/blob/f1c48dc682fbab83563a6455ee548860fda3790f/Utilities/Maintenance/update-third-party.bash#L225

git read-tree -u --prefix="$subtree/" "upstream-$name" for git read-tree -u --prefix="$subtree-delme/" "upstream-$name" And commented out the last two commands:

    git read-tree -u --prefix="$subtree-delme/" "upstream-$name"
fi
# git commit --no-edit
# git branch -d "upstream-$name"

Then move: mv ./Modules/ThirdParty/Eigen3/src/itkeigen-delme ./Modules/ThirdParty/Eigen3/src/itkeigen

git restore Utilities/Maintenance/update-third-party.bash
git add -A
git commit --no-edit
git branch -d upstream-Eigen3

These changes result in only the itkigen subtree files being touched, instead of all ITK files.