MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density.
http://www.mrtrix.org
Mozilla Public License 2.0

Remove git submodule for testing data #2848

Closed daljit46 closed 3 weeks ago

daljit46 commented 7 months ago

This PR aims to address #2837. The idea is to remove the Git submodule for testing data from the codebase. Instead, testing data will be cloned in the build directory at build time. This is made possible by using CMake's ExternalProject.
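A minimal sketch of what this could look like, assuming a hypothetical data repository URL and pinned commit (the actual values and target names used by the PR may differ):

```cmake
include(ExternalProject)

# Fetch the testing data into the build tree at build time.
# The repository URL and commit hash below are illustrative placeholders.
ExternalProject_Add(testing_data
    GIT_REPOSITORY "https://github.com/MRtrix3/test_data.git" # placeholder URL
    GIT_TAG        0123abc # placeholder for the commit pinned by the source tree
    SOURCE_DIR     "${CMAKE_BINARY_DIR}/testing_data"
    # The data requires no configure/build/install steps:
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""
    INSTALL_COMMAND   ""
)
```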

github-actions[bot] commented 7 months ago

clang-tidy review says "All clean, LGTM! :+1:"


Lestropie commented 7 months ago

Makes sense to me. Hopefully this would also by design remove some of the weird annoyances around submodules not changing commit hash on main repo branch change / getting mistakenly included in git commit -a. I'm also thinking that any divergences in test data between code branches should be less likely.

The outstanding relevant question is whether there should remain two test data repositories, or whether they should be merged into one. (And the "unit test" data added in #2678 would naturally go there also) But that can perhaps be addressed as part of #2824.

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

Lestropie commented 7 months ago

There might be a slight disadvantage to developer workflow, but it is tolerable.

Whenever an update to the test data is required, typically one would:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to root
  4. Ensure that run_tests succeeds
  5. Navigate back into the sub-directory occupied by the test data
  6. Generate commit and push
  7. Navigate back to root
  8. Stage submodule commit hash
  9. Generate commit that may also include corresponding changes to the tests

Here, I presume the workflow will be:

  1. Navigate into the sub-directory occupied by the test data
  2. Make changes to the test data
  3. Navigate back to build directory
  4. Ensure that ctest succeeds
  5. Externally clone the test data repository
  6. Explicitly match the commit hash of that clone with what the source repository was pointing to
  7. Duplicate the changes that were made in the build directory's clone of the test data
  8. Generate commit and push
  9. Navigate back to root of source repository
  10. Stage external project commit hash
  11. Generate commit that may also include corresponding changes to the tests

?

daljit46 commented 7 months ago

Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.

I think for this purpose, FetchContent is more suitable than ExternalProject, as it makes content available immediately, allowing it to be used at the configure stage. Something like this would work:

include(FetchContent)

FetchContent_Declare(json URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz)
FetchContent_MakeAvailable(json)

target_link_libraries(mrtrix-core PRIVATE nlohmann_json::nlohmann_json)

and IDEs shouldn't have trouble picking this up.

daljit46 commented 2 months ago

Would like to get this merged too unless there are any objections.

jdtournier commented 2 months ago

Before we do, I'm not sure I get the rationale.

My understanding of the original issue was that the testing ought to run from within the build directory. I kind of get that bit. But then there's a suggestion to download the test data into the build directory, and that's where I get a bit lost. Does that mean that if we delete the build folder, the system will have to download all the test data again? That's a fair bit of data... I'm hoping that the ExternalProject feature somehow gets around that issue? Otherwise I think this is going to prove very impractical...

Hopefully this is just my ignorance about how that feature works, and a quick bit of education will sort that out...

daljit46 commented 2 months ago

Yes, the data will be cloned into the build directory. However, the original issue #2837 was about the problem that our source directory can get polluted with temporary files produced by test execution. This approach also cleanly allows a different data set to be cloned when a developer changes their local branch. I agree that it can be quite annoying to have to re-download the data if you delete the build directory, although the cloning time is probably a very small fraction of the total build time (at least on a fast enough network).

In #2859, we have the same problem and I implemented an alternative solution: downloaded dependencies are cloned into ${PROJECT_SOURCE_DIR}/_deps (which is also added to .gitignore), so that even if you delete the build directory, no download is necessary. If desired, we can implement the same solution here.
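For reference, a minimal sketch of that alternative, assuming it is implemented by redirecting FetchContent's download location (the actual #2859 implementation may differ):

```cmake
# Store downloaded dependencies under the source tree rather than the
# build tree, so that deleting the build directory does not force a
# re-download. ${PROJECT_SOURCE_DIR}/_deps would also be gitignored.
set(FETCHCONTENT_BASE_DIR "${PROJECT_SOURCE_DIR}/_deps"
    CACHE PATH "Location for FetchContent downloads")
```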

jdtournier commented 2 months ago

OK, so this was a valid concern...

On my home network, we're talking minutes to download all the test data for the scripts. Much longer than the actual build time - especially now that we're using ccache. If you have an alternative solution that gets around that, I think it would be better to go with that.

Lestropie commented 2 months ago

Regarding re-downloading of test data, I also have developed the habit of deleting a build directory and then re-constructing. However #2859 has these suggestions:

cmake -B build --fresh # It reconfigures the project deleting any CMakeCache.txt
cmake --build build --target clean # cleans any build artifacts
cmake --build build --clean-first # Same as previous line but triggers a build

So it might be possible to avoid repetitive long downloads just by changing developer practice.

We must however consider the case where one has multiple build directories.

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.

I'd previously discussed the prospect of having environment variables, interpreted at configure time, for loading external dependencies from the local filesystem into the build directory rather than pulling them from the internet. Hypothetically, that argument could extend to the test data. If the developer has a test data repository cloned on their own system, and that clone contains the commit to which the source code repository currently refers, then the test data repository could be copied into the build directory and the requisite commit checked out (indeed, if that local clone is already pointing to the right commit, then just the data content could be pulled, skipping unnecessary duplication of git content). Perhaps not worth the effort if simply avoiding build directory deletion resolves the issue, but I thought it worth mentioning.

daljit46 commented 2 months ago

Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for https://github.com/MRtrix3/mrtrix3/pull/2859 also.

I tend to agree with this. For #2859, cloning into the build directory is even less of an issue, as downloading the tarballs for Eigen, Json for Modern C++ and Half only takes a few seconds.

daljit46 commented 1 month ago

@Lestropie @jdtournier I can't exactly recall what the verdict on this issue was, but did we decide to allow for the possibility of specifying a local URL for the data directory (so that we can clone the data into the build directory)? If yes, what should the default behaviour be?

jdtournier commented 1 month ago

did we decide to allow for the possibility of specifying a local URL for the data directory

Yes, that was more or less the conclusion we came to. Basically, the behaviour would be to always git clone the test data repo into the build directory, and checkout the relevant commit. This would allow any internet-connected user to run the tests without any further customisation. The URL of the main test data repo would be hard-coded and used by default, but the user will be able to override that URL through an environment variable (or whichever other mechanism is most CMake-compatible), so that the user could specify a local clone of the main repo instead, which would then reside on their own system (and they'd be responsible for updating it if necessary). This would also allow them to make whatever changes are required to the test data on their own system, and check that it all works before uploading the lot up to GitHub.
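A hedged sketch of that mechanism (the environment variable name, default URL, and pinned-commit placeholder below are illustrative, not the names actually adopted):

```cmake
include(ExternalProject)

# Use the hard-coded upstream repository by default, but allow the
# developer to point at a local clone via an environment variable.
# All names below are illustrative.
if(DEFINED ENV{MRTRIX_TEST_DATA_URL})
    set(test_data_url "$ENV{MRTRIX_TEST_DATA_URL}")
else()
    set(test_data_url "https://github.com/MRtrix3/test_data.git")
endif()

ExternalProject_Add(test_data
    GIT_REPOSITORY    "${test_data_url}"
    GIT_TAG           0123abc # placeholder for the commit pinned in the source tree
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""
    INSTALL_COMMAND   ""
)
```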

Does that sound reasonable?

Lestropie commented 1 month ago

Sounds right to me.

Only addition would be that if the requested commit is not available, the resulting error could differ depending on whether the default GitHub repo or a manually-specified repo was used. In the former case, the issue could be either an erroneously specified commit, or a failure to push committed changes to GitHub; in the latter case, in addition to an erroneously specified commit, it could be that the manually-specified local repo is not fully up to date.

github-actions[bot] commented 1 month ago

clang-tidy review says "All clean, LGTM! :+1:"

daljit46 commented 1 month ago

I have now rebased the changes; the user can now specify the MRTRIX_BINARIES_DATA_DIR and MRTRIX_SCRIPTS_DATA_DIR environment variables to point to a local git directory containing the testing data. Do we want to stick with CMake cache variables instead (e.g. via specifying -DMRTRIX_BINARIES_DATA=...)?

Only addition would be that if the requested commit is not available, the resulting error could be different depending on whether the default GitHub repo or a manually-specified repo was used.

To clone the data we are using ExternalProject_Add, and from CMake's documentation:

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.

Lestropie commented 1 month ago

Do we want to stick with CMake cache variables instead?

I think the projected workflow for developers here is that one would have a single clone of the test data repository, make all changes there, and just set an environment variable to point to that location so that all of their builds can make use of it. So while a CMake variable would be more conceptually faithful in terms of configuring a build system, if it's only there to expedite developer workflow, I think an environment variable would be maximally convenient for myself.

If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

Also this quote references what happens for re-runs of cmake, not for the initial configuration.

I'll try to test.

daljit46 commented 1 month ago

Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.

My understanding here is that ExternalProject_Add will internally use git clone and then git checkout, so one should get the same error messages as when using git directly (although CMake will report a download error and provide the path to a log file containing the full error message).

daljit46 commented 3 weeks ago

@MRtrix3/mrtrix3-devs I believe this is now ready to be merged.
