Closed: daljit46 closed this pull request 2 months ago.
clang-tidy review says "All clean, LGTM! :+1:"
Makes sense to me. Hopefully this would also, by design, remove some of the weird annoyances around submodules: not changing commit hash on a main-repo branch change, or getting mistakenly included in `git commit -a`. I'm also thinking that any divergences in test data between code branches should be less likely.
The outstanding relevant question is whether there should remain two test data repositories, or whether they should be merged into one. (And the "unit test" data added in #2678 would naturally go there also.) But that can perhaps be addressed as part of #2824.
Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.
There might be a slight disadvantage to developer workflow, but it is tolerable.
Whenever an update to the test data is required, typically one would run `run_tests` and check that it succeeds. Here, I presume the workflow will be to run `ctest` and check that it succeeds?
> Also: Would this be a preferable alternative to submodules for other external dependencies (#2584)? I suppose the disadvantage there is that IDEs wouldn't be able to find the code for the relevant functions unless you put the code somewhere else and pointed the IDE to it yourself.
I think for this purpose FetchContent is more suitable than ExternalProject, as it makes content available immediately, allowing it to be used at the configuration stage.
Something like this would work:

```cmake
include(FetchContent)
FetchContent_Declare(json
    URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz)
FetchContent_MakeAvailable(json)
target_link_libraries(mrtrix-core PRIVATE nlohmann_json::nlohmann_json)
```
and IDEs shouldn't have trouble picking this up.
Would like to get this merged too unless there are any objections.
Before we do, I'm not sure I get the rationale.
My understanding of the original issue was that the testing ought to run from within the build directory. I kind of get that bit. But then there's a suggestion to download the test data into the build directory, and that's where I get a bit lost. Does that mean that if we delete the build folder, the system will have to download all the test data again? That's a fair bit of data... I'm hoping that the ExternalProject feature somehow gets around that issue? Otherwise I think this is going to prove very impractical...
Hopefully this is just my ignorance about how that feature works, and a quick bit of education will sort that out...
Yes, the data will be cloned in the build directory. However, the original issue #2837 was to address the problem that our source directory can get polluted with temporary files produced by execution of tests. This also cleanly allows a different data set to be cloned when a developer changes their local branch. I agree that it can be quite annoying to have to re-download the data if you delete the build directory. However, the cloning time is probably a very small fraction of the total build time (at least on a fast enough network).
In #2859, we have the same problem and I implemented an alternative solution: downloaded dependencies are cloned into `${PROJECT_SOURCE_DIR}/_deps` (which is also added to `.gitignore`), so that even if you delete the build directory, no download is necessary. If wanted, we can implement the same solution here.
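A minimal sketch of that alternative, assuming FetchContent is used: `FETCHCONTENT_BASE_DIR` is the standard CMake cache variable controlling where FetchContent stores its downloads.

```cmake
include(FetchContent)

# Store downloads under the source tree so they survive deletion of the
# build directory; _deps must then be listed in .gitignore.
set(FETCHCONTENT_BASE_DIR "${PROJECT_SOURCE_DIR}/_deps" CACHE PATH
    "Directory in which downloaded dependencies are stored")

FetchContent_Declare(json
    URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz)
FetchContent_MakeAvailable(json)
```

On a re-run of CMake, FetchContent finds the already-populated directory and skips the download entirely.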
OK, so this was a valid concern...
On my home network, we're talking minutes to download all the test data for the scripts, much longer than the actual build time, especially now that we're using `ccache`. If you have an alternative solution that gets around that, I think it would be better to go with that.
Regarding re-downloading of test data, I also have developed the habit of deleting a build directory and then re-constructing. However, #2859 has these suggestions:

```shell
cmake -B build --fresh               # reconfigures the project, deleting any CMakeCache.txt
cmake --build build --target clean   # cleans any build artifacts
cmake --build build --clean-first    # same as the previous line, but triggers a build
```
So it might be possible to avoid repetitive long downloads just by changing developer practice.
We must however consider the case where one has multiple build directories:
Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for #2859 also.
I'd previously discussed the prospect of having environment variables, interpreted at configure time, for loading external dependencies from the local filesystem into the build directory rather than pulling them from the internet. Hypothetically, that argument could extend to the test data. If the developer has a test data repository cloned on their own system, and that clone contains the commit to which the source code repository currently refers, then the test data repository could be copied into the build directory and the requisite commit checked out (indeed, if that local clone is already pointing to the right commit, then just the data content could be pulled, skipping unnecessary duplication of git content). Perhaps not worth the effort if simply avoiding build-directory deletion resolves the issue, but I thought it worth mentioning.
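The steps above can be sketched with throwaway repositories (all paths here are illustrative, not the actual MRtrix3 layout); note that `git clone` from a local path uses hardlinks and touches no network:

```shell
# Sketch: populate a build directory's test data from a local clone.
set -e
work=$(mktemp -d)

# Stand-in for a developer's existing local clone of the test data repo:
git init -q "$work/local_test_data"
git -C "$work/local_test_data" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "test data"
commit=$(git -C "$work/local_test_data" rev-parse HEAD)

# Configure-time step: clone from the local filesystem (no network access)
# into the build directory, then check out the commit the source tree pins.
git clone -q "$work/local_test_data" "$work/build/testing_data"
git -C "$work/build/testing_data" checkout -q "$commit"
echo "checked out $commit from local clone"
```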
> Given the latter, I'm uncomfortable having non-repository files requisite for building living outside of a build directory; both here and potentially for https://github.com/MRtrix3/mrtrix3/pull/2859 also.
I tend to agree with this. For #2859, cloning in the build directory is even less of an issue, as downloading the tarballs for Eigen, JSON for Modern C++, and Half only takes a few seconds.
@Lestropie @jdtournier I can't exactly recall what the verdict was on this issue, but did we decide to allow for the possibility of specifying a local URL for the data directory (so that the data can be cloned from there into the build directory)? If yes, what should be the default behaviour?
> did we decide to allow for the possibility of specifying a local URL for the data directory
Yes, that was more or less the conclusion we came to. Basically, the behaviour would be to always `git clone` the test data repo into the build directory and check out the relevant commit. This would allow any internet-connected user to run the tests without any further customisation. The URL of the main test data repo would be hard-coded and used by default, but the user will be able to override that URL through an environment variable (or whichever other mechanism is most CMake-compatible), so that the user could specify a local clone of the main repo instead, which would then reside on their own system (and they'd be responsible for updating it if necessary). This would also allow them to make whatever changes are required to the test data on their own system, and check that it all works before uploading the lot to GitHub.

Does that sound reasonable?
Sounds right to me.
Only addition would be that if the requested commit is not available, the resulting error could be different depending on whether the default GitHub repo or a manually-specified repo was used. In the former case, the issue could be either an erroneously specified commit, or it could be a failure to push committed changes to GitHub; whereas in the latter case, in addition to an erroneously specified commit it could be that the manually specified local repo is not fully up to date.
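As a sketch of the behaviour described above (the repository URL, target and variable names, and commit hash are placeholders, not necessarily the ones used in this PR):

```cmake
include(ExternalProject)

# Hard-coded default URL, overridable through an environment variable so
# that a developer can point at a local clone instead.
set(binaries_data_url "https://github.com/MRtrix3/test_data.git")
if(DEFINED ENV{MRTRIX_BINARIES_DATA_DIR})
    set(binaries_data_url "$ENV{MRTRIX_BINARIES_DATA_DIR}")
endif()

ExternalProject_Add(binaries_testing_data
    GIT_REPOSITORY    "${binaries_data_url}"
    GIT_TAG           0123456789abcdef0123456789abcdef01234567  # pinned commit (placeholder)
    SOURCE_DIR        "${CMAKE_CURRENT_BINARY_DIR}/testing_data"
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""
    INSTALL_COMMAND   "")
```

Leaving the configure/build/install commands empty makes the external project a pure data fetch: only the clone and checkout steps run.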
I have now rebased the changes, allowing the user to specify the `MRTRIX_BINARIES_DATA_DIR` and `MRTRIX_SCRIPTS_DATA_DIR` environment variables to set a local git directory containing the testing data. Do we want to stick with CMake cache variables instead (e.g. via specifying `-DMRTRIX_BINARIES_DATA=...`)?
> Only addition would be that if the requested commit is not available, the resulting error could be different depending on whether the default GitHub repo or a manually-specified repo was used.
To clone the data we are using `ExternalProject_Add`, and from CMake's documentation:

> If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.
> Do we want to stick with CMake cache variables instead?
I think the projected workflow for developers here is that one would have a single clone of the test data repository, make all changes there, and just set an environment variable to point to that location so that all of their builds can make use of it. So while a CMake variable would be more conceptually faithful in terms of configuring a build system, if it's only there for expediting developer workflow, I think an environment variable is what would be most convenient for me.
> If the local clone already has the commit corresponding to the hash, no git fetch needs to be performed to check for changes each time CMake is re-run. This can result in a significant speed up if many external projects are being used.
Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.
Also, this quote references what happens on re-runs of CMake, not the initial configuration.
I'll try to test.
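One quick way to probe this locally: `git cat-file -e` exits zero iff the named object exists in the repository. The repository and hashes below are throwaway stand-ins for what the configure step would need to decide.

```shell
# Probe: does a local clone contain a given commit?
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m init
have=$(git -C "$repo" rev-parse HEAD)
missing=0123456789abcdef0123456789abcdef01234567

git -C "$repo" cat-file -e "${have}^{commit}" \
    && echo "commit present: local clone can be used"
git -C "$repo" cat-file -e "${missing}^{commit}" 2>/dev/null \
    || echo "commit absent: would need to fall back to cloning from GitHub"
```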
> Doesn't quite address the question, which was specifically whether there is explicit feedback to the developer that an attempt was made to use a specified local clone but that the requested commit wasn't available there so it is falling back to doing a complete clone from GitHub. I'm guessing if it's an existing function it will report something sensible in that scenario.
My understanding here is that `ExternalProject_Add` will internally use `git clone` and then `git checkout`, thus one should get the same error messages as when using `git` directly (although CMake will report a download error and provide a path to a log file containing the actual full error message).
@MRtrix3/mrtrix3-devs I believe this is now ready to be merged.
This PR aims to address #2837. The idea is to remove the Git submodule for testing data from the codebase; instead, testing data will be cloned into the build directory at build time. This is made possible by using CMake's `ExternalProject`.