CMU-SAFARI / RawHash

RawHash is the first mechanism that can accurately and efficiently map raw nanopore signals to large reference genomes (e.g., a human reference genome) in real-time without using powerful computational resources (e.g., GPUs). Described by Firtina et al. (published at https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440)
https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440
GNU General Public License v3.0
39 stars 5 forks source link

Use CMakeLists.txt instead of Makefile #4

Closed adamant-pwn closed 2 weeks ago

adamant-pwn commented 2 months ago

This pull request is supposed to implement the switch from Makefile to CMakeLists.txt. Making it a pull request rather than direct commit, as the change is significant, and there might still be potential discrepancies with prior behavior.

maximilianmordig commented 1 month ago

The docker build hang still needs to be resolved. Also, SetupRUClient.cmake should be cleaned up to properly include the cmake project rather than hardcode all the deps. We can merge it in the meantime to start using cmake, and fix that in a separate branch.

adamant-pwn commented 1 month ago

@maximilianmordig @canfirtina hi, any further feedback for this PR?

It now successfully builds RawHash with POD5, HDF5 and SLOW5 and runs rawhash2 -h. Build time in actions is around 30m if all 3 are used. Some further work is needed with ReadUntil, but it's blocked by incompatibility between the current state of RUclient-dependent code and readuntil_fake's default branch, which I'm not sure how to fix, see https://github.com/CMU-SAFARI/RawHash/pull/4#discussion_r1605526841.

Also it's my first time using Docker and manually setting up a GitHub Workflow with it, so please let me know if I missed anything important.

maximilianmordig commented 1 month ago

builds are too slow: cache the build directory as well (as an artifact)

You may run into an issue when submodules are updated because ExternalProject_Add did not work for me with slow5lib. I had to delete the build dir. ExternalProject_Add is not the best option.

Will merge it, so this can be fixed in another branch

maximilianmordig commented 1 month ago

Most urgent: We need shared libraries. I tried to change it and this led to a ton of problems that are related to how the external projects like hdf5 are imported. The paths need to be correctly set so that the shared libraries can be properly included. I created an example target in the cmake file using this lib: rawhash2_usinglib . After running cmake and setting the path appropriately, you can run rawhash2_usinglib rawhash2_usinglib: error while loading shared libraries: ../extern/slow5lib/lib/libslow5.so: cannot open shared object file: No such file or directory If you want, also see the notebook located at python_wrapper/example.ipynb. Issues of less priority: mold linker does not seem to be used although it is detected, ccache is also not used. Using PARENT_SCOPE seems like improper design. For example, for pod5, you should define an imported target and set the libraries on that target, so it is easy to include. Right now, it is a bit messy. resolve_pod5_url is misleading, it is doing much more that resolving the url. The logic is also flawed, e.g. for pod5: when setting pod5 dir, libraries are not set. When I set NOPOD5 to ON, it still tries to build it, which is incorrect.

adamant-pwn commented 4 weeks ago

Hi Max, thanks for the update! As a note, I used static libraries because that's what was used in the initial Makefile, and I didn't want to disrupt the logic, as I wasn't sure if there was a specific reasoning behind using them earlier.

For now, I pushed a quick fix that should now use proper path for slow5lib (at least it seems to work for me locally, let's see if CI also work with it). I would've implement it this way from the start if I knew that we need shared libraries, as previous configuration was like this because slow5lib's CMakeLists.txt doesn't seem to generate a static library, and somewhat annoyingly doesn't have an install target.

I can look further into other issues next week.

maximilianmordig commented 3 weeks ago

The goal is to clean up the build process, so make whatever changes are needed, following the best practices for cmake. Of course, functionality should be similar. For the logic, the main idea is to have debug/no-debug, disable certain file formats on demand (like slow5, pod5, fast5). Since compiling these external libs takes time, one should also think where to compile them to, so that rawhash clean can be done without having to recompile these external libs. When the end user wants to avoid compiling the external lib, it should be possible to set the path to the compiled library. Regarding the outputs, the cmake should also provide an install command with bin/ and lib/ directory. Ideally, rawhash can be integrated into other cmake projects.

I modified slow5tools at some point in my own fork (https://github.com/maximilianmordig/slow5lib), so feel free to do so as well.

maximilianmordig commented 3 weeks ago

include dir should also be there instead of having to use src/rawhash_wrapper.hpp.

maximilianmordig commented 3 weeks ago

ccache is not operating, please fix.

adamant-pwn commented 3 weeks ago

Hi Max! In the future, could you please leave comments in the "Files changed" section as a code review comments, whenever possible? This way it would be much easier to reply and keep track of them separately than in the general "Conversation" section.

builds are too slow: cache the build directory as well (as an artifact)

It is done now, and CI seems to take around 13m instead of 30m (not sure why it seemingly builds tflite or large portion of it from scratch anyway). But it's probably important to point out that the builds now will not reproduce exactly what happens on "clean" installation, so e.g. if we update default cached values, they will not be properly propagated.

I'm not sure what's the best way to deal with it. Add build/CMakeCache.txt, but not build to dockerignore? Or maybe ignore build directory cache on push to main branch, but use it in pull requests?..

Generally I'm a bit wary of using this approach, as it is likely to create some hard to catch issues. Are you sure that we should do this?

You may run into an issue when submodules are updated because ExternalProject_Add did not work for me with slow5lib. I had to delete the build dir. ExternalProject_Add is not the best option.

ExternalProject_Add allows us to avoid mingling scope. We could also use add_subdirectory or FetchContent as alternatives, but they're not uniformly applicable (e.g. won't work for POD5 which is downloaded), and they're preferred for projects which structure we understand very well and have some kind of control over it (see e.g. here), but it seems to me that third-party projects are generally shouldn't be considered such? With slow5lib in particular though, the issue seems to have been related with misconfiguration that didn't work for .so and should be fixed now.

We need shared libraries. I tried to change it and this led to a ton of problems that are related to how the external projects like hdf5 are imported. The paths need to be correctly set so that the shared libraries can be properly included. I created an example target in the cmake file using this lib: rawhash2_usinglib . After running cmake and setting the path appropriately, you can run rawhash2_usinglib rawhash2_usinglib: error while loading shared libraries: ../extern/slow5lib/lib/libslow5.so: cannot open shared object file: No such file or directory If you want, also see the notebook located at python_wrapper/example.ipynb.

Seems to be done.

mold linker does not seem to be used although it is detected, ccache is also not used.

Mold seems to already work after your updates. I also added CACHE STRING to CMAKE_C(XX)_COMPILER_LAUNCHER for ccache, and it seems to work too now. It might be needed to force-override them if it still doesn't work.

When I set NOPOD5 to ON, it still tries to build it, which is incorrect.

It works for me. Maybe NOPOD5=OFF got cached on your side? For me, running cmake -DNOPOD5=ON .. ensures that POD5 won't compile.


I'm also looking into some other issues that you've mentioned, will keep posting updates.