SSHash as git submodule

jermp commented 2 years ago

I think that it would be better if SSHash were used as a submodule, to better highlight the "separation of concerns". The only current modifications are done in the streaming query object, but that can also be reflected in the SSHash repo. Or, are there any hard limits to this?

For the moment, we can leave everything as it is, of course. But in the long term, I think piscem should highlight the composition SSHash+contig-table.

rob-p commented 2 years ago

I think this makes sense, but right now there are some other changes that go a bit deeper that we would need to resolve. For example a few other files in the sshash source tree have been modified — for example, the builder was modified to consume the cuttlefish reduced GFA format, some functions were added to the utilities file, and some classes were made friends of other classes because of access requirements. Also, I just a couple of days ago moved a bunch of logging messages to spdlog so that I could easily have a "quiet" build.

I agree having composability of source code as well as conceptually would be very nice. We could figure out what outside of the standard API would need to be exposed and figure out a minimal way to do that.

jermp commented 2 years ago

Oh ok sure, I forgot about the modified builder! Yes, that is something to think about and to be done as ultimate step maybe. Thanks @rob-p!

jermp commented 2 years ago

Keep track to modifications/additions to SSHash:

add option to parse cuttlefish GFA format (is it similar to "regular" GFA or it has some special constraints?)
add needed utilities to util.hpp
expose access to external code if needed
better logging function similar to spdlog::info

others?

rob-p commented 2 years ago

add option to parse cuttlefish GFA format (is it similar to "regular" GFA or it has some special constraints?)

It's even simpler; each line is basically just a numeric identifier, followed by the unitig it labels.

add needed utilities to util.hpp

Yes, and we can decide what code belongs in sshash proper and what belongs in piscem-cpp. One thing I've not been good about so far is separating them in terms of structure (the code for piscem-cpp related stuff lives in the same source tree as the sshash stuff). We might want to think about a better logical separation in terms of the file tree.

expose access to external code if needed

Yup — not sure what exactly has changed here, but not too much I think.

better logging function similar to spdlog::info

So that one I just sort of lept in on, as I'm super biased in favor of spdlog for C++ logging. However, we can obviously consider changing it out for a similarly capable library. The big features of spdlog that are made use of is the fact that it trivially enables efficient multi-threaded logging, and also, it's easy to log at different levels and change those at runtime (using spdlog::info, spdlog::warn, spdlog::critical and such). So that makes implementing a "quiet" flag very easy (i.e. just don't log anything below the warn level).

jermp commented 2 years ago

add option to parse cuttlefish GFA format (is it similar to "regular" GFA or it has some special constraints?)

It's even simpler; each line is basically just a numeric identifier, followed by the unitig it labels.

Oh yes, I think you passed me some examples some time ago. Ok, so, instead of the usual single-line .fasta format

>0
line0
>1
line1
...

we just have

0 line0
1 line1
...

add needed utilities to util.hpp

Yes, and we can decide what code belongs in sshash proper and what belongs in piscem-cpp. One thing I've not been good about so far is separating them in terms of structure (the code for piscem-cpp related stuff lives in the same source tree as the sshash stuff). We might want to think about a better logical separation in terms of the file tree.

Yes, exactly the same I have in mind. That will also help very much in conveying the main take-home message of the library: easy composition with a minimal code wrapper on top, linking all the parts together.

better logging function similar to spdlog::info

So that one I just sort of lept in on, as I'm super biased in favor of spdlog for C++ logging. However, we can obviously consider changing it out for a similarly capable library. The big features of spdlog that are made use of is the fact that it trivially enables efficient multi-threaded logging, and also, it's easy to log at different levels and change those at runtime (using spdlog::info, spdlog::warn, spdlog::critical and such). So that makes implementing a "quiet" flag very easy (i.e. just don't log anything below the warn level).

Sure, I like it and it's much more evolved than my simple logger() function :) Ideally, I think all the sub-folders in /include should be used as submodules in external/ as they are third-party libraries. Maybe we will not be able to use all of them in this modular way, but this will increase code readability and modularity.

jermp commented 10 months ago

I think we should do this ASAP! :)

rob-p commented 10 months ago

Hi @jermp! Thanks for the updates. I agree 100%. The longer we wait, the more painful the eventual re-integration with upstream will become. However, right now there are several things we'd have to add / modify regarding specific interfaces to interact cleanly with the piscem code. Mostly what's listed above regarding the logger (which is actually a bit more complicated because now we have spdlog but in a custom namespace to avoid C++ multiple symbol silliness when we compile piscem-cpp along with cuttlefish), and additionally there are 2 other tweaks that I'm aware of offhand.

First, there are a few added functions in this branch to assist in fast cached search (and to account for the observations we've made before that one must be extra careful when using caching on non-subsequent k-mers), and also an alteration to the default nucleotide encoding to use A:0, C:1, G:2, T:3 to match with the Kmer class used in the rest of the code.

I think we should create an sshash-submodule-integration branch, and once we have integrated, modified, and tested sshash as a submodule, we can merge that back into dev (and main).

--Rob

jermp commented 10 months ago

Ok, I'll create a new branch for SSHash, submodule-integration-for-piscem.

regarding the logger (which is actually a bit more complicated because now we have spdlog but in a custom namespace to avoid C++ multiple symbol silliness when we compile piscem-cpp along with cuttlefish)

For this matter, I think the cleanest way to proceed is to rely on the already present logging mechanism for SSHash (we can make sure that nothing is printed if verbose=false) and only use spdlog in piscem and/or cuttlefish.

alteration to the default nucleotide encoding to use A:0, C:1, G:2, T:3 to match with the Kmer class used in the rest of the code

For this one, we could have a compiler flag for SSHash which specifies what encoding to use. What do you think?

rob-p commented 10 months ago

Ok, I'll create a new branch for SSHash, submodule-integration-for-piscem.

Awesome!

For this matter, I think the cleanest way to proceed is to rely on the already present logging mechanism for SSHash (we can make sure that nothing is printed if verbose=false) and only use spdlog in piscem and/or cuttlefish.

I agree this is the cleanest solution, though not totally ideal (as we won't get logging out of the sshash part and diagnosing any weird errors may be harder). Perhaps we can add a new command line flag like --enable-sshash-log or some such, whose default would be false, that if enabled turns on sshash logging too?

For this one, we could have a compiler flag for SSHash which specifies what encoding to use. What do you think?

Sure that seems easy enough to add to the build script.

Once we start the integration we'll presumably run into the functions that are present in the piscem fork that aren't upstream. For those we can decide whether or not they are worth adding upstream to sshash, or if we should reorganize the relevant call sites in piscem.

jermp commented 10 months ago

Ok, branch created and solved the second issue of the encoding of nucleotides. See commit here https://github.com/jermp/sshash/commit/fd2cc097cd62a2cdb0f25dc4d1d441ac0b36b4a0 and in the README, https://github.com/jermp/sshash/tree/submodule-integration-for-piscem?tab=readme-ov-file#encoding-of-nucleotides.

rob-p commented 10 months ago

Awesome! I've created a corresponding branch of this repo, where we can make relevant integration changes on this side. You should have write access, but let me know if you don't.

jermp commented 10 months ago

Perfect. The new branch is up to date with dev, right? First step would be to import the new SSHash branch as a submodule and then start to slowly remove the code from Piscem.

jermp commented 10 months ago

Ok done it! I've taken the liberty to also clean the outdated README :D Will write a new one from scratch. (The codebase right now does not compile of course.)

rob-p commented 10 months ago

Ok, so I think we should have a checklist of differences and things to tackle (below is what I can think of now off the top of my head, but feel free to add more and I will add more if I recall any) :

[ ] Allow index builder to consume Cuttlefish 1 format input (the .cf_seg format)
[ ] Allow the build command to accept a thread parameter
[ ] Related to the above, fix the bug where index fails to build for small sets of sequences and small values of k (may be related to, already resolved by https://github.com/jermp/sshash/issues/33)
[ ] Cross check command line parameters between dev branch of piscem-cpp and sshash (I piscem-cpp I replaced the command line parsing with CLI11 and may have changed / added some flags (e.g. most flags also have a long option or a short option if they didn't before)
[ ] Check differences to / changes to the query API (specifically, think about the streaming query and make sure to port over all of the relevant changes / fixes for querying with non-adjacent k-mers)
[ ] Add a flag to turn on/off logging during build.

COMBINE-lab / piscem-cpp

SSHash as git submodule #2