Open jermp opened 2 years ago
I think this makes sense, but right now there are some other changes that go a bit deeper that we would need to resolve. For example a few other files in the sshash source tree have been modified — for example, the builder was modified to consume the cuttlefish reduced GFA format, some functions were added to the utilities file, and some classes were made friends of other classes because of access requirements. Also, I just a couple of days ago moved a bunch of logging messages to spdlog
so that I could easily have a "quiet" build.
I agree having composability of source code as well as conceptually would be very nice. We could figure out what outside of the standard API would need to be exposed and figure out a minimal way to do that.
Oh ok sure, I forgot about the modified builder! Yes, that is something to think about and to be done as ultimate step maybe. Thanks @rob-p!
Keep track to modifications/additions to SSHash:
util.hpp
spdlog::info
others?
add option to parse cuttlefish GFA format (is it similar to "regular" GFA or it has some special constraints?)
It's even simpler; each line is basically just a numeric identifier, followed by the unitig it labels.
add needed utilities to util.hpp
Yes, and we can decide what code belongs in sshash
proper and what belongs in piscem-cpp
. One thing I've not been good about so far is separating them in terms of structure (the code for piscem-cpp
related stuff lives in the same source tree as the sshash
stuff). We might want to think about a better logical separation in terms of the file tree.
expose access to external code if needed
Yup — not sure what exactly has changed here, but not too much I think.
better logging function similar to spdlog::info
So that one I just sort of lept in on, as I'm super biased in favor of spdlog
for C++ logging. However, we can obviously consider changing it out for a similarly capable library. The big features of spdlog
that are made use of is the fact that it trivially enables efficient multi-threaded logging, and also, it's easy to log at different levels and change those at runtime (using spdlog::info
, spdlog::warn
, spdlog::critical
and such). So that makes implementing a "quiet" flag very easy (i.e. just don't log anything below the warn
level).
add option to parse cuttlefish GFA format (is it similar to "regular" GFA or it has some special constraints?)
It's even simpler; each line is basically just a numeric identifier, followed by the unitig it labels.
Oh yes, I think you passed me some examples some time ago. Ok, so, instead of the usual single-line .fasta format
>0
line0
>1
line1
...
we just have
0 line0
1 line1
...
add needed utilities to util.hpp
Yes, and we can decide what code belongs in
sshash
proper and what belongs inpiscem-cpp
. One thing I've not been good about so far is separating them in terms of structure (the code forpiscem-cpp
related stuff lives in the same source tree as thesshash
stuff). We might want to think about a better logical separation in terms of the file tree.
Yes, exactly the same I have in mind. That will also help very much in conveying the main take-home message of the library: easy composition with a minimal code wrapper on top, linking all the parts together.
better logging function similar to spdlog::info
So that one I just sort of lept in on, as I'm super biased in favor of
spdlog
for C++ logging. However, we can obviously consider changing it out for a similarly capable library. The big features ofspdlog
that are made use of is the fact that it trivially enables efficient multi-threaded logging, and also, it's easy to log at different levels and change those at runtime (usingspdlog::info
,spdlog::warn
,spdlog::critical
and such). So that makes implementing a "quiet" flag very easy (i.e. just don't log anything below thewarn
level).
Sure, I like it and it's much more evolved than my simple logger()
function :)
Ideally, I think all the sub-folders in /include
should be used as submodules in external/
as they are third-party libraries.
Maybe we will not be able to use all of them in this modular way, but this will increase code readability and modularity.
I think we should do this ASAP! :)
Hi @jermp! Thanks for the updates. I agree 100%. The longer we wait, the more painful the eventual re-integration with upstream will become. However, right now there are several things we'd have to add / modify regarding specific interfaces to interact cleanly with the piscem code. Mostly what's listed above regarding the logger (which is actually a bit more complicated because now we have spdlog but in a custom namespace to avoid C++ multiple symbol silliness when we compile piscem-cpp along with cuttlefish), and additionally there are 2 other tweaks that I'm aware of offhand.
First, there are a few added functions in this branch to assist in fast cached search (and to account for the observations we've made before that one must be extra careful when using caching on non-subsequent k-mers), and also an alteration to the default nucleotide encoding to use A:0, C:1, G:2, T:3 to match with the Kmer class used in the rest of the code.
I think we should create an sshash-submodule-integration branch, and once we have integrated, modified, and tested sshash as a submodule, we can merge that back into dev (and main).
--Rob
Ok, I'll create a new branch for SSHash, submodule-integration-for-piscem
.
regarding the logger (which is actually a bit more complicated because now we have spdlog but in a custom namespace to avoid C++ multiple symbol silliness when we compile piscem-cpp along with cuttlefish)
For this matter, I think the cleanest way to proceed is to rely on the already present logging mechanism for SSHash (we can make sure that nothing is printed if verbose=false
) and only use spdlog in piscem and/or cuttlefish.
alteration to the default nucleotide encoding to use A:0, C:1, G:2, T:3 to match with the Kmer class used in the rest of the code
For this one, we could have a compiler flag for SSHash which specifies what encoding to use. What do you think?
Ok, I'll create a new branch for SSHash, submodule-integration-for-piscem.
Awesome!
For this matter, I think the cleanest way to proceed is to rely on the already present logging mechanism for SSHash (we can make sure that nothing is printed if verbose=false) and only use spdlog in piscem and/or cuttlefish.
I agree this is the cleanest solution, though not totally ideal (as we won't get logging out of the sshash part and diagnosing any weird errors may be harder). Perhaps we can add a new command line flag like --enable-sshash-log
or some such, whose default would be false, that if enabled turns on sshash logging too?
For this one, we could have a compiler flag for SSHash which specifies what encoding to use. What do you think?
Sure that seems easy enough to add to the build script.
Once we start the integration we'll presumably run into the functions that are present in the piscem fork that aren't upstream. For those we can decide whether or not they are worth adding upstream to sshash, or if we should reorganize the relevant call sites in piscem.
Ok, branch created and solved the second issue of the encoding of nucleotides. See commit here https://github.com/jermp/sshash/commit/fd2cc097cd62a2cdb0f25dc4d1d441ac0b36b4a0 and in the README, https://github.com/jermp/sshash/tree/submodule-integration-for-piscem?tab=readme-ov-file#encoding-of-nucleotides.
Awesome! I've created a corresponding branch of this repo, where we can make relevant integration changes on this side. You should have write access, but let me know if you don't.
Perfect. The new branch is up to date with dev
, right? First step would be to import the new SSHash branch as a
submodule and then start to slowly remove the code from Piscem.
Ok done it! I've taken the liberty to also clean the outdated README :D Will write a new one from scratch. (The codebase right now does not compile of course.)
Ok, so I think we should have a checklist of differences and things to tackle (below is what I can think of now off the top of my head, but feel free to add more and I will add more if I recall any) :
.cf_seg
format)dev
branch of piscem-cpp
and sshash
(I piscem-cpp
I replaced the command line parsing with CLI11
and may have changed / added some flags (e.g. most flags also have a long option or a short option if they didn't before)
I think that it would be better if SSHash were used as a submodule, to better highlight the "separation of concerns". The only current modifications are done in the streaming query object, but that can also be reflected in the SSHash repo. Or, are there any hard limits to this?
For the moment, we can leave everything as it is, of course. But in the long term, I think piscem should highlight the composition SSHash+contig-table.