CamilaDuitama / muset

MUSET. Set of utilities for the construction of abundance unitig matrices from sequencing data
MIT License

Changes in the repository structure and other modifications #9

Closed rvicedomini closed 3 months ago

rvicedomini commented 3 months ago

Hi @CamilaDuitama & @frankandreace ,

I want to mention that I made some important changes to the repository.

I don't know if there is more to add... if there is, I'll add it later...

rvicedomini commented 3 months ago

I've made some modifications related to the -s option. More precisely, I replaced the -s option with the -i option, which accepts a file path to an existing matrix (before, -s assumed the matrix had a fixed name within the output folder). Now you can use a previously computed input matrix and run the pipeline in a different output directory (if desired).

One thing I'd like to do is to avoid requiring the positional argument (fof file) when the input matrix is provided through -i, as it is only required by kmtricks (which is skipped).

rvicedomini commented 3 months ago

I just committed an update that ignores the positional argument, if present, when the -i option is provided.
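The argument handling described above can be sketched as follows. This is only an illustration in Python using `argparse`, not the actual muset code: the parser name `muset_sketch`, the function `parse_args`, and the attribute names `input_matrix` and `fof` are all hypothetical. Only the -i option and the positional fof file come from the discussion.

```python
import argparse


def parse_args(argv):
    """Parse a simplified, hypothetical muset-like command line."""
    parser = argparse.ArgumentParser(prog="muset_sketch")
    # -i: path to a previously computed matrix (skips the kmtricks step).
    parser.add_argument("-i", dest="input_matrix", default=None,
                        help="path to an existing matrix")
    # The fof file is declared optional so that it can be omitted with -i.
    parser.add_argument("fof", nargs="?", default=None,
                        help="file of files, only needed by kmtricks")
    args = parser.parse_args(argv)

    if args.input_matrix is None and args.fof is None:
        # Without -i the matrix must be built, so the fof file is mandatory.
        parser.error("the fof file is required unless -i is provided")
    if args.input_matrix is not None:
        # With -i the positional argument is ignored, even if present.
        args.fof = None
    return args
```

With this layout, both `-i matrix.mat` and `-i matrix.mat samples.fof` are accepted (file names made up for the example), while calling the script with no arguments at all exits with a usage error.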

rvicedomini commented 3 months ago

@CamilaDuitama & @frankandreace, I was looking at the code and noticed that the last part of the pipeline is quite inefficient for two main reasons. First, sorting does not seem to be needed (also considering that ggcat output is not deterministic). Second, for each unitig ID, a linear scan of the unitig FASTA is done to find its sequence (which makes this step quadratic in the number of unitigs).

If all we want is to have, as final output, a unitig matrix containing the actual sequence in the first column (instead of the ID), I only have to change one line of code for the command kmat_tools unitig (thus skipping the aforementioned part). I could also add a flag if you want the user to decide whether to have the ID or the sequence in the first column.
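To illustrate the cost issue mentioned above: scanning the FASTA once per unitig ID is quadratic, while reading it once into a dictionary makes each lookup constant time. The sketch below is a generic Python illustration of that idea, not the change made in kmat_tools unitig (which avoids the lookup altogether). The function names, the `use_sequence` parameter, and the assumed formats (whitespace-separated matrix rows with the ID first, FASTA headers whose first token is the ID) are assumptions.

```python
def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the ID is the first token of the header line.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def rewrite_first_column(matrix_lines, fasta_handle, use_sequence=True):
    """Yield matrix rows, optionally replacing the unitig ID by its sequence."""
    if not use_sequence:
        # Keep the ID in the first column: rows pass through unchanged.
        yield from matrix_lines
        return
    # Single pass over the FASTA; later lookups are O(1) on average.
    index = dict(read_fasta(fasta_handle))
    for row in matrix_lines:
        unitig_id, *values = row.split()
        yield " ".join([index[unitig_id], *values])
```

Because the index is keyed by ID, the rows can be processed in whatever order they appear in the matrix, so no sorting step is needed for this lookup.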

CamilaDuitama commented 3 months ago

Yes please, if it's more easily done from the kmat_tools unitig code then it's much better. Thanks @rvicedomini

rvicedomini commented 3 months ago

The -s flag has been added to both the muset and kmat_tools unitig commands. When it is set, the unitig sequence is written in the first column of the output matrix; without it, the ID is written (the default behavior).

I also made a small change to the main script. More precisely, it now updates the PATH environment variable at startup using the directory of the script. This way, the kmat_tools executable that gets called should be the one in the same folder as muset.
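The effect of the PATH update can be sketched like this. It is a Python illustration of the general technique, assuming the script directory is placed at the front of PATH so it takes precedence; it is not the code of the muset script itself, and the function name `prepend_script_dir_to_path` is hypothetical.

```python
import os
import shutil


def prepend_script_dir_to_path(script_path, env=None):
    """Put the directory containing script_path at the front of PATH."""
    env = os.environ if env is None else env
    script_dir = os.path.dirname(os.path.abspath(script_path))
    # Earlier PATH entries win, so the sibling executable is found first.
    env["PATH"] = script_dir + os.pathsep + env.get("PATH", "")
    return script_dir


def find_tool(name, env=None):
    """Return the executable that would be run for name, or None."""
    env = os.environ if env is None else env
    return shutil.which(name, path=env.get("PATH", ""))
```

After calling `prepend_script_dir_to_path(__file__)`, `find_tool("kmat_tools")` resolves to the copy sitting next to the script when one exists there, even if another copy is installed elsewhere on the system.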