EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
317 stars 70 forks source link

How to save file reading/writing? #225

Closed Shen-Jing closed 3 years ago

Shen-Jing commented 3 years ago

Hello, I'm sorry to disturb you.

Here is my workflow:

Input

Steps (pseudocode)

while (/* reads data left */)
{
    1-1. Using 3rd library to generate Multiple Sequence Alignment in temporary strings
    1-2. Output these strings to `sample.afa` file
    2. Using `hmmbuild sample.hmm sample.afa` to build profile hmm.
    3. Using `hmmemit sample.hmm` to generate consensus
}

Problem

I have finished my program (written in C++). But the performance bottleneck is I/O wait. There are heavy disk I/O in my workflow.

My ideal workflow:

1-1. (Same) Using 3rd library to generate Multiple Sequence Alignment

  1. reading these MSA string and store result in P7_HMM variable

In this way, it save these overheads:

  1. MSA file writing
  2. MSA file reading
  3. HMM file writing
  4. HMM file reading

But I couldn't find out which function (source code) I could apply in my ideal workflow. I only know msafile_OpenBuffer, p7_hmmfile_WriteASCII, p7_hmmfile_Read could be used to read/write file. I don't know how to pipe them together to save I/O. It beats me.

Could you please give some help?

Sorry my English and logic are pretty limited if my description confused you.

Thanks in advance.

cryptogenomicon commented 3 years ago

If you need to generate a consensus sequence from a MSA, there are probably more efficient ways than going through a profile HMM.

If you want to generate a profile HMM without saving the MSA to disk, you can do something like:

%  <msa_making_program> | hmmbuild --informat afa -n <model_name> <hmmfile> -

where <msa_making_program> generates an aligned FASTA (afa) file on stdout, and hmmbuild will read it from stdin if you give - as the input argument. When reading from a stream, hmmbuild is unable to determine the input file format, so the --informat afa option is needed to specify it; and when reading an AFA file from a stream, hmmbuild is unable to determine what name you might want to assign to the profile, so you have to specify one with -n <model_name>. This will save the new profile to <hmmfile>, and at least remove your steps 1 and 2.

If you really want to modify the source code, I'm sorry, that question is a bit beyond what I have time to answer; apologies.

Shen-Jing commented 3 years ago

Hi,

Thanks for the quick reply!

If you really want to modify the source code

Yes, this is what I want. In fact, I need P7_HMM variable (hmm) for more operation. The hmmemit is my partial step.


you can do something like:

Thanks your advice, program example and explanation! It's a clear instruction. I will think about whether to change my implementation. 😥

Thank you!