EddyRivasLab / infernal

RNA secondary structure/sequence profiles for homology search and alignment

column offset in `--tblout`? #39

Closed sjanssen2 closed 10 months ago

sjanssen2 commented 1 year ago

I found an offset in the starting positions of columns when running cmsearch with the Rfam 14.9 collection against the UCSC-derived hg19 assembly.

#target name         accession query name           accession mdl mdl from   mdl to  seq from    seq to strand trunc pass   gc  bias  score   E-value inc description of target
#------------------- --------- -------------------- --------- --- -------- -------- --------- --------- ------ ----- ---- ---- ----- ------ --------- --- ---------------------
chr1                 -         5S_rRNA              RF00001    cm        1      119 228746133 228746015      -    no    1 0.61   0.0  118.9   6.6e-24 !   -
chr1                 -         5S_rRNA              RF00001    cm        1      119 228748374 228748256      -    no    1 0.61   0.0  118.9   6.6e-24 !   -
.
.
.
chr2                 -         U1                   RF00003    cm        1      166  56274720  56274797      +    no    1 0.33   0.0   20.2       7.4 ?   -
chr1                 -         U1                   RF00003    cm        1      166 109517705 109517546      -    no    1 0.38   0.0   19.9       9.2 ?   -
chr11                 -         U2                   RF00004    cm        1      193  62609281  62609091      -    no    1 0.44   0.0  191.2   9.7e-40 !   -
chr10                 -         U2                   RF00004    cm        1      193 103124792 103124602      -    no    1 0.45   0.0  179.5   6.2e-37 !   -
.
.
.
chr8                  -         U2                   RF00004    cm        1      193  57741437  57741496      +    no    1 0.28   0.0   24.4       9.9 ?   -
chr15                 -         U2                   RF00004    cm        1      193  54034824  54035003      +    no    1 0.34   0.5   24.4        10 ?   -
chr6                 -         tRNA                 RF00005    cm        1       71  28442402  28442329      -    no    1 0.50   0.0   76.8   2.4e-14 !   -
chr7                 -         tRNA                 RF00005    cm        1       71 149007281 149007352      +    no    1 0.56   0.0   73.9   1.6e-13 !   -

If you look at the second column (accession), the fields are all `-`, but this character jumps back and forth between positions 22 and 23. This breaks my current parser, and I wonder whether this is intended.

The above example was produced with version 1.1.3 on Linux rs500-bcf-10 5.15.0-78-generic #85~20.04.1-Ubuntu SMP Mon Jul 17 09:42:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux. I also ran INFERNAL 1.1.5 (Sep 2023) to see whether the issue persists: yes, it does.

nawrockie commented 1 year ago

@sjanssen2 : can you please attach the full tblout file, or better yet a minimal set of files with command that reproduces this?

nawrockie commented 1 year ago

This behavior is expected because the spacing for cmsearch tabular output can change for each CM (query). Based on what you provided above, it seems that the maximum width string for the target name field for all hits for U1 was one character less than for U2, for example. But I would have to see the full tblout file to verify that.

sjanssen2 commented 1 year ago

cmds:

wget https://ftp.ebi.ac.uk/pub/databases/Rfam/14.9/Rfam.cm.gz
gunzip Rfam.cm.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
zcat hg19.fa.gz > hg19.fa
cmsearch --tblout cmsearch.Rfam.hg19.tbl Rfam.cm hg19.fa

nawrockie commented 1 year ago

@sjanssen2 : I think that we commented simultaneously, as unlikely as that seems. Does my second comment from yesterday answer your question? If you'd like me to check the tblout file, please grep out the U1 and U2 lines and send them to me; I should then be able to verify whether what I wrote above applies here.

sjanssen2 commented 1 year ago

Hi @nawrockie you are right, we cross-posted :-) Here comes the subset of U1 and U2: U1U2.tbl

My goal is to parse the (tabular) output of cmsearch into a pandas.DataFrame. I first thought I could use \t as the separator, but there aren't any tabs. So I tried to orient myself on the runs of - below the header line. If U1, U2, and other queries in fact use different column widths, this is not possible either. So I started splitting on whitespace (\s+). Is my assumption correct that field contents will never contain whitespace as non-separators? And why are there columns without header names? In the stdout output, I find tables like

Query:       RNaseP_nuc  [CLEN=303]
Accession:   RF00009
Description: Nuclear RNase P
Hit scores:
 rank     E-value  score  bias  sequence     start       end   mdl trunc   gc  description
 ----   --------- ------ -----  -------- --------- ---------   --- ----- ----  -----------
  (1) !     5e-74  240.1   5.5  chr14     20811566  20811234 -  cm    no 0.65  -
  (2) !   4.8e-30  107.5   0.0  chr4     157907845 157907524 -  cm    no 0.53  -
  (3) !    0.0046   26.2   0.0  chr14     65376498  65376206 -  cm    no 0.45  -
 ------ inclusion threshold ------
  (4) ?     0.067   22.7   0.0  chr16     50615708  50615324 -  cm    no 0.50  -
  (5) ?      0.11   22.1   0.0  chr4     154785514 154785635 +  cm    no 0.40  -

The strand information does not come with a header name (which would not fit anyways) but also without a - to indicate the column. What is your suggestion on parsing Infernal output into pandas DataFrames or other tabular data structures?

nawrockie commented 1 year ago

The spacing difference is explained by a difference in the max length string in target name field of the tblout file. For U1, this is chr1_gl000192_random and for U2 it is chr17_gl000203_random.
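The arithmetic behind the 22-versus-23 shift can be checked with a short sketch. It assumes, based only on the explanation in this thread, that the target name field is padded to the longest name among the hits for one query and is followed by a single space; `accession_column` is a hypothetical helper for illustration, not part of Infernal.

```python
# Hypothetical helper: not Infernal code, only the padding rule described in this thread.
def accession_column(target_names):
    """Return the 1-based position of the accession field for one query.

    Assumes the target name field is as wide as the longest name and is
    followed by exactly one separating space.
    """
    width = max(len(name) for name in target_names)
    return width + 2  # field width + one space, then the next character


# Longest target names reported in this thread for each query.
u1_names = ["chr1", "chr2", "chr1_gl000192_random"]
u2_names = ["chr11", "chr10", "chr17_gl000203_random"]

print(accession_column(u1_names))  # 22
print(accession_column(u2_names))  # 23
```

The one-character difference between the two longest names matches the offset reported in the original post.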

Yes, you should be able to split on \s+ as whitespace won't occur as a non-separator except for the final description field, but since that occurs at the end there is usually a not so terrible way to work around that with a parser, in my experience.

My suggestion about parsing cmsearch output is to use the tblout file as that was intended for easier parsing than the standard output.
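A minimal sketch of the whitespace-splitting approach discussed above, assuming the 18-column tblout layout shown earlier in this thread. The function name `parse_tblout` and the underscore-style column names are illustrative choices, not part of Infernal. Limiting the number of splits keeps any whitespace inside the final description field intact.

```python
import io

# Column names adapted from the --tblout header shown in this thread.
TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "mdl", "mdl_from", "mdl_to", "seq_from", "seq_to", "strand",
    "trunc", "pass", "gc", "bias", "score", "e_value", "inc", "description",
]
INT_COLUMNS = ("mdl_from", "mdl_to", "seq_from", "seq_to", "pass")
FLOAT_COLUMNS = ("gc", "bias", "score", "e_value")


def parse_tblout(handle):
    """Return one dict per hit line; comment and blank lines are skipped."""
    records = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on runs of whitespace, but at most 17 times, so the
        # description (last field) may itself contain whitespace.
        fields = line.split(None, len(TBLOUT_COLUMNS) - 1)
        if len(fields) != len(TBLOUT_COLUMNS):
            raise ValueError(f"unexpected number of fields: {line!r}")
        record = dict(zip(TBLOUT_COLUMNS, fields))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        records.append(record)
    return records


# Two hit lines copied from the example above, with differing column offsets.
sample = io.StringIO(
    "#target name         accession query name           accession mdl\n"
    "chr1                 -         U1                   RF00003    cm        1      166 109517705 109517546      -    no    1 0.38   0.0   19.9       9.2 ?   -\n"
    "chr11                 -         U2                   RF00004    cm        1      193  62609281  62609091      -    no    1 0.44   0.0  191.2   9.7e-40 !   -\n"
)
hits = parse_tblout(sample)
print(hits[0]["target_name"], hits[1]["target_name"])
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(hits)`. The sketch raises on lines with an unexpected field count rather than guessing, so a different tblout layout would be noticed immediately.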

queirozhanna commented 11 months ago

Hi! I was reading the tutorial and tried the "cmbuild" command, but my terminal always says the command was not found. I've already configured the program, but the error persists. How can I solve this?

npcarter commented 11 months ago

Hanna,

How did you install Infernal?  In particular, did you do ‘make install’ or just ‘make’?  If you ran ‘make install’, which you’d probably have to do as root, the Infernal programs should be installed in places that your default PATH will find them.  If you just ran ‘make’, the Infernal programs will be built somewhere in your Infernal-1.1.1 directory and you’ll need to provide the correct path to them to run them.  In HMMER, the programs are built in the ‘src’ sub-directory, but I haven’t worked with Infernal.

Here’s something that might help: 1) run ‘find . -name cmbuild’ from your home directory.  If you’ve built Infernal correctly, that should return the path from your home directory to the cmbuild program.  You can then run cmbuild by entering that full path.  For example, here’s me doing something similar with a HMMER install on my machine:

@.:~/h3-server$ find . -name hmmscan
./src/hmmscan
@.:~/h3-server$ ./src/hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options]

To be able to use cmbuild more conveniently, you can add the directory where cmbuild and the other Infernal programs reside to your PATH variable.  A web search should return lots of instructions on how to do this if you aren’t familiar with that process.

-Nick

nawrockie commented 11 months ago

As with HMMER, running make builds programs like cmbuild in infernal-1.1.5/src/, so running ./src/cmbuild tRNA5.cm tutorial/tRNA.5.sto from the infernal-1.1.5 directory should work.

Installation instructions are here: https://github.com/EddyRivasLab/infernal/blob/master/README.md#to-download-and-build-the-current-source-code-release
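For scripts that call Infernal, the two cases described above (programs on PATH after make install, or programs left in src/ after make) can be checked with a small sketch. `find_program` is a hypothetical helper, and the `infernal-1.1.5` directory name is taken from the comment above; adjust it to wherever the source was unpacked.

```python
import os
import shutil


def find_program(name, build_dir=None):
    """Return a path to `name` from PATH, else from build_dir/src, else None."""
    found = shutil.which(name)  # searches the directories listed in PATH
    if found:
        return found
    if build_dir is not None:
        # Fall back to the src/ sub-directory of an uninstalled build.
        candidate = os.path.join(build_dir, "src", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Prints a path if cmbuild is found, otherwise None.
print(find_program("cmbuild", build_dir="infernal-1.1.5"))
```

If this prints None, the program is neither on PATH nor in the given build directory, which would explain a "command not found" error.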

queirozhanna commented 11 months ago

Thank you!

queirozhanna commented 11 months ago

Thanks!