Closed sjanssen2 closed 10 months ago
@sjanssen2 : can you please attach the full tblout file, or better yet a minimal set of files with command that reproduces this?
This behavior is expected because the spacing for cmsearch tabular output can change for each CM (query). Based on what you provided above, it seems that the maximum-width string in the target name field across all hits for U1 was one character shorter than for U2, for example. But I would have to see the full tblout file to verify that.
cmds:
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/14.9/Rfam.cm.gz
gunzip Rfam.cm.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
zcat hg19.fa.gz > hg19.fa
cmsearch --tblout cmsearch.Rfam.hg19.tbl Rfam.cm hg19.fa
@sjanssen2 : I think that we commented simultaneously, as unlikely as that seems. Does my second comment from yesterday answer your question? If you'd like me to check the tblout file, please grep out the U1 and U2 lines and send them to me; I should then be able to verify whether what I wrote above applies here.
Hi @nawrockie you are right, we cross-posted :-) Here comes the subset of U1 and U2: U1U2.tbl
My goal is to parse the (tabular) output of cmsearch into a pandas.DataFrame. I first thought I could use \t as the separator, but there aren't any tabs. Then I tried to orient on the number of - characters in the line below the header. If U1, U2 and others in fact use different column widths, this is also not possible. So I started splitting on whitespace (\s+). Is my assumption correct that field contents will never contain whitespace as non-separators?
And why are there columns without header names? In the stdout output, I find tables like
Query: RNaseP_nuc [CLEN=303]
Accession: RF00009
Description: Nuclear RNase P
Hit scores:
rank     E-value  score  bias  sequence     start       end   mdl trunc   gc description
----   --------- ------ -----  -------- --------- ---------   --- ----- ---- -----------
 (1) !     5e-74  240.1   5.5  chr14     20811566  20811234 -  cm    no 0.65 -
 (2) !   4.8e-30  107.5   0.0  chr4     157907845 157907524 -  cm    no 0.53 -
 (3) !    0.0046   26.2   0.0  chr14     65376498  65376206 -  cm    no 0.45 -
 ------ inclusion threshold ------
 (4) ?     0.067   22.7   0.0  chr16     50615708  50615324 -  cm    no 0.50 -
 (5) ?      0.11   22.1   0.0  chr4     154785514 154785635 +  cm    no 0.40 -
The strand information comes without a header name (which would not fit anyway), and also without a - underline to indicate the column.
What is your suggestion on parsing Infernal output into pandas DataFrames or other tabular data structures?
The spacing difference is explained by a difference in the maximum-length string in the target name field of the tblout file. For U1, this is chr1_gl000192_random and for U2 it is chr17_gl000203_random.
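To make the one-character shift concrete, here is a small Python sketch. It assumes the target name column is left-justified, padded to the longest name among the hits for that query, and followed by a single space; that is my reading of the explanation above, not a description of Infernal's source. The function name and the short "chr1" entries are made up for illustration.

```python
def next_field_start(target_names):
    """1-based position of the field that follows a padded name column.

    Assumes the column is as wide as the longest name and is followed by
    exactly one separating space (an assumption, see the text above).
    """
    width = max(len(name) for name in target_names)
    return width + 2  # +1 for the separating space, +1 for 1-based counting


# Hypothetical hit lists; only the longest name per query is from the thread.
u1_targets = ["chr1", "chr1_gl000192_random"]
u2_targets = ["chr1", "chr17_gl000203_random"]

print(next_field_start(u1_targets))  # 22
print(next_field_start(u2_targets))  # 23
```

Under these assumptions the accession field starts at position 22 for U1 and 23 for U2, which matches the jump reported in this issue.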
Yes, you should be able to split on \s+, as whitespace won't occur as a non-separator except in the final description field. Since that field comes last, there is usually a not-so-terrible way to work around it in a parser, in my experience.
My suggestion about parsing cmsearch output is to use the tblout file as that was intended for easier parsing than the standard output.
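As a rough sketch of that approach in Python: skip the # comment lines, split each hit line on runs of whitespace, and cap the number of splits so the trailing description stays in one piece. The field labels and the function name below are my own, and the list assumes the default 18-field cmsearch tblout layout; other tblout formats would need a different list.

```python
# Field labels are illustrative, not official names. They assume the default
# 18-field cmsearch --tblout layout, whose last field is the free-text
# description of the target.
TBLOUT_FIELDS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "mdl", "mdl_from", "mdl_to", "seq_from", "seq_to", "strand",
    "trunc", "pass", "gc", "bias", "score", "e_value", "inc", "description",
]


def parse_tblout(lines, fields=TBLOUT_FIELDS):
    """Return one dict per hit line, skipping '#' comment lines and blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on runs of whitespace, but stop before the last field so
        # that spaces inside the description are preserved.
        values = line.split(None, len(fields) - 1)
        records.append(dict(zip(fields, values)))
    return records
```

With pandas installed, something like pd.DataFrame(parse_tblout(open("cmsearch.Rfam.hg19.tbl"))) should then give one row per hit. All values come back as strings, so numeric columns would still need converting, for example with pd.to_numeric.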
Hi! I was reading the tutorial and tried the "cmbuild" command, but my terminal always says the command was not found. I've already configured the program, but the error persists. How can I solve this?
Hanna,
How did you install Infernal? In particular, did you do ‘make install’ or just ‘make’? If you ran ‘make install’, which you’d probably have to do as root, the Infernal programs should be installed in places that your default PATH will find them. If you just ran ‘make’, the Infernal programs will be built somewhere in your Infernal-1.1.1 directory and you’ll need to provide the correct path to them to run them. In HMMER, the programs are built in the ‘src’ sub-directory, but I haven’t worked with Infernal.
Here’s something that might help: 1) run ‘find . -name cmbuild’ from your home directory. If you’ve built Infernal correctly, that should return the path from your home directory to the cmbuild program. You can then run cmbuild by entering that full path. For example, here’s me doing something similar with a HMMER install on my machine:
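@.:~/h3-server$ find . -name hmmscan
./src/hmmscan
@.:~/h3-server$ ./src/hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options]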
To be able to use cmbuild more conveniently, you can add the directory where cmbuild and the other Infernal programs reside to your PATH variable. A web search should return lots of instructions on how to do this if you aren’t familiar with that process.
-Nick
Like hmmer, with make, programs like cmbuild will be installed in infernal-1.1.5/src/, so running ./src/cmbuild tRNA5.cm tutorial/tRNA.5.sto from the infernal-1.1.5 directory should work.
Installation instructions are here: https://github.com/EddyRivasLab/infernal/blob/master/README.md#to-download-and-build-the-current-source-code-release
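If you would rather check this from a script, here is a minimal Python sketch of the same idea as the find command above: look on PATH first, then search a directory tree for an executable with the given name. The function name locate_program is made up for this example, and the search root is whatever directory you unpacked or built Infernal in.

```python
import os
import shutil
from pathlib import Path


def locate_program(name, search_root):
    """Return a path to `name`: from PATH if present, else under search_root.

    Returns None when nothing executable with that name is found.
    """
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Fall back to a recursive search, similar to `find <root> -name <name>`.
    for candidate in Path(search_root).rglob(name):
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

For example, locate_program("cmbuild", "infernal-1.1.5") would be expected to return something ending in src/cmbuild after a plain make, assuming the build succeeded and cmbuild is not already on your PATH.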
Thank you!
Thanks!
I found an offset in the starting positions of columns when using cmsearch with the 14.9 Rfam collection against the UCSC-derived hg19 assembly.
If you take a look at the second column, accession, the fields are all -, but this character seems to jump back and forth from position 22 to 23. This breaks my current parser and I wonder if this is intended or not. The above example is produced with version 1.1.3 on
Linux rs500-bcf-10 5.15.0-78-generic #85~20.04.1-Ubuntu SMP Mon Jul 17 09:42:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
I am currently running INFERNAL 1.1.5 (Sep 2023) to see if this issue persists ... yes, it does.