EddyRivasLab / infernal

RNA secondary structure/sequence profiles for homology search and alignment
Other
101 stars 24 forks source link

a option to disable the output of unmapped query when --fmt 2 #9

Closed wangyugui closed 7 years ago

wangyugui commented 7 years ago

Hi.

Is there a option to disable the output of unmapped query when --fmt 2?

We need to output all mapped query only.

And a format like blast format 6 will be good too.

wangyugui commented 7 years ago

currently I am using

cmscan --cpu 2 --rfam --cut_ga --nohmmonly --tblout mrum-genome.tblout --noali --notextw --oskip --fmt 2 --clanin /usr/bio-ref/rfam-12.3/Rfam.clanin /usr/bio-ref/rfam-12.3/Rfam.cm -
nawrockie commented 7 years ago

Can you please provide an example of what you mean by 'unmapped query'?

It is not possible to output 'blast format 6' exactly, but the tabular output file (mrum-genome.tblout) is similar. That tabular file can be easily manipulated using grep and awk, like this:

cat mrum-genome.tblout | grep -v ^# | awk '{ printf("%s %s %s %s %s\n", $2, $4, $10, $11, $18); }'

which will print only the model name, sequence name, sequence start position, sequence stop position and E-value for all hits.

nawrockie commented 7 years ago

The tabular output file format is described on page 60 of the Infernal v1.1.2 user guide, available from: eddylab.org/infernal/.

wangyugui commented 7 years ago

an example of 'unmapped query':

Query:       TRINITY_DN94_c0_g1_i1  [L=455]
Description: top_iso:1 pct_iso_expr=100 pct_dom_iso_expr=100
Hit scores:
 rank     E-value  score  bias  modelname  start    end   mdl trunc   gc  description
 ----   --------- ------ -----  --------- ------ ------   --- ----- ----  -----------

   [No hits detected that satisfy reporting thresholds]

Internal CM pipeline statistics summary:
----------------------------------------
Query sequence(s):                                               1  (910 residues searched)
Query sequences re-searched for truncated hits:                  1  (706.3 residues re-searched, avg per model)
Target model(s):                                              2687  (346722 consensus positions)
Windows   passing  local HMM SSV           filter:            1920  (0.0669); expected (0.06)
Windows   passing  local HMM Viterbi       filter:             566  (0.01923); expected (0.02)
Windows   passing  local HMM Viterbi  bias filter:             300  (0.008982); expected (0.02)
Windows   passing  local HMM Forward       filter:              39  (0.00145); expected (0.0002)
Windows   passing  local HMM Forward  bias filter:              26  (0.001074); expected (0.0002)
Windows   passing glocal HMM Forward       filter:              11  (0.0004396); expected (0.0002)
Windows   passing glocal HMM Forward  bias filter:              11  (0.0004396); expected (0.0002)
Envelopes passing glocal HMM envelope defn filter:              11  (0.0003051); expected (0.0002)
Envelopes passing  local CM  CYK           filter:               3  (6.516e-05); expected (0.0001)
Total CM hits reported:                                          0  (0); includes 0 truncated hit(s)

Thanks for your info.

mrum-genome.tblout is what I wanted. I'm sorry that I though this 'tbl' as 'tblast' not as 'tabular '

wangyugui commented 7 years ago

The final process maybe useful for others.

cat seq.fa |
parallel --halt soon,fail=1 --gnu --no-notice --keep-order --pipe -j 12 --block 100k --recstart '>' \
    cmscan --cpu 2 --rfam --cut_ga --nohmmonly \
    --fmt 2 --clanin Rfam.clanin --tblout /dev/stderr \
    Rfam.cm -  >rfam.scan.txt 2>rfam.scan.tbl.txt