Open AntonPetrov opened 2 years ago
The Rfam track hubs are currently updated only at major releases and the procedure is not automated.
We need to develop a Nextflow pipeline that takes GCA/GCF accessions as input and does the following:
- Calculate -Z for each genome as explained in Rfam docs.

      # Shuffle the Rfam model accessions and split them into chunks of 100
      cmstat "$RFAMCM" | grep -v '^#' | awk '{ print $3 }' | shuf > all.list
      split -l 100 all.list cm.

      # Fetch each chunk into its own CM file: rand.0.cm, rand.1.cm, ...
      count=0
      for filename in cm.*; do
          cmfetch -f "$RFAMCM" "$filename" > "rand.$count.cm"
          ((count++))
      done
      rm cm.*

      # Submit one cmsearch job to LSF: one chunk of models against one chromosome
      bsub -n 8 -M 12000 "cmsearch -o <name.out> --cpu 8 -Z <genome-score> --tblout <name.tblout> --cut_ga --rfam --nohmmonly rand.1.cm chr10.fasta"

- .tblout and the original cmsearch .out files can be gzipped and stored on FTP.
- .tblout files can be used to generate a trackhub using tblout2bigBed.pl or tblout2bigBedGenomes.pl.

Bonus point: add secondary structure directly into the BED file as an additional field using the BED detail format. See RT90465 for background. Unfortunately I could not track down the code that was used for that ticket.
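
For the -Z step, here is a minimal sketch, assuming the calculation described in the Rfam docs: total nucleotides in the genome, multiplied by 2 because both strands are searched, divided by 1,000,000 because -Z is given in megabases. The function name `genome_z` is hypothetical, and plain awk is used to count residues (rather than a tool such as esl-seqstat) only so that the snippet has no dependencies beyond a standard shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a cmsearch -Z value (search space in Mb)
# for the FASTA file given as $1.
# Assumption: Z = total residues * 2 (both strands) / 1,000,000.
genome_z() {
    awk '
        /^>/ { next }                       # skip FASTA header lines
        {
            gsub(/[ \t\r]/, "")             # drop whitespace and CR characters
            total += length($0)             # count residues on this line
        }
        END { printf "%.6f\n", total * 2 / 1000000 }
    ' "$1"
}
```

In a pipeline step this could be used as `Z=$(genome_z chr10.fasta)`, with `"$Z"` then passed to `cmsearch -Z`.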
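
For the track hub step, here is a rough sketch of turning cmsearch `--tblout` rows into sorted BED6 lines, which is the kind of input a bigBed conversion would need. This is not the logic of tblout2bigBed.pl, which is not shown in this issue. The function name `tblout_to_bed`, the use of the query (family) name as the BED name, and the rounding and capping of the bit score to the 0-1000 BED score range are all assumptions. Column positions follow the default Infernal `--tblout` layout (target name in column 1, query name in column 3, seq from/to in columns 8 and 9, strand in column 10, bit score in column 15). Secondary structure for the BED detail field is not handled here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a cmsearch --tblout file ($1) to BED6 on stdout.
# Assumptions: default tblout column layout; bit score rounded and capped to 0..1000.
tblout_to_bed() {
    awk '
        /^#/ || NF < 17 { next }            # skip comment lines and malformed rows
        {
            from = $8 + 0
            to   = $9 + 0
            # Minus-strand hits have from > to; BED needs start < end
            # with a 0-based start and a 1-based (exclusive) end.
            bed_start = (from < to ? from : to) - 1
            bed_end   = (from < to ? to : from)
            score = int($15 + 0.5)
            if (score < 0)    score = 0
            if (score > 1000) score = 1000
            printf "%s\t%d\t%d\t%s\t%d\t%s\n", $1, bed_start, bed_end, $3, score, $10
        }
    ' "$1" | LC_ALL=C sort -k1,1 -k2,2n     # sort by sequence name, then start
}
```

A call such as `tblout_to_bed name.tblout > name.bed` would then produce a file that could be checked by eye before any bigBed conversion is attempted.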