Update trackhubs

The Rfam track hubs are currently updated only at major releases and the procedure is not automated.

We need to develop a Nextflow pipeline that takes GCA/GCF accessions as input and does the following:

download FASTA files using the NCBI CLI tool, as discussed in #118
compute -Z (the search space size in megabases that cmsearch uses for E-value calculations) for each genome, as explained in the Rfam docs
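
A minimal sketch of the -Z step, assuming the formula is total residues times 2 (both strands are searched) divided by 1,000,000 to get megabases, and assuming an uncompressed FASTA file. The function name genome_z is hypothetical, and awk stands in here for a dedicated sequence statistics tool:

```shell
# genome_z is a hypothetical helper name, not part of Infernal or rfam-production.
# Prints a -Z value (search space in Mb, counting both strands) for one FASTA file.
# Assumption: Z = total residues * 2 / 1,000,000.
genome_z() {
    awk '
        /^>/ { next }                                  # skip sequence header lines
        { gsub(/[ \t\r]/, ""); total += length($0) }   # count residues, ignoring whitespace
        END { printf "%.6f\n", total * 2 / 1000000 }
    ' "$1"
}
```

The result could then be passed straight to the search step, for example `cmsearch -Z "$(genome_z chr10.fasta)" ...`.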

randomly partition Rfam.cm into sets of 100 models

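# list all model accessions (column 3 of the cmstat output) in random order
cmstat "$RFAMCM" | grep -v '^#' | awk '{ print $3 }' | shuf > all.list
# split the list into chunks of 100 accessions: cm.aa, cm.ab, ...
split -l 100 all.list cm.
count=0
for filename in cm.*; do cmfetch -f "$RFAMCM" "$filename" > "rand.$count.cm"; count=$((count + 1)); done
rm cm.*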

run cmsearch with each CM set against each FASTA file

bsub -n 8 -M 12000 "cmsearch -o <name.out> --cpu 8 -Z <genome-score> --tblout <name.tblout> --cut_ga --rfam --nohmmonly rand.1.cm chr10.fasta"
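
To cover every CM set for a given genome, the command above could be generated in a loop. The following is only a sketch: print_cmsearch_jobs is a hypothetical helper, the `<genome>.<set>` output naming is an assumption, and the function prints the bsub commands (a dry run) instead of submitting them:

```shell
# print_cmsearch_jobs is a hypothetical helper; it only prints the bsub commands
# so they can be reviewed (or piped to sh) before submission.
# Usage: print_cmsearch_jobs <Z> <fasta> <cm-set>...
print_cmsearch_jobs() {
    z=$1; fasta=$2; shift 2
    genome=$(basename "$fasta")
    genome=${genome%.*}                  # chr10.fasta -> chr10
    for cm in "$@"; do
        set_name=$(basename "$cm" .cm)   # rand.1.cm -> rand.1
        name="$genome.$set_name"         # assumed output naming scheme
        printf 'bsub -n 8 -M 12000 "cmsearch -o %s.out --cpu 8 -Z %s --tblout %s.tblout --cut_ga --rfam --nohmmonly %s %s"\n' \
            "$name" "$z" "$name" "$cm" "$fasta"
    done
}
```

For example, `print_cmsearch_jobs 11.36 chr10.fasta rand.*.cm` would print one job per CM set for chr10.fasta. In the Nextflow pipeline the same pairing would more naturally come from combining a channel of CM sets with a channel of FASTA files.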

concatenate the tblout files and remove overlapping hits using cmsearch_tblout_deoverlap
the de-overlapped .tblout files and the original cmsearch .out files can be gzipped and stored on the FTP site
the de-overlapped .tblout files can then be used to generate a track hub with tblout2bigBed.pl or tblout2bigBedGenomes.pl
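
The concatenation step could look roughly like this. It is a sketch under assumptions: merge_tblout is a hypothetical helper name, and dropping all `#` comment lines (column headers and run summaries) is a choice made here for a clean merged file, not something the downstream scripts are known to require:

```shell
# merge_tblout is a hypothetical helper name.
# Usage: merge_tblout <merged-output> <tblout>...
# Keeps only hit lines; '#' comment lines and empty lines are dropped.
merge_tblout() {
    out=$1; shift
    awk '!/^#/ && NF' "$@" > "$out"
}
```

The merged file for each genome would then be the input to cmsearch_tblout_deoverlap, for example after `merge_tblout chr10.tblout chr10.rand.*.tblout`.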

Bonus point: add secondary structure directly into the BED file as an additional field using the BED detail format. See RT90465 for background. Unfortunately I could not track down the code that was used for that ticket.

Rfam / rfam-production

Update trackhubs #119