ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

rVert assemblies #241

Open rchikhi opened 3 years ago

rchikhi commented 3 years ago

Results are here: s3://serratus-rayan/rVert-assembly/

Folder structure:

Data:

pro/                          -- Original .pro data
fasta/                        -- Converted to .fasta
all_pe.fa                     -- Merged into single files
all_se.fa

Results:

individual/                       -- RNAViralSPAdes assembly of each SRAs individually
rnaviralspades_coassembly_k*/     -- RNAViralSPAdes co-assembly of all_pe.fa+all_se.fa with given k values

Scripts:

assemble_individually.sh  -- command line of RNAViralSPAdes assembly of each SRAs individually
coassemble.sh             -- command line of RNAViralSPAdes co-assembly of all_pe.fa+all_se.fa
pro_to_fasta.py           -- Conversion of .pro.gz files to .fasta
pro_to_fasta.sh           -- wrapper
upload.sh                 -- S3 upload
rchikhi commented 3 years ago

cmdline (1):

  transeq -frame 6 $input $input.aa
  base=$(basename $input)
  ./motifator   -search_rdrp $input.aa -model rdrp_model.txt  \
                -tsvout results/$base.tsv \
                -report results/$base.txt \
                -fevout results/$base.fev \
                -medhionly \
                -trim_fastaout   results/$base.trim.LHF.fa \
                -motifs_fastaout results/$base.motifs.fa
asl commented 3 years ago

To be 100% explicit: no HMMs here were involved :)

rchikhi commented 3 years ago

Update: re-uploaded s3://serratus-rayan/rVert-assembly/motifator-results/individual/ which, up until this comment, contained the wrong files (I had mistakingly run motifator on the .pro reads and not contigs).

Now motifator has been run on the individual SRA's for unitigs (before_rr.fasta), contigs and scaffolds.

rchikhi commented 3 years ago

@rcedgar asked "how many contigs give motifator hits"? To attempt to answer this, I ran:

$ grep "high-conf" *.tsv |cut -d"." -f1 |sort|uniq|wc -l
3388
$ grep "medium-conf" *.tsv |cut -d"." -f1 |sort|uniq|wc -l
112

How many SRA accessions were in rVert:

$ ls ../../fasta/ |cut -d"." -f1 |sort|uniq|wc -l
70070

(turns out many .pro files are empty, at least 20k) UPDATE: Due to a bug I didn't create fasta files for .pro files containing a single read, will re-run but it shouldnt change results much

How many SRAs were assembled into empty unitigs:

$ find ../individual/ -name "*.before_rr.fasta" -empty|wc -l
51888

How many non-empty contigs:

$ find ../individual/ -name "*.contigs.fasta" |wc -l
18182

Thus 3388/18182=18.6% non-empty contigs have a high-confidence RdRp hit. but if you count all SRAs including empty contigs, that number drops to 4.8%.

rchikhi commented 3 years ago

Out of the 18k rVert non-empty individual assemblies, 501 of them have different filesizes between before_rr.fasta and contigs.fasta (cc @asl). the difference is typically not big (~100 bp).

An extreme example: SRR3999033 (12.2kbp vs 8.8kbp). Other smaller ones: DRR032780, SRR3289253, SRR5085421. Assemblies are in s3://serratus-rayan/rVert-assembly/individual/