ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
all .pro assemblies #242

Open rchikhi opened 3 years ago

rchikhi commented 3 years ago

This thread will be for updates of the .pro assemblies.

Number of .pro.gz files analyzed (all of s3://serratus-public/out/21 except *r1p):

5,726,283

Number of .fasta.gz files obtained after converting .pro to FASTA and discarding empty files:

3,379,127

rchikhi commented 3 years ago

Assemblies done (measured by before_rr.fasta existing):

3,378,813

(no idea why in ~300 cases, no before_rr.fasta was created)

Number of empty assemblies:

2,890,521

Thus, non-empty assemblies (i.e. both before_rr.fasta and contigs.fasta exist and are non-empty):

488,292 (14.4%)

For reference, 19% of the rVert assemblies were non-empty.
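
As a quick sanity check on the counts above, the arithmetic can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and the numbers are copied from this thread.

```python
# Counts copied from the comments above; variable names are illustrative only.
fasta_inputs = 3_379_127      # non-empty .fasta.gz files after .pro conversion
assemblies_done = 3_378_813   # runs where before_rr.fasta exists
empty_assemblies = 2_890_521  # runs whose assembly came out empty

# Runs with FASTA input but no before_rr.fasta (the "~300 cases").
missing = fasta_inputs - assemblies_done

# Non-empty assemblies and their share of the completed assemblies.
non_empty = assemblies_done - empty_assemblies
pct_non_empty = 100 * non_empty / assemblies_done

print(f"missing before_rr.fasta: {missing}")
print(f"non-empty assemblies: {non_empty} ({pct_non_empty:.2f}%)")
```

The share comes out at roughly 14.45%, in line with the figure quoted above.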

asl commented 3 years ago

@rchikhi

> (no idea why in ~300 cases, no before_rr.fasta was created)

Most likely the assembly failed. Can you collect a few logs from those runs?

rchikhi commented 3 years ago

Can do, let me just finish with the bulk of the results first.

Number of non-empty trim.LHF.fa motifator files:

168,460

rcedgar commented 3 years ago

Hi @rchikhi, minor feature request/suggestion for future runs: can you combine all micro-assemblies into one FASTA file? This file should not be too big, only around 1 GB or so. It would be easier to process on Linux than millions of small FASTAs or millions of directories, each with a small/empty FASTA. This would require embedding the SRA identifier in the sequence label (a.k.a. the FASTA defline), e.g. as a prefix >SRA1234567|NODE_1..., something like that.
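
A minimal sketch of what that suggestion could look like, assuming one FASTA file per accession. The function names, the `|` separator and the `fasta_by_accession` mapping are hypothetical choices for illustration, not part of the existing pipeline.

```python
import gzip
from pathlib import Path
from typing import Dict, Iterable, Iterator


def prefix_deflines(lines: Iterable[str], accession: str, sep: str = "|") -> Iterator[str]:
    """Yield FASTA lines with the accession prepended to every defline."""
    for line in lines:
        if line.startswith(">"):
            # Keep the original label after the separator, e.g. >SRA1234567|NODE_1...
            yield f">{accession}{sep}{line[1:]}"
        else:
            yield line


def combine(fasta_by_accession: Dict[str, Path], out_path: Path) -> None:
    """Concatenate per-accession FASTA files (plain or .gz) into one file."""
    with open(out_path, "w") as out:
        for accession, path in fasta_by_accession.items():
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rt") as handle:
                out.writelines(prefix_deflines(handle, accession))
```

Streaming line by line keeps memory use flat regardless of how many small files are merged.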

rchikhi commented 3 years ago

Data availability

Individual assemblies (excluding empty files):

s3://serratus-rayan/pro-assembly/individual/

Individual motifator analyses of the above assemblies:

s3://serratus-rayan/pro-assembly/individual_motifator/

For download convenience, the above two folders (assemblies and motifator analyses) are packaged into a tar.gz file each:

s3://serratus-rayan/pro-assembly/individual_assemblies.tar.gz
s3://serratus-rayan/pro-assembly/individual_motifator.tar.gz

All these folders are relatively small (~10 GB) but contain on the order of millions of files.
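
Since the archives hold millions of small files, it may be preferable to stream through a downloaded tarball rather than unpack it to disk. The sketch below assumes a local copy of one of the tar.gz files; the internal layout of the archives is not described here, so the code only reports each member's name and raw bytes. `iter_tar_members` is a hypothetical helper name.

```python
import tarfile
from typing import Iterator, Tuple


def iter_tar_members(tar_path: str) -> Iterator[Tuple[str, bytes]]:
    """Stream (member name, contents) pairs from a .tar.gz without extracting to disk."""
    # "r|gz" reads the archive sequentially, which avoids building a full index
    # and suits archives with very many small members.
    with tarfile.open(tar_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-regular entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield member.name, handle.read()


# Example usage (assumes the archive was downloaded to the working directory):
# for name, payload in iter_tar_members("individual_assemblies.tar.gz"):
#     print(name, len(payload))
```

In stream mode each member has to be read before moving on to the next one, which is what the generator does.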

rchikhi commented 3 years ago

In addition, for @rcedgar, here are all the motifator outputs (just the LHF files) concatenated into a single file:

s3://serratus-rayan/pro-assembly/all.before_rr.LHF.fasta
s3://serratus-rayan/pro-assembly/all.contigs.LHF.fasta

The SRR id is added as follows: >[SRR id][a single space][contig name], e.g. >SRR0123123 NODE_1_xxx.
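
For anyone parsing these concatenated files, here is a small sketch of splitting such a header back into its two parts, assuming the format described above (SRR id, a single space, contig name). The helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_defline(defline: str) -> Tuple[str, str]:
    """Split '>SRR0123123 NODE_1_xxx' into ('SRR0123123', 'NODE_1_xxx')."""
    if not defline.startswith(">"):
        raise ValueError(f"not a FASTA defline: {defline!r}")
    # Split on the first space only, in case the contig name itself has spaces.
    srr_id, _, contig_name = defline[1:].rstrip("\n").partition(" ")
    return srr_id, contig_name


def contigs_per_run(lines: Iterable[str]) -> Counter:
    """Count how many sequences each SRR id contributes to a concatenated FASTA."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            srr_id, _ = split_defline(line)
            counts[srr_id] += 1
    return counts
```

`contigs_per_run` accepts any iterable of lines, so an open file handle on one of the concatenated FASTA files can be passed directly.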

rchikhi commented 3 years ago

And concatenated unitigs/contigs:

s3://serratus-rayan/pro-assembly/all.before_rr.fasta
s3://serratus-rayan/pro-assembly/all.contigs.fasta

rchikhi commented 3 years ago

For reference, these assemblies were performed using this script:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/assemble_individually.sh

and motifator was run using this script:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/motifator_analyses/run_indiv.sh

rchikhi commented 3 years ago

Here's an exhaustive list of "reads" longer than 600 bp among the single-end libraries:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/all_se.above_600bp.txt

From that list I extracted the set of 719 accessions that are deemed not to be Illumina short reads:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/nonILMN.txt

rchikhi commented 3 years ago

Coverage analysis of the motifator hits within the .pro assemblies

s3://serratus-rayan/pro-assembly/depth_summary.csv

schema: sra, header, contig_type, p_cvg1, p_cvg2, p_cvg3-4, p_cvg5-8, p_cvg9plus

where p_cvgX is the percentage of bases of the region where coverage is >= X

Code used to generate those results:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/bed_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_summary.py
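
To make the p_cvgX definition concrete, here is a minimal sketch of the "percentage of bases where coverage is >= X" calculation for a single threshold. It is not taken from depth_summary.py; the function name and the per-base depth list are assumptions for illustration, and it does not address how the ranged columns (p_cvg3-4, p_cvg5-8) are derived.

```python
from typing import Sequence


def pct_bases_at_least(depths: Sequence[int], threshold: int) -> float:
    """Percentage of positions in a region whose depth is >= threshold."""
    if not depths:
        return 0.0  # an empty region has no covered bases
    covered = sum(1 for depth in depths if depth >= threshold)
    return 100.0 * covered / len(depths)


# Hypothetical per-base depths over an 8 bp region.
region_depths = [0, 1, 2, 5, 9, 12, 0, 3]
print(pct_bases_at_least(region_depths, 1))  # 6 of 8 bases have depth >= 1
print(pct_bases_at_least(region_depths, 9))  # 2 of 8 bases have depth >= 9
```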