Closed: rcedgar closed this issue 4 years ago.
The TSV seems to be broken:
SRR10951664 100 98 1143 2020-01-23 19:16:10 SRX7618621
SRR10951662 100 98 1197 2020-01-23 19:16:10 SRX7618620
SRR10951658 100 98 2035 2020-01-23 19:16:10 SRX7618618
SRR10951657 100 93 1120 2020-01-23 19:16:10 SRX7618617
SRR10951654 100 98 1855 2020-01-23 19:16:10 SRX7618616
Also, there are plenty of datasets with reads of 50 bp and even 36 bp (!). I would strongly suggest ignoring everything < 50 bp (and assembling < 76 bp reads only with sufficient coverage).
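The suggested policy above could be sketched as a small filter. This is only an illustration of the suggestion, not code from the pipeline; the 50x coverage cutoff for short reads is a placeholder value, not one stated in this thread:

```python
def assembly_policy(read_len_bp: int, coverage_x: float) -> str:
    """Sketch of the suggested policy: skip reads < 50 bp; assemble
    50-75 bp reads only with sufficient coverage.

    MIN_COV_SHORT is a hypothetical threshold, not from the thread."""
    MIN_COV_SHORT = 50.0
    if read_len_bp < 50:
        return "skip"
    if read_len_bp < 76 and coverage_x < MIN_COV_SHORT:
        return "skip"
    return "assemble"
```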
Bad lines. Agreed, there are some incorrect lines, sorry for that. The SRA accessions are correct; the metadata fields were included in the hope that they would be helpful, so feel free to ignore them otherwise. The errors are probably due to using the wrong options for the csvformat utility, which I use to convert RunInfo from CSV (a terrible, non-standardized format!) to TSV.
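For reference, converting CSV to TSV with a real CSV parser (rather than text substitution) avoids exactly this kind of field-misalignment bug, since quoted fields with embedded commas are handled correctly. A minimal sketch using Python's standard csv module (csvkit's csvformat with its tab-output option would be the command-line equivalent):

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Convert RunInfo-style CSV to TSV.

    Using csv.reader handles quoted fields (embedded commas, newlines)
    that a naive comma-to-tab substitution would break."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```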
Difficult to impossible datasets are included by design because we want to better understand what can (and cannot) be assembled best case. Very short reads and low coverage sets can be assembled by reference-guided alignment.
Assemblies should be uploaded to s3://serratus-public/ as follows:

Contigs should be in FASTA format under `assemblies/contigs/SRAxxxx.yyyy.fasta`, where `xxxx` is the SRA accession number and `yyyy` is the name of the assembly method. If the same assembler is used by different teams, each team should use a different `yyyy` name. This should include CoV contigs only; contigs for host and other virus families should be posted to `other/` if applicable.

Other files should be uploaded to `assemblies/other/SRAxxxx.yyyy/`. This is a directory you create for a given SRA+assembler combination. Other files such as graphs, CheckV reports etc. can be stored here. Choose any convenient convention for each assembler; please be consistent if possible.

Please make sure uploaded files are public-readable.
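The naming convention above can be sketched as a small helper for sanity-checking uploads. This is an illustrative sketch only; `s3_keys` is a hypothetical function name, not part of the project:

```python
def s3_keys(sra_acc: str, method: str) -> dict:
    """Build the expected S3 keys (in bucket serratus-public) for one
    SRA accession + assembly-method combination, per the layout above."""
    return {
        # CoV-only contigs in FASTA format
        "contigs": f"assemblies/contigs/{sra_acc}.{method}.fasta",
        # directory for graphs, CheckV reports, and other files
        "other_dir": f"assemblies/other/{sra_acc}.{method}/",
    }
```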
861 out of the 1000 accessions have been assembled with Minia as of today, and coronaSPAdes will run next. A few words on the S3 directory structure, which adheres to @rcedgar's proposal in the post above.

In `assemblies/contigs/`:

- `SRRxxx.minia.checkv_filtered.fa`: Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
- `SRRxxx.coronaspades.checkv_filtered.fa`: coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
- `SRRxxx.coronaspades.gene_clusters.fa`: coronaSPAdes' `gene_clusters.fasta` (you can think of these as contigs obtained by matching to a coronavirus HMM, but the construction process is more complex than that!)
- `SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa`: coronaSPAdes `gene_clusters.fasta` further filtered by CheckV

In `assemblies/other/SRRxxx.[assembler]/`:

- `SRRxxx.[assembler].contigs.fa.mfc`: unfiltered, straight-out-of-the-assembler, MFCompress'd contigs (Minia) or scaffolds (coronaSPAdes). This file contains the whole assembly of the dataset, hence host contigs and other viruses too.
- `SRRxxx.inputdata.txt`: some statistics about the reads (number of reads, FASTQ file size)
- `SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz`: output of CheckV on the whole assembly file (i.e. `contigs.fa.mfc`)
- `SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz`: output of CheckV on the `gene_clusters.fasta` file (for coronaSPAdes)
- `SRRxxx.[assembler].txt`: output log of the assembler

The magic that performed all this is https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly

I'll also mention that many files in `assemblies/contigs/` will be empty: this is by design. It means the assembler processed the dataset but CheckV found no viral contig in the assembly output.
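The CheckV-filtering step described above (keeping only contigs flagged as coronavirus hits) amounts to a simple FASTA filter. A minimal sketch, not the pipeline's actual code; `keep_ids` stands in for whatever set of contig IDs you extract from CheckV's output tables:

```python
def filter_fasta(fasta_text: str, keep_ids: set) -> str:
    """Keep only FASTA records whose header ID (first whitespace-
    delimited token after '>') is in keep_ids."""
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # decide per-record whether to keep the following sequence lines
            keep = line[1:].split()[0] in keep_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```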
Excellent!! Why "only" 861/1000, any problem with the remaining 139?
Ah yes, great question. Let me break down the rest: 99 are still running (those are the big files); for 21, the reads failed to download (those are even bigger files); the remaining ~19 datasets are mysterious failures that I'll keep investigating over the next few days. Update: some of those failures are because the reads are too short to be assembled with k=31.
As of now, 951/1000 minia assemblies are done (will investigate the missing 50 but with lower-priority), and 413/1000 coronaspades (more will arrive soon).
> `SRRxxx.coronaspades.gene_clusters.fa`: coronaSPAdes `gene_clusters.fasta` output (only the contigs matching the coronavirus HMM)
Note that entries there are likely to be longer than those in scaffolds. So you may want to use checkv on them as well to filter out short spurious matches and only keep fully assembled sequences
Alright, I can run CheckV on that output too. Is `gene_clusters.fa` a strict subset of `scaffolds.fasta`? If so, I wouldn't need to keep the intermediate CheckV output for that file as well.
No. It's not a subset. bgcSPAdes / coronaSPAdes does not work this way :) I would probably need to write a note showing what it is doing :)
That was just a guess :) ok! then I'll keep the output of checkv for gene_clusters as well.
967/1000 minia assemblies and 913/1000 coronaspades assemblies (which are of much higher quality) are done. Only 5 coronaspades assemblies and 16 minia assemblies are still running in the pipeline, so this 1K test will be over very soon, and we'll be ready to move to larger data.
1K test is done, 983 minia and 914 coronaspades assemblies, some stats follow. All assemblies considered here are the CheckV-filtered versions.
Number of datasets with X contigs
Five somewhat arbitrary categories:

A. 1 contig of length x > 25 kbp (perfect or near-perfect genomes)
B. >1 contigs of total length x > 25 kbp (need some work, but possibly salvageable genomes)
C. total length 5 kbp < x < 25 kbp (incomplete genomes)
D. >= 1 contigs not satisfying any of the above conditions
E. 0 contigs
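As a sketch, the category assignment above can be expressed as a small function over the CheckV-filtered contig lengths of one assembly (an illustration, not code from the analysis):

```python
def categorize(contig_lengths: list) -> str:
    """Assign an assembly to category A-E from its CheckV-filtered
    contig lengths in bp, following the thresholds above."""
    n = len(contig_lengths)
    total = sum(contig_lengths)
    if n == 0:
        return "E"  # no contigs at all
    if n == 1 and total > 25_000:
        return "A"  # single near-complete genome
    if n > 1 and total > 25_000:
        return "B"  # fragmented but possibly salvageable
    if 5_000 < total < 25_000:
        return "C"  # incomplete genome
    return "D"      # everything else
```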
I suggest we go ahead with all the other analyses (tax assignment, tree, etc.) using only the coronaSPAdes category A assemblies, and meanwhile, optionally, work on increasing the number of category A assemblies by salvaging from the other categories with improved methods (e.g. co-assembly).
@rchikhi So, these are checkv filtered contigs, not gene clusters for coronaSPAdes, right?
Ah yes, absolutely!
This is what it looks like for gene_clusters:
Raw analysis data: https://serratus-public.s3.amazonaws.com/assemblies/analysis/1K_stats.csv
Oh, yes. Since gene_clusters is unfiltered, then category B is inflated.
Yes. Going to fix that by adding gene_clusters.checkv_filtered to this plot ASAP
Here they are:
Bottom line: `gene_clusters.checkv_filtered.fa` are our best cat-A assemblies so far. They're all in s3://serratus-public/assemblies/contigs/ and I've moved `gene_clusters.fa` to s3://serratus-public/assemblies/other/

I've also prepared the list of all 135 cat-A `gene_clusters.checkv_filtered.fa` here: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v1.txt

Could be a good small dataset for testing Serratax, or misassembly detection, or basically everything that follows.
Update: a new batch of 3.2K assemblies was made yesterday. This time I only looked at `coronaspades.gene_clusters.checkv_filtered.fa`. Here are the numbers of assemblies per category, grouping together the previous 1K batch and the new 3.2K batch:
This brings the number of our category A assemblies (single-contig, length > 25 Kbp) to 388. Here is the new list: s3://serratus-public/assemblies/analysis/catA-v2.txt
It seems that roughly 10% of the assembled datasets lead to high-quality assemblies out of the box. Not a high rate, but still much better than 0%. With 10K-20K assembled datasets we should reach 1K assemblies.
Bonus: all category B assemblies (> 25 kbp total but split into 2 contigs or more) : s3://serratus-public/assemblies/analysis/catB-v2.txt
Test completed, closing issue.
Master TSV file for the first 1K assembler targets is here:
https://serratus-public.s3.amazonaws.com/rce/assembly/assembly_targets_first_1k.tsv
Fields in the tsv file are:
These are designed to include a wide variety of typical cases ranging from easy to assemble to impossible to assemble.