ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
254 stars · 33 forks

First 1k assembler targets #162

Closed · rcedgar closed this issue 4 years ago

rcedgar commented 4 years ago

Master TSV file for the first 1k assembler targets is here:

https://serratus-public.s3.amazonaws.com/rce/assembly/assembly_targets_first_1k.tsv

Fields in the tsv file are:

  1. SRA
  2. Summarizer score
  3. PctId with top reference sequence
  4. Number of bowtie2 alignments
  5. Read length
  6. SINGLE/PAIRED

These are designed to cover a wide variety of typical cases, ranging from easy to impossible to assemble.
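A minimal sketch of how one might parse rows against the six-field schema above; the field names are mine, inferred from the list, and the malformed-row check anticipates the stray metadata columns reported further down in this thread:

```python
import csv
import io

# Expected schema for assembly_targets_first_1k.tsv
# (field names are my own, inferred from the six fields listed above).
FIELDS = ["sra", "score", "pct_id", "n_alignments", "read_len", "layout"]

def parse_targets(tsv_text):
    """Yield one dict per well-formed row; skip rows with the wrong field count."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # malformed row (e.g. extra or shifted metadata columns)
        rec = dict(zip(FIELDS, row))
        rec["score"] = int(rec["score"])
        rec["pct_id"] = float(rec["pct_id"])
        rec["n_alignments"] = int(rec["n_alignments"])
        rec["read_len"] = int(rec["read_len"])
        yield rec
```

Rows that do not match the schema are silently dropped here; a real validator would probably log them instead.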

asl commented 4 years ago

The TSV seems to be broken:

SRR10951664     100     98      1143    2020-01-23 19:16:10     SRX7618621
SRR10951662     100     98      1197    2020-01-23 19:16:10     SRX7618620
SRR10951658     100     98      2035    2020-01-23 19:16:10     SRX7618618
SRR10951657     100     93      1120    2020-01-23 19:16:10     SRX7618617
SRR10951654     100     98      1855    2020-01-23 19:16:10     SRX7618616

Also, there are plenty of datasets with reads of 50 bp and even 36 bp (!). I would strongly suggest ignoring everything < 50 bp (and assembling reads < 76 bp only with sufficient coverage).
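The suggested read-length policy could be sketched as follows; the coverage threshold is a placeholder I chose for illustration, not a value from this thread:

```python
def assembly_policy(read_len, coverage):
    """Heuristic from the suggestion above: skip reads < 50 bp outright;
    assemble 50-75 bp reads only with sufficient coverage; assemble
    reads >= 76 bp unconditionally."""
    MIN_COVERAGE_FOR_SHORT_READS = 30  # assumed placeholder, not from the thread
    if read_len < 50:
        return "skip"
    if read_len < 76:
        if coverage >= MIN_COVERAGE_FOR_SHORT_READS:
            return "assemble"
        return "skip"
    return "assemble"
```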

rcedgar commented 4 years ago
  1. Bad lines. Agreed, there are some incorrect lines, sorry about that. The SRA accessions are correct; the metadata fields were included in the hope that they would be helpful, so feel free to ignore them otherwise. The errors are probably due to using the wrong options for the csvformat utility, which I use to convert RunInfo from CSV (a terrible, non-standardized format!) to TSV.

  2. Difficult-to-impossible datasets are included by design, because we want to better understand what can (and cannot) be assembled in the best case. Very short reads and low-coverage sets can be assembled by reference-guided alignment.
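On the CSV-to-TSV conversion mentioned in point 1: a quote-aware parser avoids the field shifts that naive comma splitting produces when RunInfo values contain embedded commas. A minimal sketch using Python's standard csv module (as an alternative to csvformat):

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert RunInfo-style CSV to TSV with the csv module's quote-aware
    reader, so fields containing embedded commas stay in one column."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(csv_text))
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```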

rcedgar commented 4 years ago

Assemblies should be uploaded to s3://serratus-public/ as follows:

Contigs should be in FASTA format under assemblies/contigs/SRAxxxx.yyyy.fasta, where xxxx is the SRA accession number and yyyy is the name of the assembly method. If the same assembler is used by different teams, each team should use a different yyyy name. This should include CoV contigs only; contigs for host and other virus families should be posted to other/ if applicable.

Other files should be uploaded to assemblies/other/SRAxxxx.yyyy/. This is a directory you create for a given SRA+assembler combination. Other files, such as graphs, CheckV reports, etc., can be stored here. Choose any convenient convention for each assembler; please be consistent if possible.

Please make sure uploaded files are public-readable.
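The naming convention above can be captured in two small helpers; the function names are mine, and only the path layout comes from the proposal:

```python
def contig_key(sra, assembler):
    """S3 key for CheckV-filtered CoV contigs, per the convention above:
    assemblies/contigs/<SRA accession>.<assembler>.fasta"""
    return f"assemblies/contigs/{sra}.{assembler}.fasta"

def other_prefix(sra, assembler):
    """S3 directory prefix for auxiliary files (graphs, CheckV reports, ...)
    for a given SRA+assembler combination."""
    return f"assemblies/other/{sra}.{assembler}/"
```

Keeping the key construction in one place makes it harder for different teams to drift apart on the layout.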

rchikhi commented 4 years ago

861 out of the 1000 accessions have been assembled with Minia as of today; I will run coronaSPAdes next. A few words on the S3 directory structure: it adheres to @rcedgar's proposal in the post above:

in assemblies/contigs/:

in assemblies/other/SRRxxx.[assembler]/:

The pipeline that performed this is https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly

rchikhi commented 4 years ago

I'll mention also that many files in assemblies/contigs/ will be empty: this is by design. It means that the assembler processed the dataset yet CheckV found no viral contig in the assembly output.

rcedgar commented 4 years ago

Excellent!! Why "only" 861/1000, any problem with the remaining 139?

rchikhi commented 4 years ago

Ah yes, great question. Let me break down the rest: 99 are still running (those are the big files); 21 had reads that failed to download (those are even bigger files); the rest (~19 datasets) are mysterious failures that I'll keep investigating over the next few days. Update: some of those failures are because the reads are too short to be assembled with k=31.

rchikhi commented 4 years ago

As of now, 951/1000 Minia assemblies are done (I will investigate the missing ~50, but at lower priority), and 413/1000 coronaSPAdes (more will arrive soon).

asl commented 4 years ago

SRRxxx.coronaspades.gene_clusters.fa coronaSPAdes gene_clusters.fasta output (only the contigs matching coronavirus HMM)

Note that entries there are likely to be longer than those in scaffolds, so you may want to run CheckV on them as well, to filter out short spurious matches and keep only fully assembled sequences.

rchikhi commented 4 years ago

Alright, I can run CheckV on that output too. Is gene_clusters.fa a strict subset of scaffolds.fasta? If so, I wouldn't need to keep the intermediate CheckV output for that file as well.

asl commented 4 years ago

No, it's not a subset; bgcSPAdes / coronaSPAdes does not work that way :) I should probably write a note explaining what it is doing :)

rchikhi commented 4 years ago

That was just a guess :) OK! Then I'll keep the CheckV output for gene_clusters as well.

rchikhi commented 4 years ago

967/1000 Minia assemblies and 913/1000 coronaSPAdes assemblies (which are of much higher quality). Only 5 coronaSPAdes and 16 Minia assemblies are still running in the pipeline, so this 1K test should be considered nearly complete, and we are ready to move to larger data.

rchikhi commented 4 years ago

The 1K test is done: 983 Minia and 914 coronaSPAdes assemblies; some stats follow. All assemblies considered here are the CheckV-filtered versions.

[figure: number of datasets with X contigs]

5 somewhat arbitrary categories:

  A. 1 contig of length x > 25 kbp (perfect or near-perfect genomes)
  B. >1 contigs of total length x > 25 kbp (need some work but possibly salvageable genomes)
  C. total length 5 kbp < x < 25 kbp (incomplete genomes)
  D. >= 1 contigs not satisfying any of the above conditions
  E. 0 contigs
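The category definitions above can be sketched as a small classifier over an assembly's contig lengths (function name and exact boundary handling are mine):

```python
def categorize(contig_lengths):
    """Assign a category A-E to an assembly from its list of contig
    lengths (in bp), following the definitions above."""
    n = len(contig_lengths)
    total = sum(contig_lengths)
    if n == 0:
        return "E"                      # no contigs at all
    if n == 1 and total > 25_000:
        return "A"                      # single near-complete genome
    if n > 1 and total > 25_000:
        return "B"                      # fragmented but possibly salvageable
    if 5_000 < total < 25_000:
        return "C"                      # incomplete genome
    return "D"                          # everything else
```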

[figure: number of assemblies per category]

rchikhi commented 4 years ago

I suggest we go ahead with all the other analyses (tax assignment, tree, etc.) using only the coronaSPAdes cat-A assemblies, and meanwhile, optionally, work on increasing the number of cat-A assemblies by salvaging from the other categories using improved methods (e.g. co-assembly).

asl commented 4 years ago

@rchikhi So, these are checkv filtered contigs, not gene clusters for coronaSPAdes, right?

rchikhi commented 4 years ago

Ah yes, absolutely!

rchikhi commented 4 years ago

This is what it looks like for gene_clusters:

[figures: gene_clusters assemblies per category]

rchikhi commented 4 years ago

Raw analysis data: https://serratus-public.s3.amazonaws.com/assemblies/analysis/1K_stats.csv

asl commented 4 years ago

Oh, yes. Since gene_clusters is unfiltered, category B is inflated.

rchikhi commented 4 years ago

Yes. I'm going to fix that by adding gene_clusters.checkv_filtered to this plot ASAP.

rchikhi commented 4 years ago

Here they are:

[figures: categories for gene_clusters.checkv_filtered]

Bottom line: gene_clusters.checkv_filtered.fa are our best cat-A assemblies so far. They're all in s3://serratus-public/assemblies/contigs/ and I've moved gene_clusters.fa to s3://serratus-public/assemblies/other/

I've also prepared the list of all 135 cat-A gene_clusters.checkv_filtered.fa here: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v1.txt

Could be a good small dataset for testing Serratax, misassembly detection, or basically everything that follows.

rchikhi commented 4 years ago

Update: a new batch of 3.2k assemblies was made yesterday. This time I only looked at coronaspades.gene_clusters.checkv_filtered.fa. Here are the numbers of assemblies per category, grouping together the previous 1K batch and the new 3.2K batch:

[figure: assemblies per category, 1K + 3.2K batches combined]

This brings the number of our category A assemblies (single-contig, length > 25 Kbp) to 388. Here is the new list: s3://serratus-public/assemblies/analysis/catA-v2.txt

It seems that roughly 10% of the assembled datasets yield high-quality assemblies out of the box. Not a high rate, but still much better than 0%. With 10k-20k assembled datasets we should reach 1k category-A assemblies.

Bonus: all category B assemblies (> 25 kbp total but split into 2 contigs or more): s3://serratus-public/assemblies/analysis/catB-v2.txt

rcedgar commented 4 years ago

Test completed, closing issue.