ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Migrate assembly data to lovelywater #237

Open ababaian opened 3 years ago

ababaian commented 3 years ago

We need to migrate all the assembly and annotation data generated as part of Serratus to our data lake in a structured way, so as to allow programmatic access. This is a proposed folder hierarchy for discussion, where $SRA is the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, where every file name carries a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   ├── cov/          # .fasta  : Assembled/filtered coronaviruses
│   ├── contigs/      # CoronaSPAdes output, contigs, graphs, stats...
│   └── annotation/   # CoV annotation and taxonomic assignments
├── cov_index.tsv     # Index file of CoV+ libraries
└── assembly_index.tsv  # Index file of assembled SRA libraries

assembly/cov/$SRA.cov.fa : Contigs identified to be CoV (i.e. what the 12K paper is based on)

contigs/ : The coronaSPAdes output files such as $SRA.inputdata.txt, $SRA.coronaspdes.txt, $SRA.coronaspdes.gene_clusters.fa ... $SRA.coronaspdes.assembly_graph_with_scaffolds.gfa.gz

annotation/

gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing cov/ contig/ and annotation/ data but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
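A minimal sketch of what such a grabSRA helper could look like, written here in Python rather than shell. It assumes the flat layout proposed above, that the AWS CLI is installed, and that every file name starts with the accession followed by a dot. The names `grab_commands`, `FOLDERS` and `dest` are hypothetical and not part of Serratus. The sketch only builds the commands; running them is left to the caller.

```python
import shlex

BUCKET = "s3://lovelywater"
# Assumption: the three flat folders from the proposed hierarchy.
FOLDERS = ("assembly/cov", "assembly/contigs", "assembly/annotation")


def grab_commands(sra, dest="."):
    """Build one `aws s3 cp` command per folder, selecting files prefixed with the accession."""
    commands = []
    for folder in FOLDERS:
        commands.append([
            "aws", "s3", "cp", f"{BUCKET}/{folder}/", dest,
            "--recursive",
            "--exclude", "*",          # drop everything ...
            "--include", f"{sra}.*",   # ... except files starting with "<accession>."
        ])
    return commands


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
    for cmd in grab_commands("SRR123456"):
        print(shlex.join(cmd))
```

Including the trailing dot in the pattern keeps `SRR123456` from also matching a longer accession such as `SRR1234567`.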

rchikhi commented 3 years ago

it's all staged in s3://serratus-rayan/lovelywater/assembly, please have a look before transferring to lovelywater.

Name          Size
annotation/   73.8 GB
cov/          169.2 MB
contigs/      4.0 TB

rchikhi commented 3 years ago

TODO for me next:

taltman commented 3 years ago

The README.md in the top-level of lovelywater is out-of-sync with the bucket directory structure.

ababaian commented 3 years ago

Most recent version is always on the Data Access Page

taltman commented 3 years ago

That page is also inconsistent. In Naming Conventions, it uses as an example, s3://lovelywater/contig/SRA123456.fa. In the Folder Organization section, there is no such folder contig, and there is no such directory in the bucket (as far as I can see).

ababaian commented 3 years ago

The assembly data has not been migrated yet; once that's done, this issue can be closed.

edit: updated the access page to reflect the situation on the ground

rchikhi commented 3 years ago

Satellite assemblies have been migrated to s3://serratus-rayan/lovelywater/assembly/contigs, i.e. the same location as other CoV assembly data. For some reason, I can't find the satellites' scaffolds.fasta files; only the gene_clusters.fasta files are present. I suspect I never copied scaffolds.fasta to S3 (likely due to a past bug that has recently been fixed), and it's likely that we were only interested in gene_clusters.fasta during the satellite analysis.

ababaian commented 3 years ago

c'est la vie. Is this the complete collection of assemblies then?

rchikhi commented 3 years ago

nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over

rchikhi commented 3 years ago

done! dicistro, quenya, satellites assemblies are copied.

total number of accessions assembled in s3://serratus-rayan/lovelywater/assembly/contigs: 56,071
total size of s3://serratus-rayan/lovelywater/: 4.9 TB
scaffolds from CoV assemblies (MFC-compressed): 0.9 TB
scaffolds from other assemblies (gzip-compressed): 0.2 TB
assembly graphs (gzip-compressed): 1.6 TB (These could be deleted, but keeping them would make it possible to quickly regenerate assemblies, e.g. after a coronaSPAdes update, or to recover the missing scaffolds.fasta files)

Darth annotations of checkv-filtered gene_clusters (gzip-compressed): 2.0 TB

Some of those somehow made their way to the contigs/ folder. Among these, some contain a huge BAM file of reads aligned to contigs, hence the space usage. This was needed for quality control. They could be deleted, since for each of them there is another gzip file without the BAM file. Two options:
1) delete the large BAM-containing Darth archives and move the small ones into the annotation/ folder
2) keep everything and move all Darth stuff to the annotation/ folder

Any preference?

rchikhi commented 3 years ago

Also there is the 1k subset of accession assemblies found by the .pro analysis, wanna include it?

ababaian commented 3 years ago

yes

rchikhi commented 3 years ago

1ksubset: migration done

rchikhi commented 3 years ago

after some Slack discussions:

so I think we're done

rchikhi commented 3 years ago

hold on, i'll also move checkV analysis from contigs/ to annotation/

rchikhi commented 3 years ago

done! Here's the final content of

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
└── assembly/         # Viral assembly and annotation data
    ├── cov/          # .fasta  : Assembled/filtered coronaviruses
    ├── contigs/      # CoronaSPAdes output, contigs, graphs, stats...
    └── annotation/   # CoV annotation and taxonomic assignments

as staged in s3://serratus-rayan/lovelywater/assembly/.

assembly/cov:

These are the 11,120 coronavirus assemblies made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus' manuscript for more details.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes. Depending on the assembler, a subset of these files will be present for each accession. Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
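The naming scheme above can be split mechanically into accession, assembler and file kind. The sketch below is one way to do that in Python. It assumes the `[assembler]` token appears as a single dot-free word (for example `coronaspades`), which the listing suggests but does not confirm. `ContigFile`, `parse_contig_name` and `holds_scaffolds` are hypothetical names.

```python
from typing import NamedTuple


class ContigFile(NamedTuple):
    accession: str   # e.g. "SRR123456"
    assembler: str   # assumed to be a single token without dots
    kind: str        # everything after the assembler token


def parse_contig_name(filename: str) -> ContigFile:
    """Split '<accession>.<assembler>.<kind>' on its first two dots."""
    parts = filename.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ContigFile(*parts)


def holds_scaffolds(f: ContigFile) -> bool:
    # Per the note above, contigs.fa.mfc actually stores scaffolds.fasta content.
    return f.kind in ("scaffolds.fasta.gz", "contigs.fa.mfc")
```

For example, `parse_contig_name("SRR123456.coronaspades.contigs.fa.mfc")` yields the kind `contigs.fa.mfc`, which `holds_scaffolds` treats as scaffold data.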

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
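Since these tables are gzip-compressed TSV files, they can be read without unpacking them first. A minimal sketch using only the Python standard library follows. It assumes each file starts with a header row and makes no assumption about the column names. `read_tsv_gz` is a hypothetical helper name.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Return the rows of a gzip-compressed, tab-separated file as dictionaries."""
    # "rt" opens the archive in text mode; newline="" lets the csv module handle line endings.
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

A call such as `read_tsv_gz("SRR123456.coronaspades.checkv.quality_summary.tsv.gz")` (a made-up local path) would return one dictionary per row, keyed by whatever the header row contains.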

serraplace (taxonomic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
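
Because annotation/ mixes the outputs of several programs, it can help to sort files by the tool that produced them. The sketch below does this from the file name alone, by looking for the tool name among the dot-separated tokens. This relies only on the naming patterns listed above. `TOOLS` and `annotation_tool` are hypothetical names.

```python
# Tool names as they appear as whole dot-separated tokens in the listings above.
TOOLS = ("darth", "serraplace", "serratax", "checkv")


def annotation_tool(filename):
    """Return the tool whose name is a dot-separated token of the file name, or None."""
    tokens = filename.split(".")
    for tool in TOOLS:
        if tool in tokens:
            return tool
    return None
```

Matching whole tokens means `checkv_filtered` in a serraplace or serratax file name is not mistaken for a CheckV table.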
ababaian commented 3 years ago

I'll begin data migration shortly!

ababaian commented 3 years ago

Take a look at s3://lovelywater/assembly/ and let me know if that looks alright.

Also updated the

If that looks good then close this baby!

taltman commented 3 years ago

What's the status on this? Should I be pulling data from s3://serratus-rayan/lovelywater/assembly/cov/ or s3://lovelywater/assembly/cov/?

ababaian commented 3 years ago

Either is fine; they are identical. Migration is now complete. I think we're good to close this @rchikhi

rchikhi commented 3 years ago

Same number of files and size as my folder, looks good

Total Objects: 671859
   Total Size: 3.2 TiB
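
The two lines above look like the summary printed by `aws s3 ls --recursive --summarize --human-readable`; that is an assumption, as the thread does not say how they were produced. Under that assumption, a comparison of two locations could be sketched as follows, with `parse_summary` and `same_content` as hypothetical names.

```python
import re


def parse_summary(text):
    """Extract (object count, human-readable size) from a listing summary."""
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*(.+)", text)
    if not objects or not size:
        raise ValueError("summary lines not found")
    return int(objects.group(1)), size.group(1).strip()


def same_content(summary_a, summary_b):
    """True when both summaries report the same object count and size string."""
    return parse_summary(summary_a) == parse_summary(summary_b)
```

Human-readable sizes are rounded, so a match here is a quick sanity check rather than proof that the two locations are identical.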
rchikhi commented 3 years ago

so, this issue is closed, yet I noticed today that we never deleted anything from the original location s3://serratus-public/assemblies (though the staged location s3://serratus-rayan/lovelywater got correctly cleared). The original location still contains all the migrated data plus some other less useful, non-migrated accessions, like those with partially failed assemblies, a few minia assemblies that coronaspades didn't assemble, etc. I see 48268 coronaspades assemblies on lovelywater and 51756 coronaspades folders on serratus-public (possibly empty in some cases). @ababaian, a few options:
1) delete from s3://serratus-public/assemblies only the migrated stuff
2) delete everything from s3://serratus-public/assemblies
3) keep s3://serratus-public/assemblies for some reason

I'd go for 1)
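
For option 1, the keys to delete could be selected by accession. The sketch below assumes only that each key contains an SRA run accession (SRR, ERR or DRR followed by digits) somewhere in its path; the actual key layout of either bucket is not described in this thread. `accession_of` and `split_by_migration` are hypothetical names.

```python
import re

# SRA run accessions: SRR / ERR / DRR followed by digits.
ACCESSION = re.compile(r"[SED]RR\d+")


def accession_of(key):
    """Return the first run accession found in an object key, or None."""
    match = ACCESSION.search(key)
    return match.group(0) if match else None


def split_by_migration(original_keys, migrated_keys):
    """Split original keys into (already migrated, not migrated) by accession."""
    migrated = {accession_of(k) for k in migrated_keys} - {None}
    to_delete, left_over = [], []
    for key in original_keys:
        target = to_delete if accession_of(key) in migrated else left_over
        target.append(key)
    return to_delete, left_over
```

Keys without a recognizable accession end up in the second list, so nothing is selected for deletion unless its accession is known to exist in the migrated location.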

ababaian commented 3 years ago

One consideration is that serratus-public currently has version control, so you have to do a 2-pass deletion (delete file, and delete history) to remove data. We do need to do this, but I've been delaying until the paper is "done" so we don't whoopsy and lose some data we need. My take: let's go with (2) once the paper is done. I'll reopen the issue.
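
On a versioned bucket, a plain delete only adds a delete marker, so permanent removal means deleting every object version and every delete marker by its version ID. A hedged sketch of that follows. It expects a boto3 S3 client to be passed in and uses its `list_object_versions` paginator and `delete_objects` call; the function names, the `dry_run` flag and the batching helper are hypothetical. `delete_objects` accepts at most 1000 keys per request, hence the batching.

```python
def version_entries(page):
    """Collect Key/VersionId pairs from one list_object_versions response page."""
    entries = []
    for field in ("Versions", "DeleteMarkers"):
        for item in page.get(field, []):
            entries.append({"Key": item["Key"], "VersionId": item["VersionId"]})
    return entries


def batches(entries, size=1000):
    """Yield slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def purge_prefix(client, bucket, prefix, dry_run=True):
    """Remove all versions and delete markers under a prefix; return how many were targeted."""
    paginator = client.get_paginator("list_object_versions")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for batch in batches(version_entries(page)):
            total += len(batch)
            if not dry_run:  # only touch the bucket when explicitly asked to
                client.delete_objects(
                    Bucket=bucket, Delete={"Objects": batch, "Quiet": True}
                )
    return total
```

With the default `dry_run=True`, a call like `purge_prefix(boto3.client("s3"), "serratus-public", "assemblies/")` only counts what would be removed, which suits the wait-until-the-paper-is-done caution above.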