ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Run Meta-data and Organization #83

Closed · rcedgar closed this issue 4 years ago

rcedgar commented 4 years ago

I think we should re-consider the folder naming conventions. Currently, the only way to find a given SRA is to list everything under /out and grep the accession. Also, there is no metadata (AFAIK) that records the parameters (pan-genome etc.) used to run each directory. Automated access to the current structure is going to be increasingly difficult as we scale up.

I suggest the following structure.

(+) There is one top-level folder for each file type: bam/, trip/, summary/ etc.

(+) Each run has an identifier, e.g. zoo1, zoo2, megatest1 etc. Parameters for the run would be stored in a per-run directory, say:

run_desc/zoo1.txt
run_desc/zoo2.txt
run_desc/megatest1.txt
...

(+) For each run, the SRA run-info tsvs are provided:

run_info/zoo1.RunInfo.tsv
run_info/zoo2.RunInfo.tsv
...

Ideally this would be machine-readable metadata (e.g., the name of the pan-genome reference); for now an informal description would be fine (better than nothing!).

(+) Within each top-level folder, subdirectory names are constructed from substrings of the SRA accession, e.g. for SRA123456 use SRA12/34/. This keeps directory listings at a manageable size when there are millions of datasets.

(+) Filenames are constructed from SRA accession plus run identifier, e.g.

bam/SRA12/34/SRA123456.zoo1.bam
bam/SRA12/34/SRA123456.megatest1.bam
trip/SRA12/34/SRA123456.zoo1.tsv
summary/SRA12/34/SRA123456.zoo1.tsv

With this approach, bam file(s) for SRA123456 match the pattern bam/SRA12/34/SRA123456.*, or do not exist; the path construction is sketched below.
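A minimal bash sketch of this path construction (the 5-character/2-character split and the zoo1 batch name are taken from the examples above):

# Build the sharded S3 key for an accession + run identifier.
acc=SRA123456
run=zoo1
path="bam/${acc:0:5}/${acc:5:2}/${acc}.${run}.bam"
echo "$path"    # -> bam/SRA12/34/SRA123456.zoo1.bam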

@ababaian: I'm thinking we put all this data in a separate bucket altogether that is ReadOnly except for us so there's no accidental overwrites or anything like that. RCE: agreed.

ababaian commented 4 years ago

So right now the .bam file header contains some of this meta-data, but it is not easily accessible. I would say all per-file meta-data should be stored in the header area of the summary files. We can then generate a database, in R or SQLite, by reading and parsing the summary files; a rough sketch is below.
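A rough sketch of that parse-and-load step in shell + SQLite. The summary format is still in flux in this thread, so the score= header field below is purely hypothetical:

# Collect one row per summary file (grep -oP needs GNU grep;
# 'score=' is a made-up header field, not the real format).
for f in summary/*.summary; do
  acc=$(basename "$f" .summary)
  score=$(grep -m1 -oP 'score=\K[0-9]+' "$f")
  printf '%s\t%s\n' "$acc" "$score"
done > summaries.tsv

# Load into SQLite for ad-hoc queries.
sqlite3 serratus.db <<'EOF'
CREATE TABLE IF NOT EXISTS summary (acc TEXT PRIMARY KEY, score INTEGER);
.mode tabs
.import summaries.tsv summary
EOF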

Some information I would like to see included (and this repeats from above):

Let's brainstorm about this for a little bit as we go through the data now and later in the week we can have a call to formalize what we want and we can coordinate implementation.

rcedgar commented 4 years ago

I would like to see a report with one line per SRA accession, sorted by decreasing detection score. Fields would include those you mention above, plus key information about the reads (2x150 HiSeq or whatever), plus the full-length reference genome (or fragment) with the best coverage and the average %id between the reads and the reference sequence. This report is a summary of the summaries -- we need a new name for it to avoid confusion. How about Montserrat? I would regard this report as the primary output of Serratus.
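For concreteness, a hypothetical header line for such a report -- every column name below is illustrative only, not something settled in this thread:

sra_accession  score  read_layout  instrument  best_reference  coverage  avg_pct_id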

ababaian commented 4 years ago

I like the name Montserrat :)

All my software naming conventions use local mountain names; you're super lucky in that you get some of the world's most famous mountains with that, like HalfDome or ElCapitan :P

As an aside, I've started writing an R parser for summary files called SummarySummarizer, but maybe that's a bit too tongue-in-cheek :P. This is a bit different from what you're suggesting, but it's data.frames and classes for accessing and visualizing the data in summary files.

rcedgar commented 4 years ago

FYI, I'm hoping to implement a major redesign of the summarizer soon because the current version doesn't handle fragments or multiple families very well. Also, it should clearly flag a big hit (a close relative of CoV-2), and it may not do a good job of that right now. I view the file format as subject to change, so if you're writing code that requires the current format, maybe we should discuss soon.

ababaian commented 4 years ago

It's adaptable: I'll parse the summary file (subject to change) into data.frames and only work with the data.frames. It's fine; I mainly want to get a feel for the data.

ababaian commented 4 years ago

s3://lovelywater/ Bucket Organization

As per discussion, we have s3://serratus-public/, which is our "work" bucket. To actually host and serve the data we can use another bucket, which will also house the data for access via R and the website. This also serves as an access point for sharing the data with collaborators/researchers.


s3://lovelywater/ README.md

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── bam/              # .bam    : Aligned files
├── contig/           # .fasta  : Assembled contig sequences (In Dev)
├── graph/            # .*      : graph data from assembly (In Dev)
├── R/                # .Rdata  : Tantalus R-objects (per query)
├── summary/          # .summary: Alignment summaries
├── seq/              # Reference sequences used in data-releases      
│   └─── cov3ma/      
├── sra/              # sraRunInfo.csv files and queries for data (per query)
│   └─── README.md    # Search queries used to generate sra lists     
├ INDEX.tsv           # Index file of completed SRA accessions
└ README.md           # This README.md

*Query Set*
├ human               # Human (public) RNAseq and metagenomics
├ mouse               # Mouse RNAseq and metagenomics
├ mamm                # Mammalian RNAseq and metagenomics (excl. mouse/human)
├ vert                # Vertebrate RNAseq and metagenomics (excl. mamm/mouse/human)
└ viro                # "Virome" data

## Naming Convention
All folders are flat, with files named `{sra_accession}.{ext}`

For example, the SRA library `SRA123456` processed in the 'viro' query will have the files:
- s3://lovelywater/bam/SRA123456.bam
- s3://lovelywater/contig/SRA123456.fa
- s3://lovelywater/summary/SRA123456.summary
and will be contained in its experiment objects, such as
- s3://lovelywater/R/viro.Rdata

## Sub-setting data

S3 is not a filesystem; having 1 million+ objects in a folder is OK.

To find and access a sub-set use the index file:
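For example, a hypothetical one-liner, assuming the first column of INDEX.tsv is the SRA accession (the column layout isn't pinned down in this thread):

grep -F -f my_accessions.txt INDEX.tsv | cut -f1 \
  | xargs -I{} aws s3 cp s3://lovelywater/bam/{}.bam .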

rcedgar commented 4 years ago

This scheme is simpler than my earlier proposal, which is a big plus. Understood that S3 can handle very large numbers of objects, but these will often be downloaded to conventional filesystems. I'm an AWS novice, but FWIW I'm finding it cumbersome to deal with very large folders, e.g. the 60k bams in the last zoonotic run. aws cli operations like ls and sync are slow on them, especially if you keep a local copy of an S3 folder.

What is fa? Contigs or what? Maybe a more explanatory name?

rcedgar commented 4 years ago

What about setting some criteria where we just say nothing was detected in the dataset? This can be a tsv file explaining why (e.g., too few reads remaining after blacklist filtering), in which case we don't need to populate bam etc. This would drastically reduce the number of objects / files.

ababaian commented 4 years ago

I think for the summaries, a gzip copy of each experiment would be warranted for exactly this purpose, e.g. summary.gz/zoo1.summary.gz.

The empty bam files are kind of nice in that they still contain quite a bit of meta-data in the headers. The "problem" is the @SQ entries, and I still haven't figured out a solution to that which is not complex/destructive.

I do see the problem of functionally empty bam files. There are hacks to exclude them, like only grabbing files above a certain size, but nothing clean. I don't want to exclude them, that's for sure, because a missing file implies the run was never done.

rcedgar commented 4 years ago

Strongly disagree with everything in the last comment. The BAM header is almost identical in every bam file for a given batch; offhand, the only non-identical thing I can think of is the SRA accession. Everything in the bam header could be reconstructed from the SRA accession plus the meta-data for the batch, and the latter should be stored once. The BAM header should be eliminated entirely by stripping the SAM headers from the bowtie2 output and compressing the rest with gzip (or hopefully a more efficient compression method). This would greatly reduce storage requirements for "empty" alignment files. Constructing a BAM file for tools that require it (IGV etc., assemblers) can be automated with samtools reheader if we provide the required header in the directory for each batch, which of course we should do (see the sketch below). The convenience of downloading BAM directly from S3 does not justify the huge additional overhead in S3 storage. A web interface can automate the process of presenting a BAM to the user, if needed.
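A minimal sketch of that strip-and-reconstruct round trip, assuming samtools is available and the shared per-batch header is stored once as, say, zoo1.header.sam (a hypothetical filename):

# Strip: drop the header, keep the alignment records only, compress.
samtools view SRR999276.bam | gzip > SRR999276.sam.gz

# Reconstruct a valid BAM on demand (IGV, assemblers, etc.) by
# prepending the shared batch header.
zcat SRR999276.sam.gz | cat zoo1.header.sam - | samtools view -b -o SRR999276.bam -

(samtools reheader itself operates on an intact BAM; once the header has been stripped, prepending the stored header as above is the simplest reconstruction.)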

In my scheme, the filename includes the batch name (e.g., zoo1). The zoo1 directory has meta-data including the bowtie2 command line, the name of the FASTA file for the reference, the SAM header, etc. The reference FASTA has the sequences and their lengths. Non-trivial data is stored once; elsewhere it is referenced by a short identifier instead of a bulk copy.

For the large majority of datasets where we find nothing, a tsv file with one line per dataset will fully document what we did and why we rejected the dataset. We can include a few key statistics such as the number of alignments, how many of them were rejected in post-processing (blacklist), etc.

rcedgar commented 4 years ago

This may be off-topic for this issue, but it's related: conceptually, I view the goal of the first Serratus pass as identifying a set of candidates for further analysis. The results can be summarized very compactly without any alignments. Ideally, there would be a second pass over the candidates which is much more thorough. For example, I would identify an optimal sub-set of the reference sequences and provide alignments to those rather than the original reference. Currently, bowtie2 scatters hits across many genomes because there are many ties for the best match. The reference for each dataset can be optimized by a process like this (sketched below):

1. Extract the closest reference genome and align all reads to it.
2. Take the unaligned reads and find the reference sequence which has the best coverage of them; add it to the reference.
3. Repeat until all reads are aligned or a minimum is reached.

This creates a dataset-specific reference with far fewer sequences, where the BAM file is immediately suitable for reference-based assembly. Now the BAM/SAM headers are dataset-specific and biologically informative: they tell us which known sequences were needed to cover the virus genome.
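A rough shell sketch of that greedy loop, purely illustrative: pick_best_reference is a hypothetical helper (e.g. one that ranks pan-genome sequences by coverage of the remaining reads and prints the path to the winner's FASTA), and all file names are placeholders:

cp candidate_reads.fq unaligned.fq
> dataset_ref.fa                        # start with an empty reference set
while [ -s unaligned.fq ]; do
  # Hypothetical helper: print the path to a FASTA of the pan-genome
  # sequence with the best coverage of the still-unaligned reads.
  best=$(pick_best_reference unaligned.fq cov3ma.fa)
  [ -z "$best" ] && break               # stop when no reference helps any more
  cat "$best" >> dataset_ref.fa
  bowtie2-build dataset_ref.fa dataset_ref
  bowtie2 -x dataset_ref -U candidate_reads.fq --un unaligned.fq -S hits.sam
done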

ababaian commented 4 years ago

I think we can discuss this in detail at the assembly meeting. Hold that thought for a few hours; I'll sketch out what I'm thinking the next steps will look like and we can base the discussion on that.

brietaylor commented 4 years ago

Hey, just following along. You're both making some good points and it looks like you're thinking about the right things, so I'll just echo what I think is important and maybe add some ideas. I don't really follow the biology details, though, so if I say something silly feel free to ignore me. :)

I think Robert raises a good point that usability by non-AWS wizzes is important. The folder structure should reflect something that's easy to follow, though S3's unique behaviour may make this challenging.

Something to keep in mind: S3 does not use directories and files, but a flat key namespace where listing matches on prefixes. Using / as a directory separator is actually a convention, not a rule. Check this out:

$ aws s3 ls s3://serratus-public/out/200505_zoonotic/bam/SRR999
2020-05-08 19:26:28    1520080 SRR999276.bam
2020-05-08 19:26:20    1736883 SRR999277.bam
2020-05-08 19:23:52    1042769 SRR999278.bam
2020-05-08 19:23:41    1185710 SRR999279.bam
2020-05-08 19:25:04    2127178 SRR999280.bam
2020-05-08 19:26:21    2388900 SRR999281.bam

It's very difficult to run globs the other way around: an operation like "get me all the files ending in .bam" requires listing every object under the prefix. So if you expect to filter on something like experiment ID or file type often, I'd suggest moving that to the front of the path... OR

I think having a DB file somewhere that includes the file locations and most important parameters would help a lot (path, experiment name, fit quality, number of matches). Then you could just download the DB and select files above a certain quality score in one line (of bash, python, C, R, whatever) and loop over the resulting file list, as in the sketch below. You can also restructure data in a DB a lot more easily than you can move it around on an object store (there is no move, only copy). Listing files on S3 is also really slow. Poking around a DB is a lot less painful, in my opinion.
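For instance, with a hypothetical tab-separated index of (accession, path, quality) -- none of these columns are settled yet:

awk -F'\t' '$3 >= 90 {print $2}' index.tsv | xargs -I{} aws s3 cp s3://lovelywater/{} .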

taltman commented 4 years ago

I'm a SQL junkie, so I'm pretty excited about being able to issue queries over S3 buckets like this:

https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html

So you can avoid having to create a separate DB.
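For illustration: S3 Select runs SQL against a single object rather than a whole bucket, so it pairs naturally with the INDEX.tsv idea. The bucket, key, and column positions below are hypothetical:

aws s3api select-object-content \
  --bucket lovelywater \
  --key INDEX.tsv \
  --expression "SELECT s._1 FROM s3object s WHERE CAST(s._3 AS INT) >= 90" \
  --expression-type SQL \
  --input-serialization '{"CSV": {"FieldDelimiter": "\t"}}' \
  --output-serialization '{"CSV": {}}' \
  hits.csv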

rcedgar commented 4 years ago

DBs are fine as long as they're stored as tsv files, because everyone understands them and every "real" database system (SQL) can import them. Binary formats, like arb at SILVA, are a pain to deal with if you don't have all the dependencies installed already.

ababaian commented 4 years ago

Closed by data-access wiki: https://github.com/ababaian/serratus/wiki/Access-Data-Release

The SQL database is an ongoing effort from @victorlin + Dan.