ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Tracking and processing of putative detections #78

Closed · rcedgar closed this issue 4 years ago

rcedgar commented 4 years ago

We got some good hits from the first run, including these groups:

GroupA: SRR10951654-655
GroupB: SRR10829951-958

We need a system for tracking, processing and annotation of putative detections, which already looks to be quite complicated.

Most basic is simply a list of SRA accessions with a few key meta-data fields, which might include these:

(+) Accession of the most similar full-length genome in GenBank and (+) average %id. This requires post-processing of the dataset because the mapping reference is clustered at 99% identity and therefore may not contain the most similar genome.

(+) Coverage across the genome and (+) average read depth. The "pan_genome" record in the summarizer output includes this and more; I would suggest including this record.

(+) Group identifier corresponding to GroupA and GroupB above. Runs within each of groups A and B look very similar to each other, though the two groups look different from each other. It looks like the runs in each group contain the same virus, in which case they could be combined to get greater read depth, quite likely enough to get a good de novo assembly in the case of GroupA, which otherwise has very low coverage (~3x). Finding groups could perhaps be automated by clustering of summary reports or something like that (a rough sketch is at the end of this comment), but in the short term it will surely be done manually.

(+) Links to assemblies on S3. Short-term at least, we will probably generate multiple assemblies for each dataset (and/or group) using different methods. We need to define a directory structure on S3 so that assemblies for a given SRA or group can be found and compared. For the detection list, this could be a sub-directory name, which would be either an SRA accession or a group identifier. Sub-directories under this would be named according to some convention (one possibility is sketched at the end of this comment).

(+) Strategy. I think we will need a few different pipelines for processing a detection depending on read depth, divergence and perhaps other factors. "Strategy" is the name of the post-processing pipeline to be used for this detection. Short term, this is a placeholder because we don't have any post-processing pipeline ready to run. Medium term, it will be assigned manually. Longer term, it might be possible to automate.

(+) Comments (up to a few words).

(+) Link to discussion page (for more extensive comments).

This looks well-suited to a shared spreadsheet, e.g. a Google Sheet.
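
As a concrete starting point, here is a minimal sketch of the tracking table written as a TSV template that could be imported into a shared sheet. The column names are only illustrative placeholders for the fields listed above, not an agreed schema.

```python
# Minimal sketch: write an empty TSV template with the proposed tracking
# columns, suitable for import into a shared spreadsheet.
# Column names are illustrative placeholders, not a fixed schema.
import csv

COLUMNS = [
    "sra_accession",       # SRA run accession, e.g. SRR10951654
    "group_id",            # e.g. GroupA / GroupB; blank or the accession if ungrouped
    "best_genbank_match",  # most similar full-length GenBank genome (needs post-processing)
    "avg_pct_id",          # average %id to that genome
    "coverage",            # coverage across the genome (from the pan_genome record)
    "avg_read_depth",      # average read depth (from the pan_genome record)
    "assembly_s3_prefix",  # S3 sub-directory holding assemblies for this run or group
    "strategy",            # name of the post-processing pipeline to apply
    "comments",            # up to a few words
    "discussion_link",     # link to a discussion page for longer comments
]

def write_template(path="detections.tsv"):
    """Write a header-only TSV that can be pasted into a Google Sheet."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh, delimiter="\t").writerow(COLUMNS)

if __name__ == "__main__":
    write_template()
```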
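
On automating the grouping: a rough sketch, assuming each run's summary report can first be reduced to a set of top reference accessions (that parsing step is omitted here). Runs are merged by single-linkage whenever their reference sets overlap above a Jaccard threshold; the threshold is a guess.

```python
# Sketch of group finding by clustering summary reports.
# Assumes summaries have already been reduced to sets of top reference
# accessions per run; the similarity threshold is arbitrary.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of reference accessions."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def group_runs(run_refs, min_sim=0.5):
    """run_refs: dict of SRA accession -> set of top reference accessions.
    Returns groups of runs (single-linkage over pairwise Jaccard >= min_sim)."""
    parent = {run: run for run in run_refs}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r1, r2 in combinations(run_refs, 2):
        if jaccard(run_refs[r1], run_refs[r2]) >= min_sim:
            parent[find(r1)] = find(r2)  # merge the two clusters

    groups = {}
    for run in run_refs:
        groups.setdefault(find(run), []).append(run)
    return list(groups.values())

# Example with made-up reference sets:
runs = {
    "SRR10951654": {"refA", "refB"},
    "SRR10951655": {"refA", "refB", "refC"},
    "SRR10829951": {"refX"},
}
print(group_runs(runs))  # [['SRR10951654', 'SRR10951655'], ['SRR10829951']]
```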
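
And one possible convention for the assembly sub-directories on S3, with the bucket name and layout as placeholders only:

```python
# Hypothetical S3 layout: one sub-directory per SRA accession or group,
# then one sub-directory per assembly method. Bucket and path names are
# placeholders, not an agreed convention.
BUCKET = "s3://example-serratus-bucket"

def assembly_prefix(detection_id, method):
    """detection_id: SRA accession (e.g. SRR10951654) or group id (e.g. GroupA).
    method: assembly method name, e.g. 'spades' or 'megahit'."""
    return f"{BUCKET}/assemblies/{detection_id}/{method}/"

# Example prefixes:
#   s3://example-serratus-bucket/assemblies/GroupA/spades/
#   s3://example-serratus-bucket/assemblies/SRR10829951/megahit/
```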

rcedgar commented 4 years ago

Tracking is de facto implemented by the accumulated summary files; closing this issue.