ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Top candidates from first Zoonotic run for assembler testing #89

Closed: rcedgar closed this 4 years ago

rcedgar commented 4 years ago

Here are the top candidates I found in a preliminary analysis:

zoonotic_candidates.tsv

Fields are:

  1. SRA accession.
  2. Genbank accession of the most similar full-length genome in the 99% redundancy subset.
  3. Reference genome length.
  4. Average identity of reads with the genome.
  5. Description of the reference genome.
  6. Taxonomy of the reference genome in the Genbank record.
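
In case it's useful, a minimal sketch of loading the table in R; the column names are my own labels for the six fields above, not part of the file:

```r
# Minimal sketch: load zoonotic_candidates.tsv into R.
# Column names are my own labels for the six fields, not part of the file.
cand <- read.delim("zoonotic_candidates.tsv", header = FALSE,
                   col.names = c("sra_acc", "genbank_acc", "ref_len",
                                 "avg_pctid", "ref_desc", "ref_tax"))
head(cand[order(-cand$avg_pctid), ])   # top candidates by average read identity
```
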
rcedgar commented 4 years ago

Here are draft reference-based assemblies for the 136 datasets listed above:

https://drive5.com/tmp/ass.tz

There are three directories:

ass_fa/     FASTA format
ass_aln/    Human-readable alignment generated by the assembler
ass_blast/  BLAST alignment of the FASTA to the reference, as a check on my alignment

The top row of the human-readable alignment looks like this:

###OOOOOOoooooo:::::::::......... .........::::::oOOOOOOO

The symbols indicate read depth:

  .  one read
  :  two reads
  o  3-4 reads
  O  4-7 reads
  #  8+ reads

The example above is typical of what I've seen in several assemblies. Notice the strong fluctuation in depth, from at least 8 reads down to 1 or none. Hence my lobbying on today's call to provide a per-base quality metric for the assemblies, because downstream analyses typically assume that a base call is good.
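For concreteness, a sketch of rendering such a depth row in R (blank for zero coverage is my inference from the example row, and since the o/O ranges overlap at 4 in the legend, I assume O means 5-7):

```r
# Sketch: render per-base read depth as the legend symbols above.
# Blank = no reads (inferred from the example row); the legend overlaps at
# depth 4 (o 3-4, O 4-7), so I assume O covers 5-7 here.
depth_row <- function(depth) {
  sym <- cut(depth, breaks = c(-Inf, 0, 1, 2, 4, 7, Inf),
             labels = c(" ", ".", ":", "o", "O", "#"))
  paste(as.character(sym), collapse = "")
}
depth_row(c(0, 1, 2, 3, 5, 8, 12, 5, 2, 1))   # " .:oO##O:."
```
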

rcedgar commented 4 years ago

The above assemblies were generated by a fully automated protocol as a proof of concept that this can be implemented in an HPC framework in reasonable time. The assembler is a very quick and dirty hack, but the contigs should be good enough to serve as a sanity check for de novo efforts. De novo assemblies will also be a check on my reference assemblies -- comparisons solicited!

ababaian commented 4 years ago

Robert,

I've been looking at more high-divergence hits; take a look at ERR2756788 and let me know what you think. [Screenshot from 2020-05-15 19-21-17]

EDIT: That bat (I've named him Frank, btw) has a friend: ERR3569452 is the same virus (same variants in the alignment).

ababaian commented 4 years ago

I've been using a plot of divergence vs. hits to focus on divergent hits above a certain hit threshold (when you see it you'll understand why). This required removing the 'noise' from blacklisted hits.

I was going through the accessions/libraries with <85% identity and >100 read hits.

Pre-filtering Scatter Plot

pctID_v_hits_pre-filter.pdf

Post-filtering Scatter Plot

pctID_v_hits_post-filter.pdf

There's still a bit more work to do here on clean-up, but if ERR2756788 is in fact a divergent CoV, then we'll need to think carefully about scoring based on the distribution of hits within that library.

See SRA Entry
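
Roughly, the selection amounts to the following sketch (the merged summary table and its column names are my assumptions, not the actual summary format):

```r
# Sketch of the pctid-vs-hits selection, assuming a merged hits table with
# columns 'acc', 'pctid', and 'reads' (names are mine, not the summary format).
library(ggplot2)
hits <- read.delim("summary_hits.tsv")                # hypothetical merged table
candidates <- subset(hits, pctid < 85 & reads > 100)  # divergent + plentiful
ggplot(hits, aes(pctid, reads)) +
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  geom_point(data = candidates, colour = "red") +
  labs(x = "% identity", y = "read hits")
```
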

rcedgar commented 4 years ago

This looks like a good hit to me. Cov2m + the new summarizer would find it. If there are many like this, that supports your argument for including fragments in the reference. I don't understand your scatterplots. What I would like to see is a histogram of dataset counts binned by identity (100, 99, ..., 80) to a reference with full-length-only genomes (FLOM). If this histogram is flat or has a fat tail, then there is a strong case for including metagenomic fragments. If it has a thin tail, then fragments may be more trouble than they're worth.
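Something like this sketch, reusing the assumed hits table from the sketch above:

```r
# Sketch: dataset counts binned by best identity (100, 99, ..., 80) against a
# full-length-only reference. A fat tail at low identity argues for fragments.
per_dataset <- aggregate(pctid ~ acc, data = hits, FUN = max)  # best identity per dataset
counts <- table(cut(per_dataset$pctid, breaks = seq(80, 100, by = 1)))
barplot(counts, las = 2, xlab = "% identity bin", ylab = "datasets")
```
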

ababaian commented 4 years ago

That's what the scatter plot is showing for cov2m. Remember how in the simulated-divergence experiment we saw map% drop as a function of divergence? Here we see a drop in hits (even in a very coarse measurement of total hits) with increasing divergence. Therefore divergent + many hits (i.e. the outliers) is where we find novel viruses.

rcedgar commented 4 years ago

Still not really following, but if you are seeing a fat tail with cov2m then that is a very promising result, because it suggests we can find many novel, highly diverged CoVs (many = more than I was expecting, anyway). I guess that's what you're trying to say and I'm having difficulty following the details.

Assuming this result, I totally agree we should push sensitivity by using reference fragments. But then the difficulty of implementing an analysis pipeline and generating a high-quality data repository in 6 weeks looks daunting / borderline impossible, because we need (1) automated detection with a fragmented reference, followed by (2) HPC de novo assembly against hosts at varying distances from known genomes, which from my perspective looks like a non-trivial research problem that is not solvable in the time available. Maybe the next priority is to focus on validating and documenting your result and seeking more funding from fast-moving entrepreneurial / non-traditional grants to buy the additional time needed?

rcedgar commented 4 years ago

@ababaian can you explain in more detail how you made the scatterplot/histogram and how you interpret it, preferably in prose rather than pointing to a notebook? This analysis looks tricky to me. The naive way is to parse the pan_genome lines in s3://serratus-public/out/200505_zoonotic/summary/, but those look to be worthless (my responsibility on that point, of course). With luck the next summarizer might enable this analysis, but I'm doubtful; I'm expecting I'll need to use the tinyhit files pending further refinement of the summarizer. Even if the new summary reports are better, they'll still need some careful post-processing to measure coverage on CoV only and to check for bad (black) regions which cause spurious coverage. Based on my CoV-2 multiple alignment, I expect hits at 80% identity to have coverage of only one well-conserved gene, in which case I think we'll need some manual analysis to convince ourselves that low-coverage / low-identity hits are true positives, and to rule out difficult cases such as CoVs in the reference set which have incorporated mammal genes. IMO we need to understand a bit more about the phylogenetics of the virus before we can automate detection at high divergence, doubly so if we use a fragmented reference.

ababaian commented 4 years ago

So I do exactly that: I parse every single hit in the .summary files into R. I can then graph (and therefore group) each accession's hits. The scatter plot is pctid vs. hits, the idea being that divergent viruses will have poor read identity but relatively many hits. The "step-wise" drop in coverage/identity is, I /think/, reads leaking from one reference into another. For instance, with 100,000 SARS-CoV-2 reads, some will leak into MERS at a fixed rate.

The first pass was a bit of a wash because of the 'junk' hits, but by focusing on the quadrant with <85% identity and >100 read hits, I can identify them quickly.

I then added a blacklist to filter out this junk, and was left with a few libraries that had divergent and plentiful matches.

I haven't wrapped my head around how to automate a process like this, but it will help refine the genome very quickly, and hopefully with a bit more work it can give us some 'criteria' for defining putative CoV+ libraries for assembly.

rcedgar commented 4 years ago

Excellent, this is exactly the kind of approach I was thinking of. I was focusing on higher identities first to make sure I could assemble easier cases. You're digging the tunnel from the other end; hopefully we will meet in the middle.

ababaian commented 4 years ago

See also SRR3292629; this is a 'clean' hit which should get picked up.

rcedgar commented 4 years ago

Agreed. Both ERR2756788 (score=86) and SRR3292629 (score=100) are in my top zoonotic candidates from which I picked some representatives as the assembler test cases.

rcedgar commented 4 years ago

@ababaian said: "For instance with 100,000 SARS-CoV-2 reads, some will leak into MERS at a fixed rate." Probably due to bowtie2's randomized tie-breaking:

"Bowtie 2’s search for alignments for a given read is “randomized.” That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment."

Bowtie2 manual
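
If that's the mechanism, the leak rate should indeed be roughly fixed: a read that matches k references equally well lands on any given one with probability 1/k. A toy simulation (the parameters are assumptions, not measured values):

```r
# Toy simulation of randomized tie-breaking: each ambiguous read is assigned
# uniformly at random among its k equally-good references, so a fixed fraction
# (~(k-1)/k of the ambiguous reads) "leaks" away from the true source.
set.seed(1)
k <- 3        # assumed number of equally-good references per ambiguous read
n <- 100000   # ambiguous reads from the true source
assigned <- sample(1:k, n, replace = TRUE)
mean(assigned != 1)   # ~0.667, i.e. (k-1)/k leak away from reference 1
```
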

ababaian commented 4 years ago

We knew this would happen, right? This is why we went to a common pan-genome.

Also found a sick kitty :'( ~SRR6662917~

Edit: Crossed my wires. SRR7287114 + SRR7287110, and it comes up well in the summarizer :)

rcedgar commented 4 years ago

Ferrets are kitties? My latest summarizer thinks this one is noise due to unmasked black regions.

taltman commented 4 years ago

What are the next steps for this topic? Can we refine this into a set of deliverables?

rcedgar commented 4 years ago

The purpose of the issue was to announce deliverables :-) We are de facto using GitHub as a sort of Slack, and we're making up protocols for announcements and discussions as we go along. Maybe announcements are better suited to Slack, but FWIW I am a novice Slack user and so far I strongly dislike it and don't understand how to use it productively. My learning curve is exploding exponentially right now and I'm trying to flatten the curve :-) Back on topic: I will close the issue because it seems to have served its non-issue purpose.

ababaian commented 4 years ago

This is an inventory of top hits. I think we've gone through the zoonotic1 dataset now; we need to centralize this list of hits as Robert has done, add the few he may have missed, and hand it off to the assembly team. I propose creating a wiki page to hold this data, since it is relevant long term.

We don't yet have a central database to hold metadata like hits, contigs, etc.

rcedgar commented 4 years ago

I am closing this issue and punting to @ababaian to create a Wiki / open an issue for Wiki creation.

ababaian commented 4 years ago

K I'll add this to my TODO and use this issue then :P

taltman commented 4 years ago

Hi @ababaian , I made a stub for this on the Serratus Assembly Wiki page:

https://github.com/ababaian/serratus/wiki/Serratus-Assembly#candidate-samples-for-discovery-and-pipeline-development

taltman commented 4 years ago

@rcedgar I hear you regarding the learning curve. Here's a break-down that has been useful for me when coordinating work on GitHub on other projects:

My $0.02!

rcedgar commented 4 years ago

"Place where persistent documentation and information lives." Conflicts with doc/ in the repo, which in Artem's convention is notebooks? This stuff should go in CONTRIBUTING. It's not easy to grasp, but I can see the need for some organization here. I don't know where to put things or look for things so there is wasted and duplicated effort, I'm sure I'm not the only one.

taltman commented 4 years ago

@ababaian What is the scope of the doc/ directory in the 'serratus' repo? On my other projects, only code-specific documentation goes in the source code repo in the doc or docs top-level directory. But higher-level project documentation would go in the Wiki.

@rcedgar The Wiki is also a Git repo, FYI.

ababaian commented 4 years ago

Serratus Documentation

1) notebook/: Experimental notes and run notes. Independent of the true codebase.

2) wiki/: A git repo containing documentation and long-term scientific notes: descriptions of where data is, how to access it, and so on. Almost all of CONTRIBUTING.md needs to be migrated to the wiki if it has not been already.

3) doc/ (deprecated): Currently some var documentation. Once the wiki is set up we can remove this, as it's redundant and less accessible than the wiki. Documentation for code will live in src/(module_name)/README.md so it's easily readable per module on GitHub.

4) issues: A task- or goal-oriented set of work which needs to be completed. Communication and/or debate when making a critical decision can and should go into an issue, as long as there is a resolvable answer that can close the issue.

5) chat: Informal comms.

ababaian commented 4 years ago

Is there anything additional needed regarding this issue, or are we good to close it?

rcedgar commented 4 years ago

Good to close. As owner, I'm closing.