ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Identify complete assemblies #174

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

Cat A assemblies (putative complete) are identified as single contigs >25 kb. These may not be complete genomes, because there are several examples of partial genomes that are >30 kb. Rodent CoVs appear to be especially long, e.g. these rat CoV ORF1a CDSs: KY370051.1, KY370049.1, and this rat CoV "partial genome": KF850449.2. There are many non-rodent examples in the range 25-30 kb.

How to classify assemblies as complete / partial?

ababaian commented 4 years ago

There are essentially three criteria which I propose to define "complete genome"

1) > 25 kb contig

2) At the 5' end we should be able to identify a 'Leader Sequence', possibly via a nucleotide HMM of the 5' UTR from known CoV complete genomes. This would be up to the first ATG or the first 100 nucleotides.

3) The 3' end is a bit tricky due to poly-A trimming, but it can be validated if we can confirm that reads are trimmed at this point due to poly-A repeats.
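
Criterion (1) and the region definition in (2) can be sketched in Python. This is only a minimal illustration, not code from the Serratus pipeline: the function names `passes_length_filter` and `five_prime_region` are hypothetical, and "up to the first ATG or the first 100 nucleotides" is read here as "whichever is shorter", which is an assumption. The leader HMM search itself and the read-level poly-A check in (3) are not shown.

```python
MIN_COMPLETE_LEN = 25_000  # criterion (1): contig longer than 25 kb


def passes_length_filter(contig: str, min_len: int = MIN_COMPLETE_LEN) -> bool:
    """Return True if the contig is strictly longer than min_len nucleotides."""
    return len(contig) > min_len


def five_prime_region(contig: str, max_len: int = 100) -> str:
    """Return the candidate 5' UTR to search with a leader HMM.

    Assumption: the region is everything before the first ATG, capped at
    max_len bases. If no ATG occurs within the first max_len bases, the
    first max_len bases are returned.
    """
    seq = contig.upper()
    atg = seq.find("ATG")
    if atg == -1 or atg > max_len:
        return seq[:max_len]
    return seq[:atg]
```

The extracted region would then be the input to whatever leader model is built from known complete CoV genomes.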

rcedgar commented 4 years ago
  2. I could try to do this one. @ababaian can you confirm whether we need this for GB submission or for our own annotation?

  3. Why would the assembler clip reads? This is not mapping. Do you expect a long pure poly-A exactly at the 3' end? Most of the Cat A's don't have this; some of them have messy ends with some poly-A at or near the end, e.g.

>SRR9763527.coronaspades.NODE_1_length_27854_cluster_1_candidate_1_domains_23
...
TAGAGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CAAAAAAAACAAAAAAACGGATAGTCTTTTCCTATGGAAAACTATTTTTCA
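
A contig-level way to describe such messy ends is to report the longest run of A's near the 3' end and how far it sits from the end. The Python sketch below is a rough proxy only: the function name `longest_a_run_near_end` and the 150-base window are hypothetical choices, and it inspects the contig alone rather than the reads.

```python
import re
from typing import Tuple


def longest_a_run_near_end(contig: str, window: int = 150) -> Tuple[int, int]:
    """Find the longest run of A's within the last `window` bases.

    Returns (run_length, distance_from_end), where distance_from_end is the
    number of bases between the end of the run and the 3' end of the contig.
    Returns (0, 0) if the window contains no A.
    """
    seq = contig.upper()
    tail = seq[max(0, len(seq) - window):]
    best_len, best_dist = 0, 0
    for match in re.finditer(r"A+", tail):
        run = match.end() - match.start()
        if run > best_len:
            best_len = run
            best_dist = len(tail) - match.end()
    return best_len, best_dist
```

A clean poly-A tail gives a distance of 0; a long run followed by trailing non-A sequence, as in the contig above, gives a nonzero distance.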

ababaian commented 4 years ago

Sorry, "clip" should be "trim"; this is a QC step in assembly. I've edited the comment above.

Edit: I don't know whether we need this for GB, but it will be important for our own metrics, as we need to classify 'complete' versus new.

rcedgar commented 4 years ago

@ababaian not clear where we are on this. Are there open issues here? Who to assign?

ababaian commented 4 years ago

Well, (1) is trivially solved already. You've volunteered to solve problem (2), and (3) we'll need someone else to tackle.

I think if we get (2), then this can be rolled into the annotation pipeline. (3) may be trickier to implement.

rcedgar commented 4 years ago

Gotcha. I wasn't clear whether (2) was necessary. I will assign myself.

rcedgar commented 4 years ago

Closing, replaced by #175 and #176.