Closed by rcedgar 4 years ago
There are essentially three criteria that I propose for defining a "complete genome":
1) The contig is > 25 kb.
2) At the 5' end we should be able to identify a 'Leader Sequence', possibly via a nucleotide HMM of the 5' UTR from known CoV complete genomes. This would cover the sequence up to the first ATG or the first 100 nucleotides.
3) The 3' end is a bit tricky due to poly-A trimming, but it can be validated if we can confirm that reads are trimmed at this point due to poly-A repeats.
I could try to do this one. @ababaian can you confirm we need this for GB submission or our own annotation?
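A rough sketch of how criterion (1) and the input for criterion (2) might be computed, in plain Python. The function names are hypothetical, and reading "up to the first ATG or the first 100 nucleotides" as "whichever comes first" is an assumption, not something settled in this thread. The HMM search itself is not shown; this only extracts the region such a model would be run against.

```python
MIN_LENGTH = 25_000  # criterion (1): contig must be longer than 25 kb


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_length(seq, min_length=MIN_LENGTH):
    """Criterion (1): True if the contig is strictly longer than min_length."""
    return len(seq) > min_length


def five_prime_region(seq, max_len=100):
    """Candidate leader region for criterion (2).

    Assumption: take the sequence up to the first ATG, capped at
    max_len nucleotides (whichever comes first).
    """
    seq = seq.upper()
    atg = seq.find("ATG")
    end = max_len if atg == -1 else min(atg, max_len)
    return seq[:end]
```

The region returned by `five_prime_region` could then be passed to whatever leader-sequence model ends up being used.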
Why would the assembler clip reads? This is not mapping. Do you expect a long pure poly-A exactly at the 3' end? Most of the Cat A's don't have this; some of them have messy ends with some poly A at or near the end, e.g.
>SRR9763527.coronaspades.NODE_1_length_27854_cluster_1_candidate_1_domains_23
...
TAGAGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CAAAAAAAACAAAAAAACGGATAGTCTTTTCCTATGGAAAACTATTTTTCA
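Given messy ends like the one above, a 3' check probably cannot require a pure poly-A at the very last base. A minimal sketch of a more tolerant test, assuming we look for the longest run of A within a window at the 3' end and report how many bases follow it. The function name, the 150 nt window and the 20 nt minimum run are illustrative assumptions, not values agreed in this thread.

```python
import re


def polya_near_3prime(seq, window=150, min_run=20):
    """Find the longest run of A in the last `window` bases of a contig.

    Returns a dict with the run length, the number of bases between the
    end of that run and the end of the contig, and whether the run
    reaches `min_run`. The thresholds are illustrative assumptions.
    """
    tail = seq.upper()[-window:]
    best = None
    for match in re.finditer(r"A+", tail):
        # Keep the first run encountered among equally long runs.
        if best is None or len(match.group()) > len(best.group()):
            best = match
    if best is None:
        return {"run_length": 0, "bases_after_run": None, "has_polya": False}
    run_length = len(best.group())
    return {
        "run_length": run_length,
        "bases_after_run": len(tail) - best.end(),
        "has_polya": run_length >= min_run,
    }
```

A contig with a long A run followed by a short messy suffix would still be flagged, while `bases_after_run` shows how far the run sits from the actual end.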
Sorry, "clip" should be "trim"; this is a QC step in assembly. I've edited the comment above.
Edit: I don't know if we need this for GB or not, but for our own metrics this will be important as we need to classify 'complete' versus new
@ababaian not clear where we are on this. Are there open issues here? Who to assign?
Well, (1) is trivially solved already. You've volunteered to solve problem (2), and (3) we'll need someone else to tackle.
I think if we get (2) then this can be rolled into the annotation pipeline. (3) may be trickier to implement.
Gotcha. I wasn't clear if (2) was necessary, I will assign myself.
Closing, replaced by #175 and #176.
Cat A assemblies (putative complete) are identified as single contigs >25 kb. These may not be complete genomes, because there are several examples of partial genomes which are >30 kb. Rodent CoVs appear to be especially long, e.g. these rat CoV ORF1a CDSs: KY370051.1, KY370049.1, and this rat CoV "partial genome": KF850449.2. There are many non-rodent examples in the range 25-30 kb.
How to classify assemblies as complete / partial?