Closed taltman closed 3 years ago
Nice.
Quick note: this seems to be reference-based scaffolding. The canonicalize_contigs.sh
script is finding a correct order and orientation but it concatenates contigs, making the result not necessarily accurate at the base level. (i.e. contigs may actually be farther apart in the genome and should be separated by N's; or alternatively, concatenating contigs that were overlapping would actually create an artificial duplication of length <= (k-1) where k is the assembly final kmer size). Maybe this is fine for annotation, but for distribution it would be better to instead provide a multifasta of contigs with metadata regarding their order+orientation.
I concatenate the individual FASTA entries into a multi-FASTA file; I'm not concatenating the sequences themselves. Does that make sense?
Might be best with an example:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/canonicalization$ egrep "^>" SRR8389793/transeq/canonical.fna
>NODE_9_length_984_cluster_9_candidate_1_domains_2 Reversed:
>NODE_18_length_279_cluster_18_candidate_1_domains_1
>NODE_19_length_275_cluster_19_candidate_1_domains_1
>NODE_6_length_1483_cluster_6_candidate_1_domains_2 Reversed:
>NODE_8_length_1253_cluster_8_candidate_1_domains_2 Reversed:
>NODE_5_length_10986_cluster_5_candidate_1_domains_2
>NODE_16_length_275_cluster_16_candidate_1_domains_1
So you can see from the NODE number how the order has been modified, and from the "Reversed:" string, which contigs had their reverse-complement taken.
I concatenate the individual FASTA entries into a multi-FASTA file; I'm not concatenating the sequences themselves. Does that make sense?
Ahhh I had missed that! Then in that sense I have no complaint. So VADR was sensitive to the order and orientation of contigs in the original, unsorted multifasta file, huh?
So, from the assembly standpoint here is what we could do:
Is this resolved? I believe so?
yep, @taltman's canonicalize_contigs.sh
script was used to make the genome organization figure and it worked well!
Proposal for how to order and orient the contigs in our assemblies.
While working on two problems (annotating fragmented assemblies and annotating the whole assembly with the Pfam CoV models), I realized that one could help the other.
VADR expects that its input should be in a 'sensible' orientation, with the contigs ordered and reverse-complemented as needed, such that the orientation corresponds to the RefSeq models it uses for the annotation.
Beyond helping VADR do its job, the Law of Least Surprise would dictate that people looking at our assemblies would expect it to be in the "canonical" CoV ordering of domains, so for example, NSP1 would be near the start of the first contig, and domains like CoV_M would be towards the end of the last contig (see the ICTV website for their description of what is canonical). In fact, NCBI uses VADR for screening submissions, and it will "fail" a submission if it is in the wrong orientation.
It wasn't clear to me how to do this ordering, until I started doing the full-contig 6-frame translation for the Pfam annotations that @rcedgar needs. I decided to take the Pfam annotations, and use them to figure out the ordering and the orientation.
The script for doing so is here:
https://bitbucket.org/tomeraltman/darth/src/master/src/canonicalize_contigs.sh
It is designed to work within the DARTH container, but as long as you have
hmmer3
and theemboss
package installed, you should be good to go.The documentation of the inputs & outputs of the script, and the pseudo-code can be found in the README.md:
https://bitbucket.org/tomeraltman/darth/src/master/
I am folding this script into DARTH, but I recommend that we also use a canonicalized assembly order for the assemblies that we end up submitting.
Currently the model used for ordering is the SARS-CoV-2 RefSeq genome, but I can easily create models for all of the GenBank reference sequences. For sequence with no close hit, we can use the ordering described in the ICTV website for the general domain structure of CoVs.
I welcome constructive feedback.