ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

Canonicalize assembly ordering and orientation #216

Closed taltman closed 3 years ago

taltman commented 3 years ago

Proposal for how to order and orient the contigs in our assemblies.

While working on two problems (annotating fragmented assemblies and annotating the whole assembly with the Pfam CoV models), I realized that one could help the other.

VADR expects that its input should be in a 'sensible' orientation, with the contigs ordered and reverse-complemented as needed, such that the orientation corresponds to the RefSeq models it uses for the annotation.

Beyond helping VADR do its job, the Law of Least Surprise would dictate that people looking at our assemblies would expect it to be in the "canonical" CoV ordering of domains, so for example, NSP1 would be near the start of the first contig, and domains like CoV_M would be towards the end of the last contig (see the ICTV website for their description of what is canonical). In fact, NCBI uses VADR for screening submissions, and it will "fail" a submission if it is in the wrong orientation.

It wasn't clear to me how to do this ordering, until I started doing the full-contig 6-frame translation for the Pfam annotations that @rcedgar needs. I decided to take the Pfam annotations, and use them to figure out the ordering and the orientation.

The script for doing so is here: https://bitbucket.org/tomeraltman/darth/src/master/src/canonicalize_contigs.sh

It is designed to work within the DARTH container, but as long as you have hmmer3 and the emboss package installed, you should be good to go.

The documentation of the inputs & outputs of the script, and the pseudo-code can be found in the README.md: https://bitbucket.org/tomeraltman/darth/src/master/

I am folding this script into DARTH, but I recommend that we also use a canonicalized assembly order for the assemblies that we end up submitting.

Currently the model used for ordering is the SARS-CoV-2 RefSeq genome, but I can easily create models for all of the GenBank reference sequences. For sequence with no close hit, we can use the ordering described in the ICTV website for the general domain structure of CoVs.

I welcome constructive feedback.

rchikhi commented 3 years ago

Nice.

Quick note: this seems to be reference-based scaffolding. The canonicalize_contigs.sh script is finding a correct order and orientation but it concatenates contigs, making the result not necessarily accurate at the base level. (i.e. contigs may actually be farther apart in the genome and should be separated by N's; or alternatively, concatenating contigs that were overlapping would actually create an artificial duplication of length <= (k-1) where k is the assembly final kmer size). Maybe this is fine for annotation, but for distribution it would be better to instead provide a multifasta of contigs with metadata regarding their order+orientation.

taltman commented 3 years ago

I concatenate the individual FASTA entries into a multi-FASTA file; I'm not concatenating the sequences themselves. Does that make sense?

taltman commented 3 years ago

Might be best with an example:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/canonicalization$ egrep "^>" SRR8389793/transeq/canonical.fna
>NODE_9_length_984_cluster_9_candidate_1_domains_2 Reversed:
>NODE_18_length_279_cluster_18_candidate_1_domains_1
>NODE_19_length_275_cluster_19_candidate_1_domains_1
>NODE_6_length_1483_cluster_6_candidate_1_domains_2 Reversed:
>NODE_8_length_1253_cluster_8_candidate_1_domains_2 Reversed:
>NODE_5_length_10986_cluster_5_candidate_1_domains_2
>NODE_16_length_275_cluster_16_candidate_1_domains_1
taltman commented 3 years ago

So you can see from the NODE number how the order has been modified, and from the "Reversed:" string, which contigs had their reverse-complement taken.

rchikhi commented 3 years ago

I concatenate the individual FASTA entries into a multi-FASTA file; I'm not concatenating the sequences themselves. Does that make sense?

Ahhh I had missed that! Then in that sense I have no complaint. So VADR was sensitive to the order and orientation of contigs in the original, unsorted multifasta file, huh?

asl commented 3 years ago

So, from the assembly standpoint here is what we could do:

ababaian commented 3 years ago

Is this resolved? I believe so?

rchikhi commented 3 years ago

yep, @taltman's canonicalize_contigs.sh script was used to make the genome organization figure and it worked well!