ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Identify characteristic domains in Cov genomes #196

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

For best taxonomy assignment and tree-building, we would ideally identify these "domains" per #195:

pp1ab: ADRP, nsp5 (3CLpro), nsp12 (RdRp), nsp13 (Hel), nsp14 (ExoN), nsp15 (NendoU) and nsp16 (O-MT)

What does "domain" mean here? This is an imprecise term as far as I'm aware. Should we attempt this, or is it impractical? Are there PFAM HMMs for these domains that would be credible to virologists?

ababaian commented 4 years ago

These would be the functional proteins, i.e. the cleaved products. This is exactly the standard we should hold the criteria up against. Let's do it this way.

rcedgar commented 4 years ago

Um yes, that would be nice, but how do we get the a.a. translations for these things...? Overlapping ORFs, frameshifts, conflicting nomenclature, which PFAM HMM etc.? I've spent a lot of time on this and do not have a solution, can you be more specific?

ababaian commented 4 years ago

pp1ab: Polyprotein 1ab spans both ORF1a and ORF1b (joined by a ribosomal frameshift). This will require 6-frame translation and a search for each domain (HMM) in that space.
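The 6-frame translation step mentioned above can be sketched as follows. This is a minimal illustration only (a real pipeline would use EMBOSS getorf or Biopython's Seq.translate); the codon table is the standard genetic code built by hand:

```python
# Minimal 6-frame translation sketch: translate a nucleotide contig in all
# three forward and three reverse-complement frames, producing the a.a.
# space that the per-domain HMM search would scan. Illustrative only.

BASES = "TCAG"
# Standard genetic code laid out in TCAG order (TTT, TTC, TTA, TTG, TCT, ...)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq: str) -> str:
    # Unknown/ambiguous codons become 'X'; trailing 1-2 nt are dropped.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq: str):
    seq = seq.upper()
    return [translate(s[frame:]) for s in (seq, revcomp(seq)) for frame in range(3)]
```

For example, `six_frame("ATGAAATGA")[0]` gives `"MK*"`; the other five frames cover the remaining forward offsets and the reverse complement.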

asl commented 4 years ago

We seem to have all these HMMs bundled with coronaSPAdes.

rcedgar commented 4 years ago

Perfect, thanks @artem! I believe nsp15 (missing in Artem's list) is PF19215.1 in the new Cov-2 set (Pfam-A.SARS-CoV-2.hmm) but not yet on the PFAM web site. I haven't found an HMM for nsp14 (ExoN, Exonuclease), this gene is missing from UniProt as far as I can tell, though I'm not sure if the nsp gene numbers are consistent in different databases.

The 6-frame translation can be done on cli using emboss getorf.

Now we need to implement the next steps: (1) extract the a.a. sequences for those domains from the assemblies and the currently known genomes and (2) measure identity with the closest known species for the master table. @asl @taltman @rchikhi could one or both of these steps be fit into the assembly or annotation pipeline?

asl commented 4 years ago

We can do the following:

rcedgar commented 4 years ago

@asl I think the HMM alignment is exactly what we need for getting to pair-wise a.a. identities for the domains, no need to extend the ORF, we don't need a complete gene for this task and actually I don't think using the complete ORF would work anyway because the proteins are cleaved by proteases so I would assume they're missing start and/or stop codons.
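For the pairwise a.a. identities over HMM-aligned domains, the counting convention matters; below is a minimal sketch that scores identity over columns where both sequences have a residue (one common convention among several; the helper name is mine):

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither aligned
    sequence has a gap ('-'). a and b must already be aligned to equal
    length, e.g. by threading both through the same domain HMM."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)
```

For example, `pairwise_identity("MKLV", "MALV")` is 0.75; gap columns such as in `"MK-LV"` vs `"MKALV"` are simply skipped.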

ababaian commented 4 years ago

I may be incorrect but "PF19215.1 in the new Cov-2 set (Pfam-A.SARS-CoV-2.hmm)" is built based on SARS-CoV-2 sequences alone and therefore may not work with distant sequences such as Deltacoronaviruses. We'll need to validate this functions pan-coronavirus

rcedgar commented 4 years ago

You are incorrect. Here is one example from the seed alignment https://www.uniprot.org/uniprot/A0A0U2GMU3, Camel coronavirus. Generally the Pfam-A.SARS-CoV-2.hmm seed alignments are biased towards Betacov and lack some of the genes found only in non-Beta genera, but they have pretty broad Cov-erage :-)

rchikhi commented 4 years ago

As per https://github.com/ababaian/serratus/issues/196#issuecomment-657227780 I'm happy to integrate (1) into the batch pipeline if someone gives me a quick rundown of which commands would be best to implement https://github.com/ababaian/serratus/issues/196#issuecomment-657228170. For integrating point (2) from https://github.com/ababaian/serratus/issues/196#issuecomment-657227780, isn't that BLASTn vs nt?

rcedgar commented 4 years ago

For (1) it's something like this (copy-pasted from one of my own local scripts):

hmmsearch \
  -E 0.01 \
  --tblout tbl.txt \
  --domtblout domtbl.txt \
  --pfamtblout pfamtbl.txt \
  ../../pfam/Pfam-A.SARS-CoV-2.hmm \
  contigs.fa \
  > hmmsearch.txt

This gives several output files which should all be saved; I can write some python scripts to parse them.
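A sketch of one such parser, for the --domtblout file (field order as documented for HMMER: 22 whitespace-separated columns, then a free-text description; the example line in the test is synthetic):

```python
def parse_domtbl(lines):
    """Parse hmmsearch --domtblout lines into a list of per-domain hits.
    Columns (0-based): 0 target, 3 query (HMM name), 12 i-Evalue,
    17-18 alignment coordinates on the target sequence."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blanks
        f = line.split()
        hits.append({
            "target": f[0],
            "query": f[3],
            "i_evalue": float(f[12]),
            "ali_from": int(f[17]),
            "ali_to": int(f[18]),
        })
    return hits
```

The ali_from/ali_to coordinates are what we'd use to cut the domain a.a. sequences out of the 6-frame translations.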

For (2) we should use minimap2, not blastn (I think we discussed this before; we can sidebar on slack if not clear why). The command line I use is like this:

    minimap2 \
      -x map-pb \
      -N 1 \
      -a \
      -t 8 \
      ../minimap2_index/complete_genomes \
      contigs.fasta \
      > contigs_vs_complete_genomes.sam

A command like this needs to be run on three different references from cov5: (a) complete genomes, (b) RefSeq complete genomes, (c) fragments, i.e. everything which is not a complete genome. --- EDIT --- Actually for (c), we want fragments which are not close to a complete genome. This is because we want to know how many of our assemblies filled out a genome previously known only from a fragment. This needs a threshold, say 97%, for a fragment to count as different.

rcedgar commented 4 years ago

@rchikhi In the above example, Pfam-A.SARS-CoV-2.hmm should be replaced by the taxonomically informative set of HMMs given by Artem plus the one extra I found in Pfam-A.SARS-CoV-2.

rchikhi commented 4 years ago

thanks. @asl mentioned that all these HMMs are in coronaspades, so perhaps I could just use those?

taltman commented 4 years ago

I don't understand what you are trying to do here. Reinvent VADR? Sounds like more wasted effort. Please explain what you need that VADR doesn't give you.

On July 12, 2020 10:11:53 AM PDT, Rayan Chikhi notifications@github.com wrote:

thanks. @asl mentioned that all these HMMs are in coronaspades, so perhaps I could just use those?

rcedgar commented 4 years ago

For each assembly and each GenBank sequence, we need amino acid translations of the following domains: pp1ab: ADRP, nsp5 (3CLpro), nsp12 (RdRp), nsp13 (Hel), nsp14 (ExoN), nsp15 (NendoU) and nsp16 (O-MT). Note that "domain" is a conserved core region within a protein, not a full-length ORF or predicted CDS, so that it will have an approximately global alignment between distant relatives in the family. I guess we assumed VADR wouldn't do this because domains are not annotated in GB, but if it can that would be great and might save some duplicated effort.

ababaian commented 4 years ago

@taltman / @rcedgar I understood that VADR will give us orf1a and orf1b, does it provide the domain coordinates for these models?

rcedgar commented 4 years ago

No clue; I haven't looked at the VADR predictions because I don't know how to validate them. Validation looks highly non-trivial to me and needs someone who understands this ridiculously complicated genome better than I do.

rcedgar commented 4 years ago

@ababaian is it your understanding that virologists use "domain" to mean the cleaved protein product? That's not how domain is used in other contexts.

ababaian commented 4 years ago

Domain means domain like you expect. I meant that in this case it roughly corresponds to cleaved protein product.

rcedgar commented 4 years ago

I would expect domains to be less than half of the protein when we have pair-wise a.a. identities <40%. I think it might be a good idea for you to review some VADR predictions if you haven't already; you're probably the best qualified to understand what is in there and check whether it looks reasonable.

rcedgar commented 4 years ago

To explain my reasoning: As I read the ICTV web site, these domains are conserved throughout Cov. Pairwise a.a.'s of the conserved domains between highly diverged Cov's are 20 - 40%. In my experience, proteins that are this far diverged are conserved only in hydrophobic core regions and a few critical functional surface loops. The rest diverges in both sequence and structure.

taltman commented 4 years ago

@rcedgar , thanks for the clarification. I will review some of the VADR output files to see if they provide what we need here.

taltman commented 4 years ago

It turns out that for divergent CoVs, the maturation proteins do not get annotated. I opened an issue with Eric regarding forcing VADR to do so even if divergent.

The fastest thing for me to do would be to take the VADR polyprotein calls, and annotate them using the Pfam SARS-CoV-2 models. I can then use that output for extracting the amino acid sequences that @rcedgar and @Pbdas need.

taltman commented 4 years ago

OK, I have this basically working in my dev environment. I am too tired to continue, so tomorrow morning I'll package it nicely in the docker image, test it, and then send it over to @rchikhi for scale-up.

rchikhi commented 4 years ago

Ah nice, so, if I understand correctly, what @taltman will provide me implements the step (1) from https://github.com/ababaian/serratus/issues/196#issuecomment-657249214 but step (2) remains to be implemented, right?

asl commented 4 years ago

For (1) it's something like this (copy-pasted from one of my own local scripts):

Note that HMMs in Pfam-A.SARS-CoV-2.hmm include score cutoffs, so instead of using E-values it is preferable to use GA thresholds via --cut_ga

taltman commented 4 years ago

Yes, implementing it with the --cut_ga option. Thanks!

taltman commented 4 years ago

I've improved @ababaian's table on the Wiki: https://github.com/ababaian/serratus/wiki/Translation-Table-for-ORF-1ab-maturation-protein-models

Please let me know how it can be improved. Thanks!

taltman commented 4 years ago

I've run into an issue with inconsistent annotation for NSP14. I'm going to copy & paste the relevant parts of my email to Eric Nawrocki at NCBI (author of VADR and Infernal). I've reached out to Pfam and UniProt as well.


When I run VADR on the NCBI reference SARS-CoV-2 genome, it annotates an exoribonuclease. I get the following output line in the .ftr file from running VADR on the SARS-CoV-2 reference genome from NCBI:

1.36  NC_045512.2  29903  PASS  NC_045512  mat_peptide  3'-to-5'_exonuclease            1581   36    +   18040  19620        -  no        -      -        -      -    1    0                 18040..19620:+                 18040..19620:+  -

Here it's described as a 3' -> 5' exonuclease. This report from a viral taxonomy group describes it as a 3' -> 5' exoribonuclease (I believe we're talking about the same thing): https://talk.ictvonline.org/ictv-reports/ictv_9th_report/positive-sense-rna-viruses-2011/w/posrna_viruses/222/coronaviridae

... and below please find the corresponding FASTA file & defline.

When I take the FASTA file that VADR generates for the maturation protein, translate it, and then run it through InterProScan, I get this result:

https://www.ebi.ac.uk/interpro/result/InterProScan/iprscan5-R20200713-071227-0025-51262989-p2m/

If you follow the annotation link to Pfam, you'll see that the entry says that this domain was thought to be the exoribonuclease, but now it is identified as NSP11, with no known function:

https://www.ebi.ac.uk/interpro/entry/pfam/PF06471/

I noticed this same annotation discrepancy between the "COVID-19" release of UniProt, and Pfam, and I've already sent a similar email to the UniProt team.

Everyone seems to be in agreement on the annotation except Pfam, and the Pfam description doesn't seem to correspond to the citation that they list, so I'm now leaning towards thinking that this is an error in the Pfam description. But I'm not a virologist nor a protein expert, so the highest probability is that the error lies with me.


Gene sequence identified by VADR:

>NC_045512.2/mat_peptide.13/18040..19620:+
GCTGAAAATGTAACAGGACTCTTTAAAGATTGTAGTAAGGTAATCACTGGGTTACATCCT
ACACAGGCACCTACACACCTCAGTGTTGACACTAAATTCAAAACTGAAGGTTTATGTGTT
GACATACCTGGCATACCTAAGGACATGACCTATAGAAGACTCATCTCTATGATGGGTTTT
AAAATGAATTATCAAGTTAATGGTTACCCTAACATGTTTATCACCCGCGAAGAAGCTATA
AGACATGTACGTGCATGGATTGGCTTCGATGTCGAGGGGTGTCATGCTACTAGAGAAGCT
GTTGGTACCAATTTACCTTTACAGCTAGGTTTTTCTACAGGTGTTAACCTAGTTGCTGTA
CCTACAGGTTATGTTGATACACCTAATAATACAGATTTTTCCAGAGTTAGTGCTAAACCA
CCGCCTGGAGATCAATTTAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAAT
GTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACACACTTAAAAATCTCTCTGACAGA
GTCGTATTTGTCTTATGGGCACATGGCTTTGAGTTGACATCTATGAAGTATTTTGTGAAA
ATAGGACCTGAGCGCACCTGTTGTCTATGTGATAGACGTGCCACATGCTTTTCCACTGCT
TCAGACACTTATGCCTGTTGGCATCATTCTATTGGATTTGATTACGTCTATAATCCGTTT
ATGATTGATGTTCAACAATGGGGTTTTACAGGTAACCTACAAAGCAACCATGATCTGTAT
TGTCAAGTCCATGGTAATGCACATGTAGCTAGTTGTGATGCAATCATGACTAGGTGTCTA
GCTGTCCACGAGTGCTTTGTTAAGCGTGTTGACTGGACTATTGAATATCCTATAATTGGT
GATGAACTGAAGATTAATGCGGCTTGTAGAAAGGTTCAACACATGGTTGTTAAAGCTGCA
TTATTAGCAGACAAATTCCCAGTTCTTCACGACATTGGTAACCCTAAAGCTATTAAGTGT
GTACCTCAAGCTGATGTAGAATGGAAGTTCTATGATGCACAGCCTTGTAGTGACAAAGCT
TATAAAATAGAAGAATTATTCTATTCTTATGCCACACATTCTGACAAATTCACAGATGGT
GTATGCCTATTTTGGAATTGCAATGTCGATAGATATCCTGCTAATTCCATTGTTTGTAGA
TTTGACACTAGAGTGCTATCTAACCTTAACTTGCCTGGTTGTGATGGTGGCAGTTTGTAT
GTAAATAAACATGCATTCCACACACCAGCTTTTGATAAAAGTGCTTTTGTTAATTTAAAA
CAATTACCATTTTTCTATTACTCTGACAGTCCATGTGAGTCTCATGGAAAACAAGTAGTG
TCAGATATAGATTATGTACCACTAAAGTCTGCTACGTGTATAACACGTTGCAATTTAGGT
GGTGCTGTCTGTAGACATCATGCTAATGAGTACAGATTGTATCTCGATGCTTATAACATG
ATGATCTCAGCTGGCTTTAGCTTGTGGGTTTACAAACAATTTGATACTTATAACCTCTGG
AACACTTTTACAAGACTTCAG

rchikhi commented 4 years ago

Command like this needs to be run on three different references from cov5: (a) complete genomes, (b) refseq complete genomes, (c) fragments, i.e. everything which is not a complete genome. --- EDIT --- Actually for (c), we want fragments which are not close to a complete genome. This is because we want to know how many of our assemblies filled out a genome previously known only from a fragment. This needs a threshold, say 97%, for the fragment to be different.

Done. cov5.fa had:

And after 97% clustering (and dropping fragments too similar to the refs):

Data is here: https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.complete.uclust0.99.fa https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.frag.97p.uclust0.99.fa

Let me know quickly if something looks off, because I'll run analysis for the master table based on it.

My commands were:

python split_cov5.py
cat cov5.rs.fa cov5.gb.fa > cov5.complete.fa
./usearch11.0.667_i86linux32  -cluster_fast cov5.complete.fa -id 0.99 -centroids cov5.complete.uclust0.99.fa
\time minimap2 -c -t 16 cov5.complete.uclust0.99.fa cov5.frag.fa > cov5.frag.fa.paf
python paf_97p_drop.py
./usearch11.0.667_i86linux32  -cluster_fast cov5.frag.97p.fa -id 0.97 -centroids cov5.frag.97p.uclust0.99.fa

with scripts available here.

Intermediate result: the 3509 fragments were obtained by clustering 10444 fragments which did not match the refs at >=97% id AND >=90% alignment length (out of the original 34306 fragments).
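The drop logic in paf_97p_drop.py presumably looks something like the sketch below (my reconstruction, not the actual script). In PAF, column 10 is the number of matching bases and column 11 the alignment block length, so identity is col10/col11; fragment coverage is taken here as the aligned query span over the query length:

```python
def fragments_to_drop(paf_lines, min_id=0.97, min_cov=0.90):
    """Return query (fragment) names that align to a complete genome at
    >= min_id identity over >= min_cov of the fragment's length.
    PAF columns (0-based): 0 qname, 1 qlen, 2 qstart, 3 qend,
    9 matching bases, 10 alignment block length."""
    drop = set()
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        qname, qlen = f[0], int(f[1])
        qstart, qend = int(f[2]), int(f[3])
        matches, blocklen = int(f[9]), int(f[10])
        if blocklen == 0 or qlen == 0:
            continue
        if matches / blocklen >= min_id and (qend - qstart) / qlen >= min_cov:
            drop.add(qname)
    return drop
```

The surviving fragments would then be the FASTA entries whose names are not in the returned set.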

rcedgar commented 4 years ago

For the master table, complete genomes only, don't worry about fragments, they won't be included.

rchikhi commented 4 years ago

according to https://github.com/ababaian/serratus/issues/193#issuecomment-657126752, they are?

rcedgar commented 4 years ago

Yes, we need a database of fragments so you can align assemblies to it, you're right.

Note we need a tsv with the clustering results so that we know which cluster each sequence belongs to; I think that's missing from your uclust command line. The easiest way to get it is the -uc option, e.g. -uc cov5.complete.uclust0.99.uc.
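For downstream use, the .uc file reduces to a sequence-to-centroid map; a sketch against the documented 10-column .uc layout (S records seed a new cluster, H records are hits assigned to a seed; column 9 is the query label, column 10 the centroid label):

```python
def uc_to_centroids(lines):
    """Map each sequence label to its cluster centroid from .uc records.
    'S' records are their own centroid; 'H' records point at the
    centroid named in the last column. 'C' summary records are ignored."""
    centroid_of = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            centroid_of[f[8]] = f[8]
        elif f[0] == "H":
            centroid_of[f[8]] = f[9]
    return centroid_of
```

This gives exactly the per-sequence cluster membership needed for the master table.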

rcedgar commented 4 years ago

For minimap2, I don't see your -x option, and the documentation doesn't seem to say what the default is; I use -x map-pb. If you generally get long alignments covering the whole genome it's probably fine.

rchikhi commented 4 years ago

For minimap2, I don't see your -x option, and the documentation doesn't seem to say what the default is; I use -x map-pb. If you generally get long alignments covering the whole genome it's probably fine.

The default is k=15 without homopolymer compression, which seems to correspond exactly to -x map-ont (as per the code, which sets k to 15, though k was already 15 by default). -x map-pb enables homopolymer compression and sets k=19, a different sensitivity trade-off (and does nothing else). It's really not clear we should use map-pb when mapping sequences which aren't PacBio reads.

rchikhi commented 4 years ago

Note we need a tsv with the clustering results so that we know which cluster each sequence belongs to, I think that's missing from your uclust command line, the easiest way to do this is to use the -uc option, e.g. -uc cov5.complete.uclust0.99.uc.

Done, uc's are in https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.complete.uclust0.99.uc https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.frag.97p.uclust0.99.uc

rcedgar commented 4 years ago

Seems this issue is fully resolved, tricky one, great teamwork!