Closed rcedgar closed 4 years ago
These would be the functional proteins, i.e. the cleaved products. This is exactly the standard we should hold the criteria up against. Let's do it this way.
Um yes, that would be nice, but how do we get the a.a. translations for these things...? Overlapping ORFs, frameshifts, conflicting nomenclature, which PFAM HMM etc.? I've spent a lot of time on this and do not have a solution, can you be more specific?
pp1ab: Polyprotein 1AB is both ORF1a and ORF1b (frameshift). This will require 6-frame translations and search for each domain (HMM) in that space.
nsp3 (ADRP) ADP-ribose-1″-phosphatase (macrodomain); RNA-binding Pfam Link
nsp5 (3CLpro) Pfam Link
nsp12 (RdRp) RNA Polymerase Pfam Link
nsp13 (Hel) Helicase Pfam Link
nsp14 (ExoN) Exonuclease Pfam Link
nsp15 (NendoU) Hexameric uridylate-specific endoribonuclease Pfam Link
nsp16 (O-MT) RNA 2-O-methyltransferase Pfam Link
We seem to have all these HMMs bundled with coronaSPAdes.
Perfect, thanks @artem! I believe nsp15 (missing in Artem's list) is PF19215.1 in the new Cov-2 set (Pfam-A.SARS-CoV-2.hmm) but not yet on the PFAM web site. I haven't found an HMM for nsp14 (ExoN, Exonuclease), this gene is missing from UniProt as far as I can tell, though I'm not sure if the nsp gene numbers are consistent in different databases.
The 6-frame translation can be done on cli using emboss getorf.
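For illustration of what a six-frame translation produces, here is a minimal pure-Python sketch (standard genetic code only, no ORF calling; this is not a replacement for getorf, just a picture of the search space):

```python
# Six-frame translation sketch (illustrative only; EMBOSS getorf is the
# recommended tool for the real pipeline). Standard codon table in TCAG
# order; stops are rendered as '*', incomplete codons as 'X'.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in _BASES
                                  for b in _BASES
                                  for c in _BASES), _AMINO))

def revcomp(seq):
    """Reverse complement of a DNA sequence (ACGT alphabet)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate one frame, codon by codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Three forward-frame and three reverse-complement-frame translations."""
    seq = seq.upper()
    rc = revcomp(seq)
    return [translate(seq[f:]) for f in range(3)] + \
           [translate(rc[f:]) for f in range(3)]
```

Each HMM is then searched against all six translations, which is what makes the frameshifted pp1ab tractable without calling the ribosomal slippage site explicitly.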
Now we need to implement the next steps: (1) extract the a.a. sequences for those domains from the assemblies & the currently known genomes and (2) measure identity with the closest known species for the master table. @asl @taltman @rchikhi could one or both of these steps be fit into the assembly or annotation pipeline?
We can do the following:
@asl I think the HMM alignment is exactly what we need for getting to pair-wise a.a. identities for the domains, no need to extend the ORF, we don't need a complete gene for this task and actually I don't think using the complete ORF would work anyway because the proteins are cleaved by proteases so I would assume they're missing start and/or stop codons.
I may be incorrect but "PF19215.1 in the new Cov-2 set (Pfam-A.SARS-CoV-2.hmm)" is built based on SARS-CoV-2 sequences alone and therefore may not work with distant sequences such as Deltacoronaviruses. We'll need to validate that this functions pan-coronavirus.
You are incorrect. Here is one example from the seed alignment https://www.uniprot.org/uniprot/A0A0U2GMU3, Camel coronavirus. Generally the Pfam-A.SARS-CoV-2.hmm seed alignments are biased towards Betacov and lack some of the genes found only in non-Beta genera, but they have pretty broad Cov-erage :-)
As per https://github.com/ababaian/serratus/issues/196#issuecomment-657227780 I'm happy to integrate (1) into the batch pipeline if someone gives me a quick rundown of which commands would be best to implement https://github.com/ababaian/serratus/issues/196#issuecomment-657228170. For integrating point (2) from https://github.com/ababaian/serratus/issues/196#issuecomment-657227780, isn't that BLASTn vs nt?
For (1) it's something like this (copy-pasted from one of my own local scripts):
hmmsearch \
-E 0.01 \
--tblout tbl.txt \
--domtblout domtbl.txt \
--pfamtblout pfamtbl.txt \
../../pfam/Pfam-A.SARS-CoV-2.hmm \
contigs.fa \
> hmmsearch.txt
This gives several output files which should all be saved; I can write some python scripts to parse them.
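A parser for the `--domtblout` file could be sketched like this (the per-domain table is whitespace-delimited with 22 fixed columns followed by a free-text description; field names follow the HMMER User Guide, the function name is hypothetical):

```python
def parse_domtbl(lines):
    """Parse hmmsearch --domtblout output into a list of dicts (sketch).

    Columns (0-based): 0 target, 3 query/HMM name, 6 full-seq E-value,
    12 per-domain i-Evalue, 15-16 hmm from/to, 17-18 ali from/to,
    22 free-text description.
    """
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        f = line.split(None, 22)  # 22 fixed columns + description
        hits.append({
            "target": f[0],
            "query": f[3],
            "full_evalue": float(f[6]),
            "dom_ievalue": float(f[12]),
            "hmm_from": int(f[15]),
            "hmm_to": int(f[16]),
            "ali_from": int(f[17]),
            "ali_to": int(f[18]),
            "description": f[22] if len(f) > 22 else "",
        })
    return hits
```

The `ali_from`/`ali_to` coordinates are what would be needed later for cutting the domain out of the translated contig.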
For (2) we should use minimap2, not blastn (I think we discussed this before; we can sidebar on slack if not clear why). The command line I use is like this:
minimap2 \
-x map-pb \
-N 1 \
-a \
-t 8 \
../minimap2_index/complete_genomes \
contigs.fasta \
> contigs_vs_complete_genomes.sam
Command like this needs to be run on three different references from cov5: (a) complete genomes, (b) refseq complete genomes, (c) fragments, i.e. everything which is not a complete genome. --- EDIT --- Actually for (c), we want fragments which are not close to a complete genome. This is because we want to know how many of our assemblies filled out a genome previously known only from a fragment. This needs a threshold, say 97%, for the fragment to be different.
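For turning the SAM output of the minimap2 command above into per-contig identities against the closest reference, a rough sketch (hypothetical helper; identity computed as (aligned length - NM) / aligned length, which counts indels as differences, and assumes minimap2's NM tag is present, which it is with -a):

```python
import re

def best_identity_per_query(sam_lines):
    """From minimap2 SAM output, keep the best hit per query.

    Returns {query_name: (reference_name, identity)}. Sketch only;
    identity = (aligned_length - NM) / aligned_length, where aligned
    length sums the M/I/D/=/X operations of the CIGAR string.
    """
    best = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header
        f = line.rstrip("\n").split("\t")
        qname, flag, rname, cigar = f[0], int(f[1]), f[2], f[5]
        if flag & 4:
            continue  # unmapped
        nm = None
        for tag in f[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
        if nm is None:
            continue
        alen = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                   if op in "MID=X")
        ident = (alen - nm) / alen
        prev = best.get(qname)
        if prev is None or ident > prev[1]:
            best[qname] = (rname, ident)
    return best
```

With -N 1 there can be secondary alignments; taking the maximum identity per query sidesteps having to filter on the SAM flag bits.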
@rchikhi In the above example, Pfam-A.SARS-CoV-2.hmm should be replaced by the taxonomically informative set of HMMs given by Artem plus the one extra I found in Pfam-A.SARS-CoV-2.
thanks. @asl mentioned that all these HMMs are in coronaspades, so perhaps I could just use those?
I don't understand what you are trying to do here. Are you reinventing VADR? Sounds like more wasted effort. Please explain what you need that VADR doesn't give you.
For each assembly and each GenBank sequence, we need amino acid translations of the following domains: pp1ab: ADRP, nsp5 (3CLpro), nsp12 (RdRp), nsp13 (Hel), nsp14 (ExoN), nsp15 (NendoU) and nsp16 (O-MT). Note that "domain" is a conserved core region within a protein, not a full-length ORF or predicted CDS, so that it will have an approximately global alignment between distant relatives in the family. I guess we assumed VADR wouldn't do this because domains are not annotated in GB, but if it can that would be great and might save some duplicated effort.
@taltman / @rcedgar I understood that VADR will give us orf1a and orf1b, does it provide the domain coordinates for these models?
No clue, I haven't looked at the VADR predictions because I don't know how to validate them, validation looks highly non-trivial to me & needs someone who understands the ridiculously complicated genome better than I do.
@ababaian is it your understanding that virologists use "domain" to mean the cleaved protein product? That's not how domain is used in other contexts.
Domain means domain like you expect. I meant that in this case it roughly corresponds to cleaved protein product.
I would expect domains to be less than half of the protein when we have pair-wise a.a. identities <40%. I think it might be a good idea for you to review some VADR predictions if you haven't already; you're probably the best qualified to understand what is in there and to check whether it looks reasonable.
To explain my reasoning: As I read the ICTV web site, these domains are conserved throughout Cov. Pairwise a.a.'s of the conserved domains between highly diverged Cov's are 20 - 40%. In my experience, proteins that are this far diverged are conserved only in hydrophobic core regions and a few critical functional surface loops. The rest diverges in both sequence and structure.
@rcedgar , thanks for the clarification. I will review some of the VADR output files to see if they provide what we need here.
It turns out that for divergent CoVs, the maturation proteins do not get annotated. I opened an issue with Eric about forcing VADR to annotate them even when the sequence is divergent.
The fastest thing for me to do would be to take the VADR polyprotein calls, and annotate them using the Pfam SARS-CoV-2 models. I can then use that output for extracting the amino acid sequences that @rcedgar and @Pbdas need.
OK, I have this basically working in my dev environment. I am too tired to continue, so tomorrow morning I'll package it nicely in the docker image, test it, and then send it over to @rchikhi for scale-up.
Ah nice, so, if I understand correctly, what @taltman will provide me implements the step (1) from https://github.com/ababaian/serratus/issues/196#issuecomment-657249214 but step (2) remains to be implemented, right?
For (1) it's something like this (copy-pasted from one of my own local scripts):
Note that HMMs in Pfam-A.SARS-CoV-2.hmm include score cutoffs, so instead of using E-values it is preferable to use GA thresholds via --cut_ga
Yes, implementing it with the --cut_ga option. Thanks!
I've improved @ababaian's table on the Wiki: https://github.com/ababaian/serratus/wiki/Translation-Table-for-ORF-1ab-maturation-protein-models
Please let me know how it can be improved. Thanks!
I've experienced an issue with inconsistent annotation for NSP14. I'm going to copy & paste the relevant parts of my email to Eric Nawrocki at NCBI (author of VADR, Infernal). I've reached out to Pfam and UniProt as well.
When I run VADR on the NCBI reference SARS-CoV-2 genome, it annotates an exoribonuclease. I get the following output line in the .ftr file from running VADR on the SARS-CoV-2 reference genome from NCBI:
1.36 NC_045512.2 29903 PASS NC_045512 mat_peptide 3'-to-5'_exonuclease 1581 36 + 18040 19620 - no - - - - 1 0 18040..19620:+ 18040..19620:+ -
Here it's described as a 3' -> 5' exonuclease. This report from a viral taxonomy group describes it as a 3' -> 5' exoribonuclease (I believe we're talking about the same thing): https://talk.ictvonline.org/ictv-reports/ictv_9th_report/positive-sense-rna-viruses-2011/w/posrna_viruses/222/coronaviridae
... and below please find the corresponding FASTA file & defline.
When I take the FASTA file that VADR generates for the maturation protein, translate it, and then run it through InterProScan, I get this result:
https://www.ebi.ac.uk/interpro/result/InterProScan/iprscan5-R20200713-071227-0025-51262989-p2m/
If you follow the annotation link to Pfam, you'll see that the entry says that this domain was thought to be the exoribonuclease, but now it is identified as NSP11, with no known function:
https://www.ebi.ac.uk/interpro/entry/pfam/PF06471/
I noticed this same annotation discrepancy between the "COVID-19" release of UniProt, and Pfam, and I've already sent a similar email to the UniProt team.
Everyone seems to be in agreement on the annotation except Pfam, and the Pfam description doesn't seem to correspond to the citation that they list, so I'm now leaning towards thinking that this is an error in the Pfam description. But I'm not a virologist nor a protein expert, so the highest probability is that the error lies with me.
Gene sequence identified by VADR:
>NC_045512.2/mat_peptide.13/18040..19620:+
GCTGAAAATGTAACAGGACTCTTTAAAGATTGTAGTAAGGTAATCACTGGGTTACATCCT
ACACAGGCACCTACACACCTCAGTGTTGACACTAAATTCAAAACTGAAGGTTTATGTGTT
GACATACCTGGCATACCTAAGGACATGACCTATAGAAGACTCATCTCTATGATGGGTTTT
AAAATGAATTATCAAGTTAATGGTTACCCTAACATGTTTATCACCCGCGAAGAAGCTATA
AGACATGTACGTGCATGGATTGGCTTCGATGTCGAGGGGTGTCATGCTACTAGAGAAGCT
GTTGGTACCAATTTACCTTTACAGCTAGGTTTTTCTACAGGTGTTAACCTAGTTGCTGTA
CCTACAGGTTATGTTGATACACCTAATAATACAGATTTTTCCAGAGTTAGTGCTAAACCA
CCGCCTGGAGATCAATTTAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAAT
GTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACACACTTAAAAATCTCTCTGACAGA
GTCGTATTTGTCTTATGGGCACATGGCTTTGAGTTGACATCTATGAAGTATTTTGTGAAA
ATAGGACCTGAGCGCACCTGTTGTCTATGTGATAGACGTGCCACATGCTTTTCCACTGCT
TCAGACACTTATGCCTGTTGGCATCATTCTATTGGATTTGATTACGTCTATAATCCGTTT
ATGATTGATGTTCAACAATGGGGTTTTACAGGTAACCTACAAAGCAACCATGATCTGTAT
TGTCAAGTCCATGGTAATGCACATGTAGCTAGTTGTGATGCAATCATGACTAGGTGTCTA
GCTGTCCACGAGTGCTTTGTTAAGCGTGTTGACTGGACTATTGAATATCCTATAATTGGT
GATGAACTGAAGATTAATGCGGCTTGTAGAAAGGTTCAACACATGGTTGTTAAAGCTGCA
TTATTAGCAGACAAATTCCCAGTTCTTCACGACATTGGTAACCCTAAAGCTATTAAGTGT
GTACCTCAAGCTGATGTAGAATGGAAGTTCTATGATGCACAGCCTTGTAGTGACAAAGCT
TATAAAATAGAAGAATTATTCTATTCTTATGCCACACATTCTGACAAATTCACAGATGGT
GTATGCCTATTTTGGAATTGCAATGTCGATAGATATCCTGCTAATTCCATTGTTTGTAGA
TTTGACACTAGAGTGCTATCTAACCTTAACTTGCCTGGTTGTGATGGTGGCAGTTTGTAT
GTAAATAAACATGCATTCCACACACCAGCTTTTGATAAAAGTGCTTTTGTTAATTTAAAA
CAATTACCATTTTTCTATTACTCTGACAGTCCATGTGAGTCTCATGGAAAACAAGTAGTG
TCAGATATAGATTATGTACCACTAAAGTCTGCTACGTGTATAACACGTTGCAATTTAGGT
GGTGCTGTCTGTAGACATCATGCTAATGAGTACAGATTGTATCTCGATGCTTATAACATG
ATGATCTCAGCTGGCTTTAGCTTGTGGGTTTACAAACAATTTGATACTTATAACCTCTGG
AACACTTTTACAAGACTTCAG
Command like this needs to be run on three different references from cov5: (a) complete genomes, (b) refseq complete genomes, (c) fragments, i.e. everything which is not a complete genome. --- EDIT --- Actually for (c), we want fragments which are not close to a complete genome. This is because we want to know how many of our assemblies filled out a genome previously known only from a fragment. This needs a threshold, say 97%, for the fragment to be different.
Done. cov5.fa
had:
And after 97% clustering (and dropping fragment too similar to refs):
Data is here: https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.complete.uclust0.99.fa https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.frag.97p.uclust0.99.fa
Let me know quickly if something looks off, because I'll run analysis for the master table based on it.
My commands were:
python split_cov5.py
cat cov5.rs.fa cov5.gb.fa > cov5.complete.fa
./usearch11.0.667_i86linux32 -cluster_fast cov5.complete.fa -id 0.99 -centroids cov5.complete.uclust0.99.fa
\time minimap2 -c -t 16 cov5.complete.uclust0.99.fa cov5.frag.fa > cov5.frag.fa.paf
python paf_97p_drop.py
./usearch11.0.667_i86linux32 -cluster_fast cov5.frag.97p.fa -id 0.97 -centroids cov5.frag.97p.uclust0.99.fa
with scripts available here.
Intermediate result: the 3509 fragments were obtained by clustering 10444 fragments which did not match the ref at >=97% id AND >=90% alignment length (out of the original 34306 fragments).
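That filter can be sketched directly from the PAF columns (col 10 = matching bases, col 11 = alignment block length, cols 3-4 = query start/end, col 2 = query length; a simplified guess at what paf_97p_drop.py does, not the actual script):

```python
def fragments_to_drop(paf_lines, min_id=0.97, min_cov=0.90):
    """Return names of fragments too similar to a complete genome.

    A fragment is dropped when some alignment reaches >= min_id identity
    (matches / alignment block length) over >= min_cov of the fragment
    length (query end - query start, relative to query length).
    Sketch based on the PAF format; the real paf_97p_drop.py may differ.
    """
    drop = set()
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        qname, qlen = f[0], int(f[1])
        qstart, qend = int(f[2]), int(f[3])
        matches, block = int(f[9]), int(f[10])
        if block == 0 or qlen == 0:
            continue
        identity = matches / block
        coverage = (qend - qstart) / qlen
        if identity >= min_id and coverage >= min_cov:
            drop.add(qname)
    return drop
```

The surviving fragments are the "previously known only from a fragment" set that the assemblies are compared against.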
For the master table, complete genomes only, don't worry about fragments, they won't be included.
according to https://github.com/ababaian/serratus/issues/193#issuecomment-657126752, they are?
Yes, we need a database of fragments so you can align assemblies to it, you're right.
Note we need a tsv with the clustering results so that we know which cluster each sequence belongs to, I think that's missing from your uclust command line; the easiest way to do this is to use the -uc option, e.g. -uc cov5.complete.uclust0.99.uc.
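Reading the -uc output into a sequence-to-cluster map is straightforward (sketch; in the .uc format the record type is in column 1, the cluster number in column 2, and the query label in column 9, with S = seed/centroid records, H = hit records, C = per-cluster summaries):

```python
def read_uc(lines):
    """Map each sequence label to its cluster number from a .uc file.

    Sketch: S records are cluster seeds, H records are member hits;
    C records summarize clusters and repeat the seed label, so they
    are skipped to avoid double-counting.
    """
    cluster = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] in ("S", "H"):
            cluster[f[8]] = int(f[1])
    return cluster
```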
For minimap2, I don't see your -x option, and the docs don't seem to say what the default is; I use -x map-pb. If you generally get long alignments covering the whole genome it's probably fine.
The default is k=15 without homopolymer compression. It seems to correspond exactly to -x map-ont (as per the code, which sets k to 15, but k was already 15 by default). -x map-pb enables homopolymer compression and sets k=19, which is a different sensitivity trade-off (and does nothing else). It's really not clear whether we should use map-pb when mapping sequences which aren't PacBio reads.
Note we need a tsv with the clustering results so that we know which cluster each sequence belongs to, I think that's missing from your uclust command line; the easiest way to do this is to use the -uc option, e.g. -uc cov5.complete.uclust0.99.uc.
Done, uc's are in https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.complete.uclust0.99.uc https://serratus-public.s3.amazonaws.com/seq/cov5/cov5.frag.97p.uclust0.99.uc
Seems this issue is fully resolved, tricky one, great teamwork!
For best taxonomy assignment and tree-building, we would ideally identify these "domains" per #195:
pp1ab: ADRP, nsp5 (3CLpro), nsp12 (RdRp), nsp13 (Hel), nsp14 (ExoN), nsp15 (NendoU) and nsp16 (O-MT)
What does "domain" mean here? This is an imprecise term as far as I'm aware. Should we attempt this, or is it impractical? Are there PFAM HMMs for these domains that would be credible to virologists?