Optimizing esl-translate for genomes where alternative start sites and/or unusual codon usage is common?

000generic commented 2 months ago

Hi!

I am using Hmmer3 with bivalve genomes looking for 70-80 aa genes. BUSCO scores for the genomes are in the upper 90s (96-98% complete for BUSCO Metazoa) and most assemblies are chromosome-scale with less than 100 to less than 2000 scaffolds in several cases. So overall, high quality I think.

Working with genome gene model proteomes and using RNAseq target gene family protein sequences from Uniprot for hmmbuild - I scanned the genomes with hmmsearch - and also with jackhmmer for three different initial sequences from three different species in the set - but in all cases and collectively, I failed to find in the genome protein gene models of a given species, sequences known for the species in its RNAseq transcriptome. This happened at scales of, for example, finding 4 when 11 are known in a species.

Given this, I then used esl-translate to do 6-frame translations and fed this into hmmsearch based on an issue thread in Hmmer's github. For example:

esl-translate /home/eric_edsinger/fhl/mytilus-2024/species24/genomes/USEME-species24-assembly-fastas/Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic.fna > output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-esl_6_frames.aa &&

hmmsearch  -o output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-ALL -A output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-ALIGNMENTS --tblout output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-HITS -E 1e-5 --incE 1e-5 --max --cpu 5 input/MFP3-build_01.hmm output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-esl_6_frames.aa

and while I do find what maybe be additional hits in some cases, including in a species that previously had zero hits by the above protein-based methods - I am noticing that in cases where a protein gene model exists, I get typically two hits one at the start and a second further down per detected gene in the genome. However in some cases - and entirely for the species that had no protein-based hits, I lack the start-site hit and get only the second downstream hit - all of this being based on the alignments produced by Hmmer3.

This has me wondering if esl-translate is missing start sites or mistranslating things in the bivalve genomes.

Along these lines, bivalves are known to have unusual synonymous codon usage bias - and to commonly use alternative start codons - so for example, one study noted in their work:

"Alternative start codons were considered functional because they are common in Bivalvia. ORFs were annotated starting from the first available start codon (ATG, ATA, or ATC) downstream of the preceding gene, and ending with the first stop codon in frame (TAA or TAG)"

I was wondering if esl-translate might be using only one or two of these three start codons - or if its weighting of all three might be biased towards genetic models or vertebrates / is it possible to have all three start codons considered with equal weight (if weighting is even involved).

Its possible something similar happened in the genome projects for their structural annotation such that proteins known in the transcriptome go missing in the final genome model calls - and this is why things are not being found by Hmmer3 in the genome gene models. But assuming the genome assemblies are fairly good - this still leaves esl-translate-hmmsearch unresolved.

In the quoted paper above they used ORF Finder to find reading frames - and I could opt for this upstream of Hmmer3 - but I like the idea of using esl, since it seems like it is being developed in parallel to Hmmer.

Its entirely possible I am misunderstanding or misstating something fundamental in all of this - I'm self-taught on it / its not my expertise yet.

Any thoughts or guidance would be greatly appreciated!

Thank you very much :)

Eric

cryptogenomicon commented 2 months ago

Having trouble understanding your issue, sorry. Can you give me a simple and reproducible example where you think it is failing?

000generic commented 2 months ago

Sure, I can work on making a simple data set later today - and maybe these additional details also help:

60 sequences went into building an hmm with hmmbuild, including 16 genes from Mytilus coruscus from Uniprot and 12 from Mytilus galloprovincialis. But when hmmsearch then scans the genomes of the two species and four others, it comes back with nothing for Mytilus coruscus but the expect 12 for Mytilus galloprovincialis.

Mytilus_californianus__LOC127702694__XP_052062955.1 - rgs_MFP_gene_family - 8e-40 140.3 21.6 8.8e-40 140.2 21.6 1.0 1 0 0 1 1 1 1 - Mytilus_californianus__LOC127702690__XP_052062952.1 - rgs_MFP_gene_family - 7.3e-38 134.0 19.2 8e-38 133.9 19.2 1.0 1 0 0 1 1 1 1 - Mytilus_californianus__LOC127702693__XP_052062954.1 - rgs_MFP_gene_family - 7.3e-38 134.0 19.2 8e-38 133.9 19.2 1.0 1 0 0 1 1 1 1 - Mytilus_californianus__LOC127702692__XP_052062953.1 - rgs_MFP_gene_family - 2.9e-37 132.1 18.8 3.2e-37 132.0 18.8 1.0 1 0 0 1 1 1 1 - Mytilus_trossulus__LOC134721760__XP_063441031.1 - rgs_MFP_gene_family - 9.1e-37 130.5 19.0 1e-36 130.3 19.0 1.0 1 0 0 1 1 1 1 - Mytilus_trossulus__LOC134721758__XP_063441028.1 - rgs_MFP_gene_family - 4.9e-36 128.2 13.5 5.3e-36 128.0 13.5 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B036592__VDI05013.1 - rgs_MFP_gene_family - 6.4e-35 124.6 18.1 7e-35 124.5 18.1 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B001446__VDI24367.1 - rgs_MFP_gene_family - 2.5e-34 122.7 15.9 2.7e-34 122.6 15.9 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B038902__VDI77097.1 - rgs_MFP_gene_family - 5.3e-33 118.4 9.4 5.8e-33 118.3 9.4 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B073440__VDI05009.1 - rgs_MFP_gene_family - 6.4e-33 118.2 11.4 7.1e-33 118.0 11.4 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B039298__VDI48369.1 - rgs_MFP_gene_family - 9.4e-33 117.6 12.8 1.2e-32 117.3 12.8 1.1 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B055877__VDI77098.1 - rgs_MFP_gene_family - 1e-32 117.5 10.4 1.1e-32 117.4 10.4 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B056840__VDI05014.1 - rgs_MFP_gene_family - 1.3e-32 117.2 15.9 1.5e-32 117.0 15.9 1.1 1 0 0 1 1 1 1 - Mytilus_trossulus__LOC134721759__XP_063441030.1 - rgs_MFP_gene_family - 3.7e-31 112.5 15.9 4.2e-31 112.4 15.9 1.0 1 0 0 1 1 1 1 - Mytilus_californianus__LOC127702698__XP_052062959.1 - rgs_MFP_gene_family - 9.4e-31 111.2 12.5 1e-30 111.1 12.5 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B055545__VDI24368.1 - rgs_MFP_gene_family - 6.4e-30 108.6 12.6 8.1e-30 108.2 12.6 1.1 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B057357__VDI24369.1 - rgs_MFP_gene_family - 1.2e-28 104.4 14.1 1.5e-28 104.2 14.1 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B030697__VDI05011.1 - rgs_MFP_gene_family - 5.2e-28 102.4 10.5 5.6e-28 102.3 10.5 1.0 1 0 0 1 1 1 1 - Mytilus_galloprovincialis__MGAL_10B089988__VDI05012.1 - rgs_MFP_gene_family - 1.3e-23 88.3 63.6 7.9e-12 50.6 35.3 2.0 1 1 1 2 2 2 2 - Mytilus_edulis__MEDL_43135__CAG2230257.1 - rgs_MFP_gene_family - 1.3e-18 72.3 12.2 1.3e-18 72.3 12.2 1.7 2 0 0 2 2 1 1 - Mytilus_galloprovincialis__MGAL_10B053430__VDI77095.1 - rgs_MFP_gene_family - 3.9e-08 38.7 1.8 5.5e-08 38.3 1.8 1.2 1 0 0 1 1 1 1 - Mytilus_trossulus__LOC134721756__XP_063441026.1 - rgs_MFP_gene_family - 3.5e-06 32.5 0.8 6.8e-06 31.6 0.8 1.4 1 0 0 1 1 1 1 -

So Mytilus coruscus has gene family members known from I'm guessing an RNAseq transcriptome that made their way into Uniprot, where I found them - but the genes appear to be absent in the genome gene models.

When esl-translate ORFs of Mytilus coruscus are searched, it comes back as:

# STOCKHOLM 1.0
#=GF ID MFP3
#=GF AU hmmsearch (HMMER 3.4)

#=GS orf20428287/16-62 DE [subseq from] source=CM029600.1 coords=30413301..30413486 length=62 frame=3 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf20427840/20-66 DE [subseq from] source=CM029600.1 coords=30385651..30385848 length=66 frame=1 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22799393/16-69 DE [subseq from] source=CM029600.1 coords=30294787..30294560 length=76 frame=6 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22799735/6-50  DE [subseq from] source=CM029600.1 coords=30274542..30274372 length=57 frame=4 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22800242/6-50  DE [subseq from] source=CM029600.1 coords=30244915..30244745 length=57 frame=6 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf20428478/7-51  DE [subseq from] source=CM029600.1 coords=30425142..30425315 length=58 frame=3 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22798969/1-44  DE [subseq from] source=CM029600.1 coords=30320692..30320534 length=53 frame=6 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22798463/17-60 DE [subseq from] source=CM029600.1 coords=30351327..30351055 length=91 frame=4 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf22799147/54-95 DE [subseq from] source=CM029600.1 coords=30309723..30309439 length=95 frame=4 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf20427777/13-34 DE [subseq from] source=CM029600.1 coords=30382349..30382456 length=36 frame=2 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence
#=GS orf20428257/13-34 DE [subseq from] source=CM029600.1 coords=30411528..30411635 length=36 frame=3 desc=Mytilus coruscus isolate M156 linkage group LG06, whole genome shotgun sequence

orf20428287/16-62         -------------------------QYYGPNYNYPRLYRG-..KYNGYNGYPRG.YGWNKGWNKGWNKGRWGRKYY-
#=GR orf20428287/16-62 PP .........................79***********99...***********.*******************98.
orf20427840/20-66         -------------------------QYYGPNYNYPRRYGG-..KYNGYKGYPRG.YGWNKGWKKGWNRGRWGRKYY-
#=GR orf20427840/20-66 PP .........................79***********87...***********.*******************98.
orf22799393/16-69         -----------------FSFQLFLFQGYDLGYNAPWPYNNGyyGYNGYNGYHGR.YG----WNKGWNSGPWGGSYY-
#=GR orf22799393/16-69 PP .................5566666789**************999**********.**....**************9.
orf22799735/6-50          --------------------------GYDLGYNAPWPYNNGyyGYNGYNGYHGR.YG----WNKGWNSGPWGGSYY-
#=GR orf22799735/6-50  PP ..........................79*************999**********.**....**************9.
orf22800242/6-50          --------------------------GYDLGYNAPWPYNNGyyGYNGYNGYHGR.YG----WNKGWNSGPWGGSYY-
#=GR orf22800242/6-50  PP ..........................79*************999**********.**....**************9.
orf20428478/7-51          -------------------------QDYYPGYNSPLQYNGY.yGYSGYNGYHGR.YG----WNKGWNNGPWGGSYY-
#=GR orf20428478/7-51  PP .........................58************97.79**********.**....**************9.
orf22798969/1-44          --------------------------GYDPDFNSHWPYNNGyhGYNRYTGYHGS.YG----WNKGWNNGPWGRSY--
#=GR orf22798969/1-44  PP ..........................59************88888888888888.88....************97..
orf22798463/17-60         -------------------------QGYDPDFNSPWPFNNDyhGSNGYTGYHRS.YG----WNKGWNNGPWEEY---
#=GR orf22798463/17-60 PP .........................58*************766789********.**....*********9875...
orf22799147/54-95         -------------------------QNYYPNSNYPRRYKG-..NYN--NGYPTRnYG----WNKGWNKGRWRRKYY-
#=GR orf22799147/54-95 PP .........................689**********99...665..888877588....*************98.
orf20427777/13-34         MNNLSIAVLVALVLIGSFAVQS-------------------..-----------.----------------------
#=GR orf20427777/13-34 PP 9********************9.......................................................
orf20428257/13-34         MNNLSVAVLVALVLIGSFAVQS-------------------..-----------.----------------------
#=GR orf20428257/13-34 PP 9********************9.......................................................
#=GC PP_cons              9****************8899866768************99..999******99.******************999.
#=GC RF                   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxx
//

It seems like there are more downstream hits within genes - and only 2 start site hits - when I would expect in most cases 2 hits per gene. Its possible the starts are just very divergent but given that the genes are also absent in the gene models from the genome project - I started wondering if start sites are an issue in both cases.

When I align all the coruscus Uniprot sequences to themselves, sequence similiarity is in the 90s for most of the sequences - and in the 70s and 80s for gallopro sequences to themselves. So it seems like the sequences should be readily detected if presented to the hmm.

Thus, for both the protein models and the esl-translate, I am wondering if the full or partial sequences are not being presented to the hmm - and if the missing sequences might be due to ORF calls in both cases - and might be similarly due to unusual codon usage stats in bivalves.

cryptogenomicon commented 2 months ago

Still not understanding the issue, sorry - you're including way too much detail for me to follow. I really need a clear, simple, reproducible example if you think we have a bug or issue in the code, please.

EddyRivasLab / easel

Optimizing esl-translate for genomes where alternative start sites and/or unusual codon usage is common? #77