drostlab / orthologr

Genome wide orthology inference and dNdS estimation
https://drostlab.github.io/orthologr/
GNU General Public License v2.0

ERROR: number of input seqs differ (aa: 1; nuc: 2)!! #32

Open madzafv opened 2 years ago

madzafv commented 2 years ago

Hello, I'm running dNdS() on the CDS of 2 species containing 13486 orthologous pairs, but dN/dS is calculated for only 1754 of them. The rest run into this error:

ERROR: number of input seqs differ (aa: 1; nuc: 2)!!

I'm running the program as follows:

rm(list=ls())
library(orthologr);
getwd();
workingDir = "/users/mfariasv/data/mfariasv/aligned_newBFV2/dNdS/"
setwd(workingDir);
args = commandArgs(trailingOnly=TRUE)
query = args[1]
subject = args[2]
res <- dNdS( query , subject ,
                 ortho_detection = "RBH",
                 seq_type = "cds",
                 aa_aln_type     = "multiple",
                 aa_aln_tool     = "clustalo",
                 codon_aln_tool  = "pal2nal",
                 dnds_est.method = "YN",
                 comp_cores      = 1,
                 store_locally = TRUE)
write.csv(res, gsub(".fa","ZF.dNdS", basename(args[2])))

The program runs:

Starting orthology inference (RBH) and dNdS estimation (YN) using the follwing parameters:
query = 'ZFcdsorth.fa'
subject = 'BFcdsorth.fa'
seq_type = 'cds'
e-value: 1E-5
aa_aln_type = 'multiple'
aa_aln_tool = 'clustalo'
comp_cores = '1'

Creating folder 'orthologr_alignment_files' to store alignment files ...
Starting Orthology Inference ...
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.

You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.

Building a new DB, current time: 01/24/2022 19:25:22
New DB name:   /tmp/RtmpUFtcuE/_blast_db/blastdb_BFcdsorth.fa_protein.fasta
New DB title:  blastdb_BFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.380335 seconds.
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.

You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.

Building a new DB, current time: 01/24/2022 20:21:27
New DB name:   /tmp/RtmpUFtcuE/_blast_db/blastdb_ZFcdsorth.fa_protein.fasta
New DB title:  blastdb_ZFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.404176 seconds.
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.

You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.
Orthology Inference Completed.
Starting dN/dS Estimation ...

ERROR: number of input seqs differ (aa: 1;  nuc: 2)!!

   aa  'A1CF'
   nuc 'A1CF A1CF'
*****************************************************************
Function: Parse fasta file with aligned pairwise sequences into AXT file
Reference: Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 2006 , 4:259-263.
Web Link: Documentation, example and updates at <http://code.google.com/p/kaks-calculator>
*****************************************************************

I noticed that all the orthologous pairs for which the error DOES NOT occur have different names in the two species:

[mfariasv@login005 dNdS]$ head BFcdsorthZF.dNdS
"","query_id","subject_id","dN","dS","dNdS","method","perc_identity","num_ident_matches","alig_length","mismatches","gap_openings","n_gaps","pos_match","ppos","q_start","q_end","q_len","qcov","qcovhsp","s_start","s_end","s_len","evalue","bit_score","score_raw"
"1","ABCF2","LOC110475106",0.000859565,0.0378507,0.0227094,"YN",99.801,501,502,1,0,0,501,99.8,52,553,553,100,91,123,624,624,0,1051,2719
[mfariasv@login005 dNdS]$ sed -n -e '/ABCF2/,/>/ p' ZFcdsorth.fa
>ABCF2
ATGCCCTCCGACCTGGCCAAAAAGAAGGCGGCCAAGAAGAAGGAGGCGGCCAAGGCCCGG
CAGCGGCCGCGCCGGGTCCCGGACGAGAACGGTGATGCCGGGACGGAGCCGCAGGAAGTC
CGGTCCCCGGAGGCCAACGGGACGGTGCTGCCAGGGAAATCCATGCTTTTGTCAGCTATT
GGGAAGCGAGAAGTGCCTATCCCAGAGCACATTGACATCTATCACCTGACCCGAGAGATG
CCTCCCAGTGACAAGACCCCTCTGCAGTGTGTGATGGAAGTGGATACAGAGAGGGCCATG
TTGGAGCGAGAAGCGGAACGTTTAGCTCATGAAGATGCGGAATGTGAGAAACTCCTGGAG
TTATATGAACGCCTGGAGGAGCTGGATGCTGATAAGGCAGAAGCACGAGCCTCACGTATC
CTTCACGGCTTGGGGTTCACACCGGCCATGCAGAGGAAGAAGCTGAAGGACTTCAGTGGT
GGCTGGCGAATGAGGGTGGCCCTTGCCAGAGCGCTCTTCATTCGGCCTTTCATGCTGCTG
CTTGATGAGCCCACAAACCACCTTGACCTGGATGCCTGTGTGTGGTTGGAGGAAGAGCTG
AAAACGTTCAAGCGGATTCTTGTGCTGATATCCCACTCCCAGGACTTCCTGAATGGCGTC
TGCACCAACATCATCCACATGCACAACCGCAAACTTAAGTACTACACGGGAAATTATGAT
CAGTATGTAAAGACTCGCTTAGAACTAGAAGAAAATCAAATGAAGCGATTCCACTGGGAG
CAAGATCAGATTGCTCATATGAAGAATTACATTGCACGATTTGGCCATGGTAGTGCGAAG
CTGGCCAGGCAAGCTCAGAGCAAGGAGAAGACCCTTCAAAAAATGATGGCTTCTGGCTTG
ACAGAGAGAGTTGTGAATGATAAGACTTTATCATTCTACTTTCCACCCTGTGGGAAAATT
CCCCCTCCTGTCATCATGGTGCAGAATGTCAGCTTCAGATACACCAAGGATGGGCCATGG
ATCTATAATAACCTGGAGTTTGGGATTGATCTGGATACTCGTGTAGCTCTTGTTGGACCC
AATGGAGCTGGAAAGTCAACACTGCTGAAACTGCTCACAGGAGAGCTGCTGCCCACAGAT
GGGATGATTCGCAAGCACTCACATGTGAAGATCGGTAGATACCACCAGCACTTGCAAGAG
CAGTTGGACTTAGACCTCTCACCATTGGAGTACATGCTGAAATGCTACCCAGAGATCAAG
GAGAAGGAGGAGATGAGGAAAATCATTGGCAGATACGGTTTGACAGGGAAGCAGCAGGTG
AGCCCCATCAGGAACCTCTCTGATGGGCAGAAGTGCCGTGTGTGCTTTGCCTGGCTGGCC
TGGCAGAACCCTCACATGCTCTTCCTGGACGAGCCCACCAACCACCTGGACATAGAAACC
ATAGATGCACTGGCAGATGCTATCAATGAGTTCGAGGGAGGAATGATGCTTGTCAGCCAT
GACTTCAGACTCATCCAACAGGTTGCACAGGAAATCTGGGTCTGTGAGAAGCAGACAATC
GCCAAGTGGCAAGGGGACATCCTTGCCTACAAGGAGCATCTCAAGTCGAAGCTGGTGGAT
GAGGACCCGCAGCTCACCAAACGGACCCACAATGTGTGA
>ABCG1
[mfariasv@login005 dNdS]$ sed -n -e '/LOC110475106/,/>/ p' BFcdsorth.fa
>LOC110475106
ATGCCCTCCGACCTGGCCAAGAAGAAGGCGGCCAAGAAGAAGGAGGCGGCCAAGGCCCGG
CAGCGGCCGCGCCGGGTCCCGGACGAGAACGGTGATGCCGGGACGGAGCCGCAGGAAGTC
CGGTCCCCGGAGGCCAACGGGACGGTGCTACCAGAGGTGGATGCTCTTACAAAGGAGCTG
GAGGATTTTGAGTTAAAGAAAGCTGCTGCCCGAGCCGTGACAGGAGTGCTGGCCTCCCAC
CCCAACAGCACTGATGTGCATATCATCAACCTCTCACTGACCTTTCATGGCCAGGAGCTG
CTGAGTGACACCAAACTGGAGCTGAACTCTGGGAGACGCTATGGCCTGATTGGACTCAAT
GGGATTGGGAAATCCATGCTTTTGTCAGCTATTGGGAAGCGAGAAGTGCCCATCCCAGAG
CACATTGACATCTATCACCTGACCCGAGAGATGCCTCCCAGTGACAAGACCCCTCTGCAG
TGTGTGATGGAAGTGGATACAGAGAGGGCCATGTTGGAGCGAGAAGCGGAACGTTTAGCT
CATGAAGATGCGGAATGTGAGAAACTCCTGGAGTTATATGAACGCCTGGAGGAGCTGGAT
GCTGATAAGGCAGAAGCACGAGCCTCACGTATCCTTCATGGCTTGGGGTTCACGCCGGCC
ATGCAGAGGAAGAAGCTGAAGGACTTCAGTGGTGGCTGGCGAATGAGGGTGGCCCTTGCC
AGAGCGCTCTTCATTCGGCCTTTCATGCTGCTGCTCGATGAGCCCACAAACCACCTTGAC
CTGGATGCCTGTGTGTGGTTGGAGGAAGAGCTGAAAACGTTCAAGCGGATTCTTGTGCTG
ATATCCCACTCCCAGGACTTCCTGAATGGTGTCTGCACCAACATCATCCACATGCACAAC
CGCAAACTTAAGTACTACACGGGAAATTATGATCAGTATGTAAAGACACGCTTAGAACTA
GAAGAAAATCAAATGAAGCGATTCCACTGGGAGCAAGATCAGATTGCTCATATGAAGAAT
TACATTGCACGATTTGGCCATGGTAGTGCGAAGCTGGCCAGGCAAGCTCAGAGCAAGGAG
AAGACCCTTCAAAAAATGATGGCTTCTGGCTTGACAGAGAGAGTTGTGAATGATAAGACT
TTATCATTCTACTTTCCACCCTGTGGGAAAATTCCCCCTCCTGTCATCATGGTGCAGAAT
GTCAGCTTCAGATACACCAAGGATGGGCCATGGATCTATAATAACCTGGAGTTTGGGATT
GACCTGGATACTCGTGTAGCTCTTGTTGGACCCAATGGAGCTGGAAAGTCAACCCTGCTG
AAACTGCTCACAGGAGAGCTGCTGCCCACAGATGGGATGATTCGCAAGCACTCGCATGTG
AAGATCGGTAGATACCACCAGCACTTGCAAGAGCAGTTGGACTTAGACCTCTCACCATTA
GAGTACATGCTGAAATGCTACCCAGAGATCAAGGAGAAGGAGGAGATGAGGAAAATCATT
GGCAGATACGGTTTGACAGGGAAGCAGCAGGTGAGTCCCATCAGGAACCTCTCTGATGGA
CAGAAGTGCCGTGTGTGCTTTGCCTGGCTGGCCTGGCAGAACCCTCACATGCTCTTCCTG
GATGAGCCCACCAACCACCTGGACATAGAAACTATAGATGCACTGGCAGATGCTATCAAT
GAGTTTGAGGGAGGAATGATGCTTGTCAGCCATGACTTCAGACTCATCCAACAGGTTGCA
CAGGAAATCTGGGTCTGTGAGAAGCAGACAATCACCAAGTGGCAAGGGGACATCCTTGCC
TACAAGGAGCATCTCAAGTCGAAGCTGGTGGATGAGGACCCGCAGCTCACCAAACGGACC
CACAACGTGTGA
>LOC110475116

In contrast, all genes for which the error occurs (and dNdS is not calculated) have the same name in both species.

For example, ABCG1:


[mfariasv@login005 dNdS]$ sed -n -e '/ABCG1/,/>/ p' ZFcdsorth.fa
>ABCG1
ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTTGGGTGGCAGCAGTTCTGGT
TGCACCATGGCCGAGCCAAAGTCTGTGTGTGTTTCTGTGGACGAGGTGGTCTCCAATGGC
ACAGACACCCAGGACATCCGACTCATCAATGGACACTTAAAAAAAGTGGACAATGCTCTG
ACAGAAGCCCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
GAACTGTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAGACCCTT
TTGAAAGGAATTTCAGGGAAATTCAGCAGTGGAGAACTCGTTGCAATTATGGGACCTTCA
GGAGCTGGGAAGTCAACGCTTATGAATATTCTGGCAGGATACAGAGAGACGGGGATGAAA
GGAGAAATCCTCATCAACGGGCAGCCCCGCGACCTGCGCTCCTTCCGCAAGGTCTCCTGC
TACATCATGCAGGATGACATGCTCCTTCCTCACCTCACTGTCCAGGAAGCTATGATGGTA
TCTGCTCATCTGAAACTTCAAGAGAAAGATGAAGGGAGGAGAGAAATGGTGAAGGAAATC
CTGACAGCCCTTGGTTTGCTGGCGTGTGCCAACACCAGGACTGGGAGTCTCTCAGGAGGC
CAGAGGAAGCGCCTCGCCATCGCTCTGGAGCTGGTGAACAACCCTCCTGTCATGTTCTTC
GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTCCAGGTGGTCTCTCTGATGAAG
GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACGATTCACCAGCCCAGTGCAAAACTG
TTTGAGCTCTTTGACCAGCTCTATGTTCTAAGTCAAGGTCAGTGCATTTACCGTGGGAAG
GTGACAAACCTCGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
CCAGCAGATTTTGTGATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
GTCAGGGCTGTGAGGGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGTTGGGGAG
CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGACTCATCCTCCACAGAA
GGCTGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
TTCCTCACCATCATGAGAGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
ATTGGGCTGCTCATTGGATTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
AGCAACTCGGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCGCTCATGCCG
ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCACCTGAACTACTGG
TACAGCCTAAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTCCAGATCATG
TTCCCTGTGGCTTACTGCAGCATCGTGTACTGGATGACTTCCCAGCCCTCCGACGCGCTC
CGCTTCGTCCTCTTCGCAGCCCTGGGGACCATGACATCCCTGGTGGCTCAGTCACTGGGC
CTGCTCATAGGTGCAGCCTCCACATCCCTCCAGGTGGCAACTTTTGTGGGCCCAGTTACT
GCCATCCCAGTCCTCCTGTTCTCTGGGTTTTTTGTCAGCTTTGACACCATCCCAACATAC
CTCCAGTGGATGTCCTACATTTCCTATGTCAGATATGGGTTCGAAGGAGTCATCCTCTCC
ATCTACGGACTGGATCGAGAAGATCTGCATTGTGACAAAGATGACACCTGCCACTTCCAA
AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTTTACCTGGACTTC
ATTGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
AAAATCCGAGCGGAGAGGTAA
>ABCG2
[mfariasv@login005 dNdS]$ sed -n -e '/ABCG1/,/>/ p' BFcdsorth.fa
>ABCG1
ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTCGGGTGGCAGCAGTTCTGGT
TGCACCATGGCCGAGCCAAAGTCTGTGTGTGTTTCTGTGGACGAGGTGGTCTCCAATGGC
ACAGACACCCAGGACATCCGACTCATCAATGGACACTTAAAAAAAGTGGACAATGCTCTG
ACAGAAGCTCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
GAACTCTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAAACCCTT
TTGAAAGGAATTTCAGGGAAGTTCAGCAGTGGAGAGCTCGTTGCAATTATGGGACCTTCA
GGAGCTGGGAAGTCAACGCTTATGAATATTCTGGCAGGATACAGAGAGACGGGGATGAAA
GGAGAAATCCTCATCAACGGGCAGCCCCGCGACCTGCGCTCCTTCCGCAAGGTCTCCTGC
TACATCATGCAGGATGACATGCTCCTTCCTCACCTCACTGTCCAGGAAGCTATGATGGTA
TCTGCTCATCTGAAACTTCAAGAGAAAGATGAAGGGAGGAGAGAAATGGTGAAGGAAATC
CTGACAGCCCTTGGTTTGCTGGCCTGTGCCAACACCAGGACTGGGAGCCTCTCAGGAGGC
CAGAGGAAGCGCCTCGCCATCGCTCTGGAGCTGGTGAACAACCCTCCTGTCATGTTCTTC
GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTTCAGGTGGTCTCTCTGATGAAG
GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACAATTCACCAGCCCAGTGCAAAACTG
TTTGAGCTCTTTGACCAGCTCTATGTTCTAAGTCAAGGTCAGTGCATTTACCGTGGGAAG
GTGACAAACCTTGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
CCAGCAGATTTTGTAATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
GTCAGGGCTGTGAGAGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGCTGGGGAG
CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGATTCCTCCTCCACAGAA
GGATGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
TTCCTCACCATCATGAGGGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
ATTGGGCTGCTCATTGGACTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
AGCAACTCAGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCACTCATGCCG
ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCATCTGAACTACTGG
TACAGCCTGAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTTCAGATCATG
TTCCCTGTGGCTTACTGCAGCATCGTGTACTGGATGACTTCCCAGCCCTCCGACGCGCTC
CGCTTCGTCCTCTTCGCAGCCCTGGGGACCATGACATCCCTGGTGGCTCAGTCACTGGGC
CTGCTCATAGGTGCAGCCTCCACATCCCTCCAGGTGGCAACTTTTGTGGGCCCAGTTACT
GCCATCCCAGTCCTCCTGTTCTCTGGGTTTTTTGTCAGCTTTGACACCATCCCAACATAC
CTCCAGTGGATGTCCTACATTTCCTATGTCAGATACGGGTTCGAAGGAGTCATCCTCTCC
ATCTACGGACTGGATCGAGAAGATCTGCATTGTGACAAAGATGACACCTGCCACTTCCAA
AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTCTACCTGGACTTC
ATCGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
AAAATCCGAGCGGAGAGGTAA
>ABCG2

But the sequences are indeed different:

diff ABCG1_BF ABCG1_ZF
2c2
< ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTCGGGTGGCAGCAGTTCTGGT
---
> ATGGCATGTCTGATGGCGGCTTTCTCCCTGGGCAGCGCTTTGGGTGGCAGCAGTTCTGGT
5,7c5,7
< ACAGAAGCTCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
< GAACTCTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAAACCCTT
< TTGAAAGGAATTTCAGGGAAGTTCAGCAGTGGAGAGCTCGTTGCAATTATGGGACCTTCA
---
> ACAGAAGCCCACAGGTTCTCCTACCTGCCCCGCAGGCCAGCTGTGAACATTGAGTTTAAA
> GAACTGTCCTACTCTATCCAGGAAGGGCCATGGTGGAGAAAGAAAGGTTATAAGACCCTT
> TTGAAAGGAATTTCAGGGAAATTCAGCAGTGGAGAACTCGTTGCAATTATGGGACCTTCA
12c12
< CTGACAGCCCTTGGTTTGCTGGCCTGTGCCAACACCAGGACTGGGAGCCTCTCAGGAGGC
---
> CTGACAGCCCTTGGTTTGCTGGCGTGTGCCAACACCAGGACTGGGAGTCTCTCAGGAGGC
14,15c14,15
< GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTTCAGGTGGTCTCTCTGATGAAG
< GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACAATTCACCAGCCCAGTGCAAAACTG
---
> GATGAACCAACCAGTGGCTTGGACAGTGCATCATGTTTCCAGGTGGTCTCTCTGATGAAG
> GCTTTGGCCCAGGGTGGCAGATCCATCATCTGCACGATTCACCAGCCCAGTGCAAAACTG
17,26c17,26
< GTGACAAACCTTGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
< CCAGCAGATTTTGTAATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
< GTCAGGGCTGTGAGAGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGCTGGGGAG
< CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGATTCCTCCTCCACAGAA
< GGATGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
< TTCCTCACCATCATGAGGGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
< ATTGGGCTGCTCATTGGACTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
< AGCAACTCAGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCACTCATGCCG
< ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCATCTGAACTACTGG
< TACAGCCTGAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTTCAGATCATG
---
> GTGACAAACCTCGTCCCTTACTTGAGAGATTTGGGGTTGAATTGTCCAACCTACCACAAC
> CCAGCAGATTTTGTGATGGAAGTGGCCTCGGGTGAGTACGGGGACCAGAACAGCCGCCTG
> GTCAGGGCTGTGAGGGAGAGGATTTGTGACACAGACTACAAGAGAGACGTGGTTGGGGAG
> CACGAGCTGAACCCCTTCCTCTGGCACCGGCCCTCTGAAGAGGACTCATCCTCCACAGAA
> GGCTGCCACAGCTTCTCTGCCAGCTGCCTAACCCAGTTCTGCATCCTCTTCAAAAGAACT
> TTCCTCACCATCATGAGAGACTCGGTCCTGACACACTTGAGGATCACCTCACACATTGGC
> ATTGGGCTGCTCATTGGATTGCTCTACTTGGGCATTGGCAATGAAGCCAAGAAAGTCCTC
> AGCAACTCGGGGTTCCTCTTCTTCTCCATGTTGTTCCTCATGTTTGCTGCGCTCATGCCG
> ACCGTCCTCACCTTTCCCCTTGAGATGGGAGTGTTTCTCAGAGAGCACCTGAACTACTGG
> TACAGCCTAAAAGCCTATTACCTCGCCAAAACCATGGCTGATGTTCCTTTCCAGATCATG
31c31
< CTCCAGTGGATGTCCTACATTTCCTATGTCAGATACGGGTTCGAAGGAGTCATCCTCTCC
---
> CTCCAGTGGATGTCCTACATTTCCTATGTCAGATATGGGTTCGAAGGAGTCATCCTCTCC
33,34c33,34
< AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTCTACCTGGACTTC
< ATCGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC
---
> AAATCAGAGGCCATCCTGAAAGAACTGGATGTAGAAAATGCCAAACTTTACCTGGACTTC
> ATTGTTCTTGGGATTTTCTTCTTCTCTCTGCGCCTGATTGCCTATTTTGTCCTCAGATAC

HajkD commented 2 years ago

Hi Madza,

Thank you very much for making me aware of this.

Did I understand the issue correctly that you have the same header names in two different fasta files (representing two different species), but behind each header name lies a different coding sequence? Can we assume that headers with the same name in two different species are supposed to be orthologous genes?
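
One way to check this would be to list the header IDs that occur in both files and compare them with the genes that fail. The following is only a minimal sketch: `shared_fasta_ids` is a hypothetical helper (not part of orthologr), and it assumes the ID is the first word of each header line with no space after the `>`.

```shell
# Hypothetical helper: print FASTA IDs (first word of each header)
# that are present in both input files.
shared_fasta_ids() {
  awk '
    /^>/ {
      id = substr($1, 2)                       # drop the leading ">"
      if (FILENAME == ARGV[1]) first[id] = 1   # remember IDs of file 1
      else if (id in first) print id           # report IDs also in file 2
    }
  ' "$1" "$2" | sort -u
}

# Possible usage with the file names from this issue:
# shared_fasta_ids ZFcdsorth.fa BFcdsorth.fa | wc -l
```

If the number of shared IDs roughly matches the number of pairs that fail, that would support the idea that identical header names are the trigger.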

If I understood correctly, then it seems to me that internally the wrong header name is selected when computing dNdS. Did you try renaming the headers from >ABCG1 to, e.g., >ABCG1_BF and >ABCG1_ZF? If so, does the issue persist?
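
Renaming could be done in bulk by appending a species suffix to every header. This is a rough sketch rather than a tested recipe: `suffix_fasta_ids` and the output file names are hypothetical, and the helper assumes the ID is the first word of each header.

```shell
# Hypothetical helper: append a suffix to the ID (first word) of every
# FASTA header line; sequence lines are printed unchanged.
suffix_fasta_ids() {
  awk -v suffix="$2" '
    /^>/ { $1 = $1 suffix }   # header fields are re-joined with single spaces
    { print }
  ' "$1"
}

# Possible usage (output names are placeholders):
# suffix_fasta_ids ZFcdsorth.fa _ZF > ZFcdsorth_renamed.fa
# suffix_fasta_ids BFcdsorth.fa _BF > BFcdsorth_renamed.fa
```

The renamed files can then be passed to dNdS() in place of the originals, so that no ID appears in both species.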

Would it be possible to construct a small example run with only a few sequences so that I can reproduce this issue and troubleshoot at each analysis step?
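
A small test set could be built by pulling a few named records out of each file, for instance one pair that works (ABCF2 / LOC110475106) and one that fails (ABCG1). The sketch below is an assumption about how one might do it: `fasta_pick` and the `*_small.fa` names are hypothetical, and IDs are matched against the first word of each header.

```shell
# Hypothetical helper: print only the FASTA records whose ID (first word
# of the header) is among the IDs given after the file name.
fasta_pick() {
  file=$1; shift
  awk -v ids="$*" '
    BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    /^>/ { keep = (substr($1, 2) in want) }   # decide per record at its header
    keep { print }
  ' "$file"
}

# Possible usage (output names are placeholders):
# fasta_pick ZFcdsorth.fa ABCG1 ABCF2 > ZF_small.fa
# fasta_pick BFcdsorth.fa ABCG1 LOC110475106 > BF_small.fa
```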

I hope this helps.

Cheers, Hajk