When the input proteome contains 2 or more very similar sequences, each of them well aligned (bitscore > 100) by Diamond on the same BiGG_gene (model.gene), only 1 of them is retained for subsequent processing.
This in my opinion is due to the following line contained in carveme/reconstruction/scoring.py :
The x.iloc[0] selects just 1 of the input proteins per each BiGG_gene.
Suppose the Diamond output for 2 of my input proteins is the following:
query
target
bitscore
evalue
identity
ppos
qcovhsp
scovhsp
gene_26222
iCN900.CD630_31350
288.0
2.620000e-97
47.7
70.7
96.0
98.3
gene_27280
iCN900.CD630_31350
283.0
1.240000e-95
47.9
70.5
99.7
98.3
gene_26222 will be retained for subsequent processing, while gene_27280 will be discarded (bitscore 288.0 wins over 283.0).
The consequence is that, when the GPR of the underlying reaction is constructed , only gene_26222 will appear.
In my opinion, it would be best to replace ‘gene_26222' in the GPR with '(gene_26222 or gene_27280)' .
Using carveme v1.5.2.
When the input proteome contains 2 or more very similar sequences, each of them well aligned (bitscore > 100) by Diamond on the same BiGG_gene (model.gene), only 1 of them is retained for subsequent processing. This in my opinion is due to the following line contained in carveme/reconstruction/scoring.py :
The x.iloc[0] selects just 1 of the input proteins per each BiGG_gene. Suppose the Diamond output for 2 of my input proteins is the following:
gene_26222 will be retained for subsequent processing, while gene_27280 will be discarded (bitscore 288.0 wins over 283.0). The consequence is that, when the GPR of the underlying reaction is constructed , only gene_26222 will appear. In my opinion, it would be best to replace ‘gene_26222' in the GPR with '(gene_26222 or gene_27280)' .