CarveMe ignores highly similar genes during the building of GPRs

Using carveme v1.5.2.

When the input proteome contains 2 or more very similar sequences, each of them well aligned (bitscore > 100) by Diamond on the same BiGG_gene (model.gene), only 1 of them is retained for subsequent processing. This in my opinion is due to the following line contained in carveme/reconstruction/scoring.py :

gene2gene = annotation.query('score > 100') \
                          .sort_values(by='score', ascending=False) \
                          .groupby('BiGG_gene', as_index=False).apply(lambda x: x.iloc[0])

The x.iloc[0] selects just 1 of the input proteins per each BiGG_gene. Suppose the Diamond output for 2 of my input proteins is the following:

query	target	bitscore	evalue	identity	ppos	qcovhsp	scovhsp
gene_26222	iCN900.CD630_31350	288.0	2.620000e-97	47.7	70.7	96.0	98.3
gene_27280	iCN900.CD630_31350	283.0	1.240000e-95	47.9	70.5	99.7	98.3

gene_26222 will be retained for subsequent processing, while gene_27280 will be discarded (bitscore 288.0 wins over 283.0). The consequence is that, when the GPR of the underlying reaction is constructed , only gene_26222 will appear. In my opinion, it would be best to replace ‘gene_26222' in the GPR with '(gene_26222 or gene_27280)' .

cdanielmachado / carveme

CarveMe ignores highly similar genes during the building of GPRs #180