cdanielmachado / carveme

CarveMe: genome-scale metabolic model reconstruction
Other
144 stars 49 forks source link

CarveMe ignores highly similar genes during the building of GPRs #180

Open lazzarigioele opened 1 year ago

lazzarigioele commented 1 year ago

Using carveme v1.5.2.

When the input proteome contains 2 or more very similar sequences, each of them well aligned (bitscore > 100) by Diamond on the same BiGG_gene (model.gene), only 1 of them is retained for subsequent processing. This in my opinion is due to the following line contained in carveme/reconstruction/scoring.py :

gene2gene = annotation.query('score > 100') \
                          .sort_values(by='score', ascending=False) \
                          .groupby('BiGG_gene', as_index=False).apply(lambda x: x.iloc[0])

The x.iloc[0] selects just 1 of the input proteins per each BiGG_gene. Suppose the Diamond output for 2 of my input proteins is the following:

query target bitscore evalue identity ppos qcovhsp scovhsp
gene_26222 iCN900.CD630_31350 288.0 2.620000e-97 47.7 70.7 96.0 98.3
gene_27280 iCN900.CD630_31350 283.0 1.240000e-95 47.9 70.5 99.7 98.3

gene_26222 will be retained for subsequent processing, while gene_27280 will be discarded (bitscore 288.0 wins over 283.0). The consequence is that, when the GPR of the underlying reaction is constructed , only gene_26222 will appear. In my opinion, it would be best to replace ‘gene_26222' in the GPR with '(gene_26222 or gene_27280)' .