SysBioChalmers / yeast-GEM

The consensus GEM for Saccharomyces cerevisiae
http://sysbiochalmers.github.io/yeast-GEM/
Creative Commons Attribution 4.0 International
96 stars 46 forks source link

feat: gene name compliance with cobrapy #216

Closed BenjaSanchez closed 4 years ago

BenjaSanchez commented 4 years ago

Main improvements in this PR:

This partly addresses #102: After the merge of https://github.com/opencobra/cobrapy/pull/685, every component in yeast-GEM is properly parsed when loaded into cobrapy, with the exception of gene names (notebook with the 1st analysis here): this is because gene names are stored by cobratoolbox in the fbc:label field in the xml file, whereas cobrapy looks for these in the fbc:name field. So instead, I've adapted the Matlab saving wrapper so that it copies model.geneNames onto model.proteins, as that is translated to the compliant fbc:name field. By doing this we won't loose any information, as in model.proteins we had placeholders until now. The only downside is that the xml will now have duplicated info:

<fbc:geneProduct metaid="G_YPL078C" fbc:id="G_YPL078C" fbc:name="ATP4" fbc:label="ATP4"/>

But as we now over-write the protein info with the wrapper, and cobrapy only reads one of them, we should not have any issues. Also, from now on no need to redefine each time the model.proteins field @feiranl.

I hereby confirm that I have:

BenjaSanchez commented 4 years ago

@Midnighter do you foresee any potential issues with this decision? were you aware of this difference between cobrapy and cobratoolbox on how the gene names are stored in the xml file?

Midnighter commented 4 years ago

I was not aware of this difference. Taking a look at the FBC specification, the label field is a required attribute so practically it seems like the better choice. Since it is required, cobrapy must also set it when writing to SBML, though, doesn't it?

Can you please raise an issue on cobrapy about it?

BenjaSanchez commented 4 years ago

@Midnighter it does indeed set it, just equal to id from what I see in other models. I believe what cobrapy does currently (storing the gene name in the name field) is the better suited choice, and therefore this issue should maybe be raised on cobratoolbox instead. But let me know if you have a different view.

edkerk commented 4 years ago

IMHO, the proposed solution here should just be a workaround, as geneNames and proteins are not necessarily the same thing (alternative splicing can give different proteins). To solve this, either COBRA Toolbox or cobrapy should modify their code. FYI, RAVEN also stores it in label, to follow compatibility with COBRA toolbox.

BenjaSanchez commented 4 years ago

@edkerk I agree, this is a temporal workaround to guarantee that cobrapy users can get all model fields they need. The permanent solution is to have cobratoolbox and cobrapy agree with one format, and as soon as that happens we can prescind from this change and entirely remove the proteins field (which holds no info atm). I can open an issue in our repo to remember to do that. Let me know of you think another approach would be better.

Midnighter commented 4 years ago

From my perspective it's a good idea to coordinate on this. I don't really care if it's an issue on the toolbox or on cobrapy.

edkerk commented 4 years ago

Following @Midnighter's earlier comment, label is required, which means that it might be best if cobrapy adjusts to adhere to this?

BenjaSanchez commented 4 years ago

@edkerk I have now opened https://github.com/opencobra/cobrapy/issues/950 for discussion and #217 as a to-do in this repo once a solution is found. If anything else is needed let me know in a review :)