feat: python/matlab compatibility

BenjaSanchez commented 4 years ago

I tested how feasible it would be to save the model in python or matlab without losing any fields; the results are in this branch. I have good news and bad news. The good news is that most components (reactions/metabolites/etc) stay ordered in the same way, so it is easy to spot differences in the stored files. The bad news is that cobrapy and cobratoolbox follow many different criteria when writing their XML files. A simple case of this is the following (from python to matlab):

In red you can see the cobrapy standard (keep the order of annotations as they were sorted in the corresponding list) and in green is cobratoolbox (sort them alphabetically). This could be fixed with a python wrapper that sorts fields alphabetically before saving, and python users remembering to use said wrapper instead of the normal saving function. Another issue is the way the compartments are stored:

Here cobratoolbox uses the met[c] convention instead of met_c, but [ and ] are not allowed in SBML identifiers, so they get encoded, hence the weird characters. Again, this could be solved with a wrapper, this time in Matlab, that modifies the file after it has been saved. Many other cases like this pop up and could be handled in similar ways.
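To make the compartment fix concrete, here is a minimal sketch of the string rewrite such a wrapper would perform. It is written in Python purely for illustration (the wrapper proposed above would live in Matlab), and it assumes the "weird characters" are ASCII-code escapes of the form `__91__` for `[` and `__93__` for `]`. The function name is hypothetical.

```python
import re

# Assumption: brackets were escaped as "__<ASCII code>__",
# i.e. "[" became "__91__" and "]" became "__93__".
_ESCAPED_COMPARTMENT = re.compile(r"__91__(\w+?)__93__")


def to_underscore_notation(text: str) -> str:
    """Rewrite escaped 'met[c]' identifiers ('met__91__c__93__') as 'met_c'."""
    return _ESCAPED_COMPARTMENT.sub(r"_\1", text)


# Identifiers without the escape pattern are returned unchanged.
print(to_underscore_notation("M_glc__D__91__c__93__"))
print(to_underscore_notation("M_glc__D_c"))
```

Since the function works on any string, it could be applied to the whole saved file in one pass, but that should be checked against a real cobratoolbox export first.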

Finally, I tested whether anything is lost by going from python to Matlab and back to python. I think the only two significant changes are the compartment notation mentioned above (which we could fix with the Matlab wrapper) and that the fields SUBSYSTEM and GENE_ASSOCIATION disappear from the notes field:

But this is probably fine, as @vh-mol and I had already discussed removing duplicates anyway. So I guess the question to discuss here is what our strategy should be:

Option 1: I build matlab/python wrappers so that each of the collaborators can continue to work in their own language. I'd be happy to implement this (it would be hugely useful to the community), but you should expect things to move a bit slower, as new issues will likely pop up now and then, e.g. when you add a previously nonexistent field, or when either of the cobra packages is updated.
Option 2: Changes to the model are performed only in one language (python), and matlab contributors should provide their changes as tables to be read in python. This might work quicker from the get-go but could be slower in the long run, as it would entail additional work every time a change is desired by a matlab user.

Thoughts? @vh-mol @surtfire

surtfire commented 4 years ago

Hi @BenjaSanchez . Such a wrapper would prove useful to me to continue work in matlab and commit from there. Alternatively, if given the format for python I could send my proposed changes across manually, but I feel this would ultimately slow things down.

vh-mol commented 4 years ago

Thanks Ben!

> Option 1: I build matlab/python wrappers so that each of the collaborators can continue to work in their own language. I'd be happy to implement this (it would be hugely useful to the community), but you should expect things to move a bit slower, as new issues will likely pop up now and then, e.g. when you add a previously nonexistent field, or when either of the cobra packages is updated.

I think that this is probably a good idea? If it saves time in the long run I think it is worth doing, especially if @BenjaSanchez feels it would really be appreciated by the community as well.

One other note: removing those two fields from the notes seems fine to me, as the info is stored elsewhere anyway.

BenjaSanchez commented 4 years ago

Ok then, I will work on the wrappers during next week and get back to you if I need anything from your side. Cheers!

BenjaSanchez commented 4 years ago

I have made some progress on the python side of this task (branch here), by:

Modifying cobrapy so that it saves all annotations sorted alphabetically.
Removing redundant reaction fields in notes (subsystems and gene rules).
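As a rough illustration of these two changes, the sketch below works on plain dictionaries rather than on cobrapy objects, so it is not the code in the branch. The function names and the example values are made up, and it assumes "sorted alphabetically" refers to the annotation keys.

```python
# Notes keys reported above as redundant with information stored elsewhere.
REDUNDANT_NOTE_KEYS = ("SUBSYSTEM", "GENE_ASSOCIATION")


def sorted_annotation(annotation: dict) -> dict:
    """Return a copy of the annotation dict with keys in alphabetical order."""
    return dict(sorted(annotation.items()))


def strip_redundant_notes(notes: dict) -> dict:
    """Return a copy of the notes dict without the redundant keys."""
    return {key: value for key, value in notes.items()
            if key not in REDUNDANT_NOTE_KEYS}


# Hypothetical reaction data, for illustration only.
annotation = {"sbo": "SBO:0000176", "bigg.reaction": "PGL"}
notes = {"SUBSYSTEM": "example subsystem", "other": "kept"}

print(sorted_annotation(annotation))
print(strip_redundant_notes(notes))
```

Python dictionaries preserve insertion order, so rebuilding the dictionary from sorted items is enough to control the order in which entries are later iterated.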

@vh-mol please don't merge any new branches into master for now, as that will probably create conflicts with my branch (which changes thousands of lines). Also, please switch to my branch of cobrapy in your environment as soon as you can by running:

pip install git+git://github.com/BenjaSanchez/cobrapy.git@fix/sort-annotations --upgrade

Now to the bottleneck I'm having at the moment: 75 reactions are missing SBO terms (which is a requirement for cobratoolbox). I can assign them in the current branch, but could you confirm that they are properly sorted as "biochemical reaction" (SBO:0000176) or "translocation reaction" (SBO:0000185) as I wrote below?

Biochemical reactions (SBO:0000176):

CMTEPISO: CMTEPISO: cmtdepp_c <=> cthzp_c
DXYL5PTST: DXYL5PTST: dhgly_c + dxyl5p_c + h_c + tcscp_c <=> cmtdepp_c + 2.0 h2o_c + scpgg_c
LCTST: LCTST: cys__L_c + enzcys_c <=> ala__L_c + enzscys_c
ATPTAT: ATPTAT: atp_c + h_c + scpgg_c <=> ascp_c + ppi_c
LPROQOR: LPROQOR: pro__L_c + ubiquin_c --> pyr5c_c + qh2_c
BTNLIG: BTNLIG: atp_c + btn_c + h_c --> b5amp_c + ppi_c
BTN5AMPL: BTN5AMPL: b5amp_c + h2o_c <=> amp_c + btn_c + 2.0 h_c
MDH2: MDH2: mal__L_c + ubiquin_c --> oaa_c + qh2_c
OBO2OR: OBO2OR: 2obut_c + 2.0 h_c + o2_c + pi_c --> co2_c + h2o2_c + ppap_c
LALDPOR: LALDPOR: lald__L_c + 2.0 nad_c <=> mthgxl_c + 2.0 nadh_c
PHEAOR: PHEAOR: h2o_c + phe__D_c --> 2.0 h_c + nh4_c + phpyr_c
PYRLLOR: PYRLLOR: h_c + lpam_c + pyr_c --> adhlam_c + co2_c
MOX: MOX: mal__L_c + o2_c --> h2o2_c + oaa_c
TRPS3: TRPS3: 3ig3p_c --> g3p_c + indole_c
PGL: PGL: 6pgl_c + h2o_c --> 6pgc_c
MAHMPDC: MAHMPDC: 2mahmp_c + cthzp_c --> co2_c + h_c + ppi_c + thmmp_c
GSPMDS: GSPMDS: atp_c + gthrd_c + spmd_c --> adp_c + gtspmd_c + h_c + pi_c
OOR3r: OOR3r: akg_c + coa_c + 2.0 fdxox_c --> co2_c + 2.0 fdxrd_c + h_c + succoa_c
SUCOAACTr: SUCOAACTr: ac_c + succoa_c <=> accoa_c + succ_c
ACCOAACT: ACCOAACT: acoa_c + glyc3p_c <=> aglyc3p_c + coa_c
ACCOATT: ACCOATT: acoa_c + aglyc3p_c <=> coa_c + pa_EC_c
CTPPCT: CTPPCT: ctp_c + 5.0 h_c + pa_EC_c <=> cdpdag_c + ppi_c
GLYCNOR: GLYCNOR: glyclt_c + nad_c <-- glx_c + h_c + nadh_c
B23DONOR: B23DONOR: nad_c + rr23bdo_c <=> actn__R_c + h_c + nadh_c
ACTD: ACTD: actn__R_c + nad_c <=> diact_c + h_c + nadh_c
ACTNAT: ACTNAT: actn_c + coa_c + nad_c <=> acald_c + accoa_c + h_c + nadh_c
ACEDIA: ACEDIA: alac__S_c <=> co2_c + diact_c + h_c + hacc_c
HPI: HPI: hpyr_c <=> hop_c
HOXPRm: HOXPRm: glyc__R_c + nad_c <=> h_c + hop_c + nadh_c
ACOAD1: ACOAD1: btcoa_c + nad_c <=> b2coa_c + h_c + nadh_c
ECOAH1: ECOAH1: 3hbcoa_c <=> b2coa_c + h2o_c
ACTCO2L: ACTCO2L: acetone_c + atp_c + co2_c + 2.0 h2o_c --> acac_c + amp_c + 3.0 h_c + 2.0 pi_c
GLC__Dtpts: GLC__Dtpts: glc__D_e + pep_c --> g6p_c + pyr_c
FRUtpts: FRUtpts: fru_e + pep_c --> f6p_c + h_c + pyr_c
ARAB__Ltabc: ARAB__Ltabc: arab__L_e + atp_c + h2o_c --> adp_c + arab__L_c + h_c + pi_c
XYL__Dtabc: XYL__Dtabc: atp_c + h2o_c + xyl__D_e --> adp_c + h_c + pi_c + xyl__D_c
GALtabc: GALtabc: atp_c + gal_e + h2o_c --> adp_c + gal_c + h_c + pi_c
MNLtpts: MNLtpts: mnl_e + pep_c --> mnl1p_c + pyr_c
CELLBtpts: CELLBtpts: cellb_e + pep_c --> cellb6p_c + pyr_c
SUCRtpts: SUCRtpts: pep_c + sucr_e --> h_c + pyr_c + suc6p_c
RIB__Dtabc: RIB__Dtabc: atp_c + h2o_c + rib__D_e --> adp_c + h_c + pi_c + rib__D_c
MANtpts: MANtpts: man_e + pep_c --> man6p_c + pyr_c
SBTtpts: SBTtpts: pep_c + sbt__D_e --> h_c + pyr_c + sbt6p_c
ACGAMtpts: ACGAMtpts: acgam_e + pep_c --> acgam6p_c + pyr_c
ARBTtpts: ARBTtpts: arbt_e + pep_c --> arbt6p_c + h_c + pyr_c
SALCNtpts: SALCNtpts: pep_c + salcn_e --> pyr_c + salcn6p_c
MALTtabc: MALTtabc: atp_c + h2o_c + malt_e --> adp_c + h_c + malt_c + pi_c
MALTtpts: MALTtpts: malt_e + pep_c --> malt6p_c + pyr_c
TREtpts: TREtpts: pep_c + tre_e --> pyr_c + tre6p_c
GTBIHY: GTBIHY: gtbi_c + h2o_c --> 2.0 glc__D_c
TURAHY: TURAHY: h2o_c + tura_c --> fru_c + glc__D_c
KDG2R: KDG2R: glcn__D_c + nadp_c <=> h_c + kdg2_c + nadph_c
DGLCN5R: DGLCN5R: dglcn5_c + h_c + nadph_c --> glcn__D_c + nadp_c
BGAL: BGAL: h2o_c + mdgp_c --> glc__D_c + meoh_c
ALCD1: ALCD1: meoh_c + nad_c <=> fald_c + h_c + nadh_c
TAGtpts: TAGtpts: pep_c + tag__D_e --> h_c + pyr_c + tag1p__D_c
TAG1PK: TAG1PK: atp_c + h_c + tag1p__D_c --> adp_c + tagdp__D_c
TGBPA: TGBPA: tagdp__D_c <=> dhap_c + g3p_c
SUCD5: SUCD5: fadh2_c + 2.0 h_e + ubiquin_c --> fad_c + 2.0 h_c + qh2_c
NADHDH: NADHDH: 4.5 h_c + nadh_c + ubiquin_c --> 3.5 h_e + nad_c + qh2_c
CYTBO3: CYTBO3: 2.5 h_c + 0.5 o2_c + qh2_c --> h2o_c + 2.5 h_e + ubiquin_c

Translocation reactions (SBO:0000185):

GLYCt: GLYCt: glyc_e <=> glyc_c
RMNt: RMNt: h_e + rmn_e --> h_c + rmn_c
MELIBt: MELIBt: h_e + melib_e --> h_c + melib_c
GTBIt: GTBIt: gtbi_e <=> gtbi_c
TURAt: TURAt: tura_e --> tura_c
KDG2t: KDG2t: kdg2_e --> kdg2_c
DGLCN5t: DGLCN5t: dglcn5_e --> dglcn5_c
MDGPt: MDGPt: mdgp_e --> mdgp_c
AMYt: AMYt: amylose_e --> amylose_c
PYRt: PYRt: h_e + pyr_e <=> h_c + pyr_c
FORt: FORt: for_e + h_e <=> for_c + h_c
PIt: PIt: h_e + pi_e <=> h_c + pi_c
Kt: Kt: h_e + k_c --> h_c + k_e
Kt2: Kt2: k_e <=> k_c

vh-mol commented 4 years ago

Thanks for doing this! I will hold off on merging any branches until you give the cue. I've run the update in my environment as instructed. I just get the following error messages at the end: I'm not sure how much of a problem this is. Should I just update my pandas, escher and cameo first and rerun this?

About the SBO annotations: all reactions in the translocation list are correct. However, the reactions that end in 'tabc' or 'tpts' are translocation reactions via ABC transporters or the PTS system (they are currently sorted as biochemical reactions). As they seem to be a bit of both, I checked how these are categorized in the iML1515 model, and there these types of reactions have the SBO:0000185 term. I guess this makes sense, as they are not purely metabolic. I think this would be the way to do it, but I am not sure. If you have a better solution, feel free to use it @BenjaSanchez.

BenjaSanchez commented 4 years ago

@vh-mol regarding the conflicting dependencies, for now just continue as normal, and if you run into errors when using some of those packages, try updating the specific one. If that creates further problems, let me know through here. Maybe we should just have a requirements.txt file to ensure your environment, mine and everyone else's are the same :)

Regarding PTS/ABC transport reactions, how about we use SBO:0000655? Judging from the SBO documentation, I believe it's the most appropriate:

SBO:0000176 (biochemical reaction): An event involving one or more chemical entities that modifies the electrochemical structure of at least one of the participants.
SBO:0000185 (translocation reaction): Movement of a physical entity without modification of the structure of the entity.
SBO:0000655 (transport reaction): The movement of an entity/entities across a biological membrane mediated by a transporter protein.

https://www.ebi.ac.uk/sbo/main/SBO:0000655
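For what it's worth, the classification discussed in this thread can be approximated with a small heuristic. This is only a sketch under stated assumptions: the function name is hypothetical, metabolite IDs are assumed to end in `_<compartment>`, and the 'tabc'/'tpts' suffix rule is simply the naming convention mentioned above. Any suggestion it makes would still need manual review.

```python
BIOCHEMICAL = "SBO:0000176"
TRANSLOCATION = "SBO:0000185"
TRANSPORT = "SBO:0000655"


def suggest_sbo(reaction_id: str, stoichiometry: dict) -> str:
    """Suggest an SBO term from a reaction ID and {metabolite_id: coefficient}."""
    # ABC/PTS transporters are recognized by their ID suffix (assumed convention).
    if reaction_id.endswith(("tabc", "tpts")):
        return TRANSPORT
    # Sum coefficients per metabolite, ignoring the compartment suffix.
    net = {}
    for met_id, coefficient in stoichiometry.items():
        base, _, _compartment = met_id.rpartition("_")
        net[base] = net.get(base, 0.0) + coefficient
    # If every species cancels out, nothing was modified, only moved.
    if net and all(abs(total) < 1e-9 for total in net.values()):
        return TRANSLOCATION
    return BIOCHEMICAL


print(suggest_sbo("GLC__Dtpts", {"glc__D_e": -1, "pep_c": -1, "g6p_c": 1, "pyr_c": 1}))
print(suggest_sbo("Kt", {"h_e": -1, "k_c": -1, "h_c": 1, "k_e": 1}))
print(suggest_sbo("PGL", {"6pgl_c": -1, "h2o_c": -1, "6pgc_c": 1}))
```

Reactions such as NADHDH, which move protons but also modify other species, fall through to the biochemical term here because their net stoichiometry does not cancel out.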

vh-mol commented 4 years ago

@BenjaSanchez sounds good, both the requirements.txt and SBO:0000655! But right now we also have a 'Dependencies' section in the README.md file. We could consider updating that as well, unless you think it will become a large piece of text, in which case a separate document that we refer to may be better?

BenjaSanchez commented 4 years ago

@vh-mol the advantage of having that file is that you can just pip install -r requirements.txt from scratch and you get all packages, so we would update the README to say that ;)
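To illustrate how a pinned requirements.txt helps keep environments identical, here is a minimal sketch that compares `name==version` pins against what is installed. The helper names and the pins in the usage example are hypothetical; only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Collect 'name==version' pins, skipping comments and unpinned lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Map package name to (wanted, installed) where the two differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Hypothetical pins, for illustration only.
example = "pandas==1.0.0  # made-up version\ncobra\n"
print(find_mismatches(parse_pins(example)))
```

In practice `pip install -r requirements.txt` already resolves this; the check is only useful for spotting why two setups behave differently.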

vh-mol commented 4 years ago

> @vh-mol the advantage of having that file is that you can just pip install -r requirements.txt from scratch and you get all packages, so we would update the README to say that ;)

Ah okay! Makes sense, sounds good :)

BenjaSanchez commented 4 years ago

I created PR #54 with all changes from the python side. Things included there besides what I already mentioned here:

Sort reaction stoichiometry (according to the metabolite order in the file)
Sort groups (alphabetically)
Sort group members (according to the reaction order in the file)
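The three sorting rules can be sketched on plain data structures as follows. This is not the code from the PR: the function names are hypothetical, and the inputs are assumed to be a dict of coefficients plus lists giving the metabolite and reaction order of the file.

```python
def sort_stoichiometry(stoichiometry: dict, metabolite_order: list) -> dict:
    """Order a reaction's metabolites by their position in the file."""
    position = {met_id: i for i, met_id in enumerate(metabolite_order)}
    return dict(sorted(stoichiometry.items(),
                       key=lambda item: position[item[0]]))


def sort_groups(groups: dict, reaction_order: list) -> dict:
    """Order groups alphabetically and their members by reaction order."""
    position = {rxn_id: i for i, rxn_id in enumerate(reaction_order)}
    return {name: sorted(groups[name], key=position.__getitem__)
            for name in sorted(groups)}


# Small made-up example using reaction and metabolite IDs from this thread.
print(sort_stoichiometry({"pyr_c": 1, "pep_c": -1}, ["pep_c", "pyr_c"]))
print(sort_groups({"b": ["PGL", "MDH2"], "a": ["PGL"]}, ["MDH2", "PGL"]))
```

Sorting against the order already present in the file, rather than alphabetically, keeps diffs small when a reaction or metabolite is only edited and not moved.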

Hopefully no more changes on cobrapy will be required. I will now work on the Matlab part, keep you posted!

surtfire commented 4 years ago

Thanks for preparing that for us Ben, let me know when the Matlab side is ready and I will upload my changes and post issues.

BenjaSanchez commented 4 years ago

@surtfire @vh-mol done with all changes: saving functions on both the python & matlab side have been adapted to allow compatibility. Please both review the changes (PR #56) and check if it works in your local setups. Cheers!

biosustain / p-thermo

feat: python/matlab compatibility #46