RamanLab / fbc_curation_matlab

fbc_curation contains MATLAB/COBRA helpers for reproducibility of fbc models.
GNU Lesser General Public License v3.0

mangled gene names with matlab-incompatible characters #11

Open exaexa opened 4 months ago

exaexa commented 4 months ago

We've now seen a case where model gene product IDs that contain "special" characters such as - or ( get mangled by cobratoolbox, which replaces each such character with its ASCII code (for example, - becomes __45__). As a result, a curation report showed the following difference between the gene product IDs in the report and those in the model:

julia> setdiff(genes(model), genes_report)       # gene IDs that are in fbc_curation_matlab report but not in the model
13-element Vector{Any}:
 "G_YBR058C__45__A"
 "G_YCL005W__45__A"
 "G_YCR024C__45__A"
 "G_YDR322C__45__A"
 "G_YEL017C__45__A"
 "G_YER060W__45__A"
 "G_YHR001W__45__A"
 "G_YHR039C__45__A"
 "G_YLL018C__45__A"
 "G_YML081C__45__A"
 "G_YOL077W__45__A"
 "G_YPL096C__45__A"
 "G_YPR170W__45__B"

julia> setdiff(genes_report, genes(model))   # gene IDs in the model that are not in the report
13-element Vector{Any}:
 "G_YBR058C-A"
 "G_YCR024C-A"
 "G_YDR322C-A"
 "G_YEL017C-A"
 "G_YER060W-A"
 "G_YHR001W-A"
 "G_YHR039C-A"
 "G_YML081C-A"
 "G_YCL005W-A"
 "G_YOL077W-A"
 "G_YLL018C-A"
 "G_YPL096C-A"
 "G_YPR170W-B"
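
Judging only from the IDs listed above, the mangling looks like each special character being written as `__<decimal ASCII code>__`. Here is a minimal Python sketch of reversing that and of flagging IDs that look mangled. The pattern is an assumption inferred from this one report, and the names `unmangle_id` and `find_mangled` are made up for illustration; they are not part of fbc_curation_matlab or cobratoolbox.

```python
import re

# Assumption: a special character is encoded as "__<decimal ASCII code>__",
# e.g. "-" (code 45) becomes "__45__". Inferred from the IDs in this report only.
_MANGLED = re.compile(r"__(\d{1,3})__")


def unmangle_id(gene_id: str) -> str:
    """Replace every "__NN__" group with the character whose code is NN."""
    return _MANGLED.sub(lambda m: chr(int(m.group(1))), gene_id)


def find_mangled(ids):
    """Return the IDs that look mangled, so a caller can print a warning."""
    return [i for i in ids if _MANGLED.search(i)]


# IDs taken from the listings above: two mangled ones and one intact one.
report_ids = ["G_YBR058C__45__A", "G_YPR170W__45__B", "G_YBR058C-A"]

print([unmangle_id(i) for i in report_ids])
# → ['G_YBR058C-A', 'G_YPR170W-B', 'G_YBR058C-A']
print(find_mangled(report_ids))
# → ['G_YBR058C__45__A', 'G_YPR170W__45__B']
```

One caveat with this approach: an ID that legitimately contains something like `__45__` would be decoded as well, so it is probably safer to only warn by default and to rewrite the CSVs on request.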

Technically this is an easy fix (the curators "just" walk the output CSVs manually and replace the mangled representations back), but it would be great to have some automated tool for this. Or at least have a warning printed, so that the users know that either

Thanks!

PS: I think it would be better to fix this directly in cobratoolbox, but since they depend on this mangling because of their use of eval, I don't have much hope that a good solution exists there.

PPS: the model is yeast-gem, specifically this instance: https://www.ebi.ac.uk/biomodels/MODEL2204280003#Files

cc @feiranl @rsmsheriff @ntung