Maybe we can even extend that and implement some kind of automated download from, e.g., BiGG and ModelSEED, which gets loaded into that database and can then be used by all scripts. It might be worth researching which of the databases commonly used in our work can be downloaded and stored automatically. I know that Finn and Reihaneh implemented some kind of download in the MCC tool. Maybe we can adapt from that?
I just found out that BioServices has a module called `BiGG`. I think the automatic download of the reactions and metabolites from BiGG could be implemented with this module. The source code can be seen here: https://bioservices.readthedocs.io/en/main/_modules/bioservices/bigg.html
Do you think that this would be faster than just using `requests` and the API directly (as I showed in #52)?
Well, so far I have not tried to get the ID mapping from KEGG/BioCyc IDs to BiGG via the API. Additionally, you mentioned before that it might be better to download the files containing the 'databases' for BiGG metabolites and reactions on demand. Hence, I thought it might be good to use this API for that issue.
I looked today into the issue of adding the BiGG reactions and metabolites tables to the database. If one accesses the API via the Python package `requests` using the URLs 'http://bigg.ucsd.edu/api/v2/universal/reactions' and 'http://bigg.ucsd.edu/api/v2/universal/metabolites', the result is a list of dictionaries, each containing the 'bigg_id', 'name' and 'model_id'. However, for our processing in `gapfill` the 'model_id' is not important, but the external database links are, and these are not contained in the results. To get them, the API could be queried again, which could be computationally demanding as each reaction/metabolite would need to be queried individually. Another solution would be to download the TXT files with `requests` and add the tables to the database.
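For illustration, a minimal sketch of the first approach, assuming (as described above) that the response JSON wraps the entries in a 'results' list:

```python
import requests

# Fetch all reactions of the BiGG 'universal' model in one request.
response = requests.get('http://bigg.ucsd.edu/api/v2/universal/reactions')
response.raise_for_status()
reactions = response.json()['results']  # dicts with 'bigg_id', 'name', 'model_id'

print(len(reactions))  # number of universal reactions
print(reactions[0])    # note: no external database links in these entries
```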
One major difference between the TXT files and the API version is that the API only returns the reactions/metabolites of the 'universal' model. However, I compared the number of entries for the reactions, and it seems to be the same.
I think the best way to handle this is to fetch the TXT file with `requests` each time refineGEMs is run. That is still a bit expensive, but far less than fetching each entity by itself. The advantage of always fetching the TXT file is that the local database stays as up to date as possible. Maybe we can implement this for ModelSEED as well. There is a table in the `data` folder, but this table is not updated at the moment.
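A minimal sketch of that refresh step, assuming the TXT dump lives under BiGG's static namespace URL and is tab-separated; the local table name is illustrative:

```python
import sqlite3
import pandas as pd

# Download the current BiGG reactions dump and replace the local table.
url = 'http://bigg.ucsd.edu/static/namespace/bigg_models_reactions.txt'
bigg_reactions = pd.read_csv(url, sep='\t')

with sqlite3.connect('data.db') as con:
    bigg_reactions.to_sql('bigg_reactions', con,
                          if_exists='replace', index=False)
```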
In the MCC tool, automated access to the databases is the first step: all databases are downloaded and stored locally. Maybe we can adapt some of that code?
OK, then I will implement it like that. I had the idea to only fetch content from the BiGG database when a newer version is released. However, this idea will not work for the TXT files, as these are not obtained via the API, as far as I understood it. Regarding the ModelSEED implementation and how databases are handled in MCC, I will have a look.
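For reference, the version idea mentioned above could look roughly like this; BiGG exposes a database version endpoint, but the response field and the locally stored value used here are assumptions:

```python
import requests

# Query the current BiGG database version.
info = requests.get('http://bigg.ucsd.edu/api/v2/database_version').json()

# 'last_seen_version' is a hypothetical locally stored value; the TXT
# dumps would only be re-fetched when the reported version changes.
if info.get('bigg_models_version') != last_seen_version:
    pass  # re-download the TXT files and refresh the tables
```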
Currently, only the BiGG and SBO tables are contained in the database. The `sboann` module should work as before; not much had to be changed for this module. For the BiGG tables, access is not yet established, but it will be added in the upcoming commits.
I inferred that @famosab obtained the TSV file for ModelSEED from the respective GitHub page. On this site https://modelseed.org/about/version the API access is mentioned, but I am not quite sure how to obtain a table similar to the TSV file. It seems to me that in MCC the file is also downloaded via GitHub if no other path is provided. 🤔
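If we stick with the GitHub route, a sketch could look like this; the raw URL is an assumption about the ModelSEEDDatabase repository layout:

```python
import pandas as pd

# Fetch the ModelSEED compounds TSV directly from the GitHub repository.
url = ('https://raw.githubusercontent.com/ModelSEED/ModelSEEDDatabase/'
       'master/Biochemistry/compounds.tsv')
modelseed_compounds = pd.read_csv(url, sep='\t')
```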
Additionally, do we want to integrate the media CSV file as well? I just found out that pandas can directly connect to a SQL database. So we could offer the user a function, maybe in `io`, that adds a medium from a CSV file and also exports a CSV file that could be used with, e.g., CarveMe's `gapfill` function. What do you think?
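Such `io` helpers might look roughly like this; the function names and the `medium2substance` table/column layout are illustrative assumptions, not the actual implementation:

```python
import sqlite3
import pandas as pd

def add_medium_from_csv(csv_path: str, db_path: str = 'data.db'):
    """Load a medium definition from a CSV file into the database."""
    medium = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as con:
        medium.to_sql('medium2substance', con, if_exists='append', index=False)

def export_medium_to_csv(medium_name: str, csv_path: str,
                         db_path: str = 'data.db'):
    """Write one medium back out as CSV, e.g. for CarveMe's gapfill."""
    with sqlite3.connect(db_path) as con:
        medium = pd.read_sql('SELECT * FROM medium2substance WHERE medium = ?',
                             con, params=(medium_name,))
    medium.to_csv(csv_path, index=False)
```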
Currently, only the database set-up is done, and only the `sboann` module actively uses the database, as only minimal changes were necessary there. Regarding the implementation/integration in `gapfill`, I would do this in the `gapfill` branch.
@famosab For the modules `modelseed` and `growth`, I wanted to ask if you could maybe do that. As you wrote these modules, I think it will be easier for you to integrate the database part. For the `io` functions, I would say we will see which of us has time for that, if that is OK for you, too.
I will review the PR, and I can implement the changes in `modelseed` and `growth`. I do not have much time now, but I will have a look at your implementation in `gapfill` and try to adjust it to "my modules".
As one function in `gapfill` needs the required table as a data frame, a function that loads a table by name from the database `data.db` was added to `io`. The same function can be used to access the ModelSEED compounds table; thus, for now, this is how `modelseed` is connected to `databases`/`data.db`. As `modelseed` only needs the rows with BiGG identifiers, an SQL query is used to extract these rows.
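A rough sketch of that access pattern; the function, table and column names (`load_table`, `modelseed_compounds`, `bigg`) are placeholders, not the actual implementation:

```python
import sqlite3
import pandas as pd

def load_table(table_name: str, db_path: str = 'data.db') -> pd.DataFrame:
    """Load a whole table from the database as a data frame."""
    with sqlite3.connect(db_path) as con:
        return pd.read_sql(f'SELECT * FROM {table_name}', con)

# For modelseed, only the rows carrying a BiGG identifier are extracted:
with sqlite3.connect('data.db') as con:
    modelseed_bigg = pd.read_sql(
        'SELECT * FROM modelseed_compounds WHERE bigg IS NOT NULL', con)
```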
As the table for the media is implemented slightly differently in `data.db`, I will have to take another look at how to quickly integrate `data.db` with `growth`.
Connected `gapfill`, `modelseed` and `growth` successfully to the database `data`.
The only remaining task is the addition of the `io` functions stated in the description of this issue.
All the duplicated commits since the last comment are the result of my attempt to rebase these commits from `dev` to another branch. However, this did not work, and thus I will cherry-pick the corresponding commits to the other branch. This will likely add the same commits to this issue again after this comment.
I removed H3BO3 from the `substance2db` table as I did not find any BiGG, KEGG or SEED identifiers for this substance.
I removed all entries for the substances `EDTA`, `Phenol Red`, `Resazurin` and `H3BO3`. These substances do not seem to affect growth but are only needed by lab researchers. `2,2'-Bipyridine` remains disconnected in the `medium2substance` table for now, as I was unsure whether to remove it.
As `2,2'-Bipyridine` was added to the SNM3 according to the paper to mimic iron deficiency, this compound was integrated into the database with the corresponding formula and database identifiers. No BiGG or KEGG identifiers exist for `2,2'-Bipyridine`, so I only added identifiers from the SEED and MetaNetX namespaces.
@NantiaL provided the full flux table for the medium 'Blood', as the version from the paper 'New workflow predicts drug targets against SARS-CoV-2 via metabolic changes in infected cells' only contains the fluxes and metabolites of the medium 'Blood' that were relevant for the study.
In the case of `L-Cysteine` and `L-Cystine`, the identifiers were incorrect, and thus the flux could not be mapped via the identifier. In this case, I mapped the fluxes according to the provided full name.
For `Sorbitol`, we currently have one identifier for `L-Sorbitol` and one for `D-Sorbitol`. Hence, I mapped the flux for `Sorbitol` to both identifiers.
All other identifiers and full names could be directly mapped to exactly one identifier already in the database.
In/Out functions for the database should work now, @GwennyGit.
Additionally, `format_substance_table` was added to allow reformatting of the substance table. The format `growth_old` is a first attempt at creating a table that works with the current growth module, but it has yet to be tested.
To produce an entry for the documentation, I would suggest using the 'documentation' type in `load_substance_table_from_db`, but that is currently only an idea.
NOTE: One reformatting step is still missing for a working conversion to the growth module, but I am on it. -> DONE. Suggestions and wishes to improve/extend the formatting of the output tables are welcome.
Added a writer for reST tables for the medium table to the medium class, which can now be called by using `export_to_file` with the type `rst` or `docs`.
@GwennyGit - please check whether the table is how you wanted it and whether the default sizes fit or should be adjusted.
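A hedged usage sketch; the exact signature of `export_to_file` is assumed here:

```python
# Export the medium table as a reST table for the documentation.
medium.export_to_file(filename='medium_table.rst', type='rst')
```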
All listed tasks are done — closing issue.
Feature request

As refineGEMs already has a database to store the SBO annotation tables for the module SBO annotation, the idea would be to expand this database to also contain the files necessary for other modules, thus reducing the accumulation of files by storing everything the modules need in one place. As @famosab already suggested creating a submodule `database.py` in issue #43, this module could also be used to initialise and load the internal refineGEMs database.

Required steps:

- Move the database into the `data` folder & create the submodule `databases.py`
- `databases.py` should contain the following functions:
  - Functions to connect `modelseed`, `growth` and `gapfill` to the database
  - Functions for `io`:
    - Load a table from `data`
    - Add a table to `data` -> For development/maintenance
- `load_medium_custom` could be changed or left as is? 🤔
- Change `growth.py` to use either a user-specified medium or a medium from `data.db`? 🤔
- Rename the table `media` to `medium` & change the column name `medium` to `name`
  - Add a column `reference`
- Add a table `substance` -> Contains the columns `id | substance | formula`
  - Add the `formula` for each substance
- Add a table `substance2db` -> Contains the columns `substance_id | db_id | db_type` -> More rows for more identifier namespaces can be added
- Rename the table `media_compositions` to `medium2substance` & rearrange it to fit the other media-related tables (`BiGG` & `substance`):
  - Change `medium_id` to be the first column
  - Change `substance_id` to refer to the `id` in `substance` & move it to the second column
  - Add a column `flux` to enable growth analyses with per-medium-defined fluxes
  - Add a column `origin` to link the substances to their medium-specific names / derived compound names
  - Rename `origin` to `source`
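For concreteness, a hedged sketch of how the proposed media-related tables could be created; only the table and column names come from the list above, while the column types and the SQLite set-up are assumptions:

```python
import sqlite3

# Proposed schema; types are assumed, names follow the list above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS medium (
    id        INTEGER PRIMARY KEY,
    name      TEXT,
    reference TEXT
);
CREATE TABLE IF NOT EXISTS substance (
    id        INTEGER PRIMARY KEY,
    substance TEXT,
    formula   TEXT
);
CREATE TABLE IF NOT EXISTS substance2db (
    substance_id INTEGER REFERENCES substance(id),
    db_id        TEXT,
    db_type      TEXT  -- e.g. 'BiGG', 'KEGG', 'SEED', 'MetaNetX'
);
CREATE TABLE IF NOT EXISTS medium2substance (
    medium_id    INTEGER REFERENCES medium(id),
    substance_id INTEGER REFERENCES substance(id),
    flux         REAL,
    source       TEXT
);
"""

with sqlite3.connect('data.db') as con:
    con.executescript(SCHEMA)
```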