Maybe we can even extend that and implement some kind of automated download from, e.g., BiGG and ModelSEED, which gets loaded into that database and can then be used by all scripts. It might be worth researching which of the databases commonly used in our work can be downloaded and stored automatically. I know that Finn and Reihaneh implemented some kind of download in the MCC tool. Maybe we can adapt from that?
I just found out that BioServices has a module called `BiGG`. I think the automatic download of the reactions and metabolites from BiGG could be implemented with this module. The source code can be seen here: https://bioservices.readthedocs.io/en/main/_modules/bioservices/bigg.html
Do you think that this would be faster than just using `requests` and the API directly (as I showed in #52)?
Well, so far I have not tried to get the ID mapping from KEGG/BioCyc IDs to BiGG via the API. Additionally, you mentioned before that it might be better to download the files containing the 'databases' for BiGG metabolites and reactions on demand. Hence, I thought it might be good to use this API for that issue.
I looked today into the issue of adding the BiGG reactions and metabolites tables to the database. If one accesses the API via the Python package `requests` using the URLs 'http://bigg.ucsd.edu/api/v2/universal/reactions' and 'http://bigg.ucsd.edu/api/v2/universal/metabolites', the result is a list of dictionaries, each containing the 'bigg_id', 'name' and 'model_id'. However, for our processing in `gapfill` the 'model_id' is not important, but the external database links are, and these are not contained in the results. To get them, the API could be queried again, which could be computationally demanding as each reaction/metabolite would need to be queried individually. Another solution would be to download the TXT files with `requests` and add the tables to the database.
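For illustration, a minimal sketch of the first approach, assuming (as described above) that the response JSON wraps the entries in a 'results' list:

```python
import requests

# Fetch all reactions of the BiGG 'universal' model in one request.
response = requests.get('http://bigg.ucsd.edu/api/v2/universal/reactions')
response.raise_for_status()
reactions = response.json()['results']  # dicts with 'bigg_id', 'name', 'model_id'

print(len(reactions))  # number of universal reactions
print(reactions[0])    # note: no external database links in these entries
```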
One major difference between the TXT files and the API version is that the API only returns the reactions/metabolites of the 'universal' model. However, I compared the number of entries for the reactions, and it seems to be the same.
I think the best way to handle this is to fetch the TXT file with `requests` each time refineGEMs is run. That is still a bit expensive, but far less than fetching each entity by itself. The advantage of always fetching the TXT file is that the local database stays as up to date as possible. Maybe we can implement this for ModelSEED as well. There is a table in the `data` folder, but this table is not updated at the moment.
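A minimal sketch of that refresh step, assuming the TXT dump lives under BiGG's static namespace URL and is tab-separated; the local table name is illustrative:

```python
import sqlite3
import pandas as pd

# Download the current BiGG reactions dump and replace the local table.
url = 'http://bigg.ucsd.edu/static/namespace/bigg_models_reactions.txt'
bigg_reactions = pd.read_csv(url, sep='\t')

with sqlite3.connect('data.db') as con:
    bigg_reactions.to_sql('bigg_reactions', con,
                          if_exists='replace', index=False)
```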
In the MCC tool, automated access to the databases is the first step: all databases are downloaded and stored locally. Maybe we can adapt some of that code?
OK, then I will implement it like that. I had the idea to only fetch content from the BiGG database when a newer version is released. However, this idea will not work for the TXT files, as these are not obtained via the API, as far as I understood it. Regarding the ModelSEED implementation and how databases are handled in MCC, I will have a look.
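For reference, the version idea mentioned above could look roughly like this; BiGG exposes a database version endpoint, but the response field and the locally stored value used here are assumptions:

```python
import requests

# Query the current BiGG database version.
info = requests.get('http://bigg.ucsd.edu/api/v2/database_version').json()

# 'last_seen_version' is a hypothetical locally stored value; the TXT
# dumps would only be re-fetched when the reported version changes.
if info.get('bigg_models_version') != last_seen_version:
    pass  # re-download the TXT files and refresh the tables
```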
Currently, only the BiGG and SBO tables are contained in the database. The `sboann` module should work as before; not much had to be changed for this module. For the BiGG tables, access is not yet established, but it will be added in the upcoming commits.
I inferred that @famosab obtained the TSV file for ModelSEED from the respective GitHub page. On this site https://modelseed.org/about/version the API access is mentioned, but I am not quite sure how to obtain a table similar to the TSV file. It seems to me that in MCC the file is also downloaded via GitHub if no other path is provided. 🤔
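If we stick with the GitHub route, a sketch could look like this; the raw URL is an assumption about the ModelSEEDDatabase repository layout:

```python
import pandas as pd

# Fetch the ModelSEED compounds TSV directly from the GitHub repository.
url = ('https://raw.githubusercontent.com/ModelSEED/ModelSEEDDatabase/'
       'master/Biochemistry/compounds.tsv')
modelseed_compounds = pd.read_csv(url, sep='\t')
```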
Additionally, do we want to integrate the media CSV file as well? I just found out that pandas can directly connect to a SQL database. So we could offer the user a function, maybe in `io`, that adds a medium from a CSV file and also exports a CSV file that could be used with, e.g., CarveMe's `gapfill` function. What do you think?
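Such `io` helpers might look roughly like this; the function names and the `medium2substance` table/column layout are illustrative assumptions, not the actual implementation:

```python
import sqlite3
import pandas as pd

def add_medium_from_csv(csv_path: str, db_path: str = 'data.db'):
    """Load a medium definition from a CSV file into the database."""
    medium = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as con:
        medium.to_sql('medium2substance', con, if_exists='append', index=False)

def export_medium_to_csv(medium_name: str, csv_path: str,
                         db_path: str = 'data.db'):
    """Write one medium back out as CSV, e.g. for CarveMe's gapfill."""
    with sqlite3.connect(db_path) as con:
        medium = pd.read_sql('SELECT * FROM medium2substance WHERE medium = ?',
                             con, params=(medium_name,))
    medium.to_csv(csv_path, index=False)
```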
Currently, only the database set-up is done, and only the `sboann` module actively uses the database, as only minimal changes were necessary there. Regarding the implementation/integration in `gapfill`, I would do this in the `gapfill` branch.
@famosab For the modules `modelseed` and `growth`, I wanted to ask if you could maybe do that. As you wrote these modules, I think it will be easier for you to integrate the database part. For the `io` functions, I would say we will see which of us has time for that, if that is OK for you, too.
I will review the PR, and I can implement the changes in `modelseed` and `growth`. I do not have much time now, but I will have a look at your implementation in `gapfill` and try to adjust it to "my modules".
As one function in `gapfill` needs the required table as a data frame, a function that loads a table by name from the database `data.db` was added to `io`. The same function can be used to access the ModelSEED compounds table; thus, for now, this is how `modelseed` is connected to `databases`/`data.db`. As `modelseed` only needs the rows with BiGG identifiers, an SQL query is used to extract these rows.
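A rough sketch of that access pattern; the function, table and column names (`load_table`, `modelseed_compounds`, `bigg`) are placeholders, not the actual implementation:

```python
import sqlite3
import pandas as pd

def load_table(table_name: str, db_path: str = 'data.db') -> pd.DataFrame:
    """Load a whole table from the database as a data frame."""
    with sqlite3.connect(db_path) as con:
        return pd.read_sql(f'SELECT * FROM {table_name}', con)

# For modelseed, only the rows carrying a BiGG identifier are extracted:
with sqlite3.connect('data.db') as con:
    modelseed_bigg = pd.read_sql(
        'SELECT * FROM modelseed_compounds WHERE bigg IS NOT NULL', con)
```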
As the table for the media is implemented slightly differently in `data.db`, I will have to take another look at how to quickly integrate `data.db` with `growth`.
Connected `gapfill`, `modelseed` and `growth` successfully to the database `data`.
The only remaining task is the addition of the `io` functions stated in the description of this issue.
All the duplicated commits since the last comment are the result of my attempt to rebase these commits from `dev` to another branch. However, this did not work, and thus I will cherry-pick the corresponding commits to the other branch. This will likely add the same commits to this issue again after this comment.
I removed H3BO3 from the `substance2db` table as I did not find any BiGG, KEGG or SEED identifiers for this substance.
I removed all entries for the substances `EDTA`, `Phenol Red`, `Resazurin` and `H3BO3`. These substances do not seem to affect growth but are only needed by lab researchers. `2,2'-Bipyridine` remains disconnected in the `medium2substance` table for now, as I was unsure whether to remove it.
As `2,2'-Bipyridine` was added to the SNM3 according to the paper to mimic iron deficiency, this compound was integrated into the database with the corresponding formula and database identifiers. No BiGG or KEGG identifiers exist for `2,2'-Bipyridine`, so I only added identifiers from the SEED and MetaNetX namespaces.
@NantiaL provided the full flux table for the medium 'Blood', as the version from the paper 'New workflow predicts drug targets against SARS-CoV-2 via metabolic changes in infected cells' only contains the fluxes and metabolites of the medium 'Blood' that were relevant for the study.
In the case of `L-Cysteine` and `L-Cystine`, the identifiers were incorrect, and thus the flux could not be mapped via the identifier. In this case, I mapped the fluxes according to the provided full name.
For `Sorbitol`, we currently have one identifier for `L-Sorbitol` and one for `D-Sorbitol`. Hence, I mapped the flux for `Sorbitol` to both identifiers.
All other identifiers and full names could be directly mapped to exactly one identifier already in the database.
In/Out functions for the database should work now, @GwennyGit.
Additionally, `format_substance_table` was added to allow reformatting of the substance table. The format `growth_old` is a first attempt at creating a table that works with the current growth module, but it has yet to be tested.
To produce an entry for the documentation, I would suggest using the 'documentation' type in `load_substance_table_from_db`, but that is currently only an idea.
NOTE: One reformatting step is still missing for a working conversion to the growth module, but I am on it. -> DONE. Suggestions and wishes to improve/extend the formatting of the output tables are welcome.
Added a writer for reST tables for the medium table to the medium class, which can now be called by using `export_to_file` with the type `rst` or `docs`.
@GwennyGit - please check whether the table is how you wanted it and whether the default sizes fit or should be adjusted.
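A hedged usage sketch; the exact signature of `export_to_file` is assumed here:

```python
# Export the medium table as a reST table for the documentation.
medium.export_to_file(filename='medium_table.rst', type='rst')
```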
All listed tasks are done — closing issue.
Feature request

As refineGEMs already has a database to store the SBO annotation tables for the module SBO annotation, the idea would be to expand this database to also contain the files necessary for other modules, thus reducing the accumulation of files by storing everything the modules need in one place. As @famosab already suggested creating a submodule `database.py` in issue #43, this module could also be used to initialise and load the internal refineGEMs database.

Required steps:

- Move the database into the `data` folder & create the submodule `databases.py`
- `databases.py` should contain the following functions:
  - Functions to connect `modelseed`, `growth` and `gapfill` to the database
  - Functions for `io`:
    - Load a table from `data`
    - Add a table to `data` -> For development/maintenance
- `load_medium_custom` could be changed or left as is? 🤔
- Change `growth.py` to use either a user-specified medium or a medium from `data.db`? 🤔
- Rename the table `media` to `medium` & change the column name `medium` to `name`
  - Add a column `reference`
- Add a table `substance` -> Contains the columns `id | substance | formula`
  - Add the `formula` for each substance
- Add a table `substance2db` -> Contains the columns `substance_id | db_id | db_type` -> More rows for more identifier namespaces can be added
- Rename the table `media_compositions` to `medium2substance` & rearrange it to fit the other media-related tables (`BiGG` & `substance`):
  - Change `medium_id` to be the first column
  - Change `substance_id` to refer to the `id` in `substance` & move it to the second column
  - Add a column `flux` to enable growth analyses with per-medium-defined fluxes
  - Add a column `origin` to link the substances to their medium-specific names / derived compound names
  - Rename `origin` to `source`
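For concreteness, a hedged sketch of how the proposed media-related tables could be created; only the table and column names come from the list above, while the column types and the SQLite set-up are assumptions:

```python
import sqlite3

# Proposed schema; types are assumed, names follow the list above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS medium (
    id        INTEGER PRIMARY KEY,
    name      TEXT,
    reference TEXT
);
CREATE TABLE IF NOT EXISTS substance (
    id        INTEGER PRIMARY KEY,
    substance TEXT,
    formula   TEXT
);
CREATE TABLE IF NOT EXISTS substance2db (
    substance_id INTEGER REFERENCES substance(id),
    db_id        TEXT,
    db_type      TEXT  -- e.g. 'BiGG', 'KEGG', 'SEED', 'MetaNetX'
);
CREATE TABLE IF NOT EXISTS medium2substance (
    medium_id    INTEGER REFERENCES medium(id),
    substance_id INTEGER REFERENCES substance(id),
    flux         REAL,
    source       TEXT
);
"""

with sqlite3.connect('data.db') as con:
    con.executescript(SCHEMA)
```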