Complex annotation - Githubissues

cheng-yu-zhang commented 2 years ago

Main improvements in this PR:

Manually checked all 209 complex annotations in yeast8.5 against UniProt, SGD and Complex Portal. I applied "addDBNewGeneAnnotation.m" to correct 45 complex annotations that were wrong or incomplete.

The corrected annotations are in the file "databasenewGPR.tsv".
The corresponding reasons are in "databasenewGPR_proof.tsv".
New genes are detailed in "DBnewRxnsGenes.tsv".
The result of the gene essentiality analysis remains the same, 0.8980.
The latest complex annotation downloaded from Complex Portal is in the file "Yeast_complex_portal_2022.tsv".
~~The explanation is in file "explanation.docx"~~

Explanation

Yeast_complex_portal_2022.tsv is the latest complex information downloaded from Complex Portal. This file and the Complex Portal website are the most important references; UniProt and SGD are used as supplements.

First, compare the file with yeast-GEM to find each complex annotation (A) in the file and its counterpart (B) in yeast-GEM.
If A contains more complexes than B, add the extra complexes, e.g. “r_0831”.
For the same complex in A and B, if A contains more subunits than B, add the extra subunits, e.g. “r_3216”.
For more complicated situations, e.g. “r_0963”, “r_0263”, “r_0886”, “r_0021”, UniProt and SGD are consulted to determine whether a single subunit can catalyse the reaction alone and whether a given subunit is necessary.
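
The first two rules above can be sketched roughly as follows. This is only an illustration in Python, not the MATLAB code used in this PR: the function name `curate_complexes`, the placeholder gene IDs and the matching rule (a Complex Portal complex counts as the counterpart of a model complex when it contains all of its subunits) are assumptions made for the example.

```python
def curate_complexes(portal, model):
    """Compare the complexes of one reaction and propose updates.

    portal, model: sets of frozensets, each frozenset holding the subunit
    gene IDs of one complex (A = Complex Portal, B = yeast-GEM).
    Returns (updated complexes, list of proposed changes).
    """
    updated = set()
    changes = []
    unmatched = set(portal)
    for b in sorted(model, key=sorted):
        # Assumption: A is the counterpart of B if it contains all subunits of B.
        match = next((a for a in sorted(unmatched, key=sorted) if b <= a), None)
        if match is None:
            # No counterpart found: keep B and flag it for manual review.
            updated.add(b)
            changes.append(("review", b))
            continue
        unmatched.discard(match)
        updated.add(match)
        if match != b:
            # Same complex, but A lists more subunits than B.
            changes.append(("add_subunits", match - b))
    for a in sorted(unmatched, key=sorted):
        # A contains a complex that B lacks entirely.
        updated.add(a)
        changes.append(("add_complex", a))
    return updated, changes


# Placeholder gene IDs, not real annotations.
portal = {frozenset({"GENE_A", "GENE_B", "GENE_C"}), frozenset({"GENE_D", "GENE_E"})}
model = {frozenset({"GENE_A", "GENE_B"})}
updated, changes = curate_complexes(portal, model)
```

Cases flagged as "review" correspond to the more complicated situations, which still need a manual look at UniProt and SGD.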

I hereby confirm that I have:

[x] Tested my code with all requirements for running the model
[x] Selected develop as a target branch (top left drop-down menu)
[x] If needed, asked first in the Gitter chat room about this PR

hongzhonglu commented 2 years ago

@cheng-yu-zhang For each pull request, please summarize the detailed work that you have done so that it will be easier for other people to review it.

cheng-yu-zhang commented 2 years ago

@hongzhonglu I have added more details to the comments.

feiranl commented 2 years ago

Hi @cheng-yu-zhang, Thanks for this update! Nice work!

The growth test for the updated model gives essentially the same result as the model in the develop branch. The accuracy of the gene essentiality test also remains the same (0.89). However, two genes, YKR072C and YOR054C, are now false negatives (experimental_viable, model_inviable for deletion). Please double-check the reactions associated with these two genes.
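
To make the terms concrete, here is a minimal Python sketch of how an accuracy value and the false negatives could be derived from viability calls. It is not the logic of code/modelTests/essentialGenes.m, whose exact metric may differ. The function name, the toy viability values and the placeholder genes `GENE_X` and `GENE_Y` are assumptions; only YKR072C and YOR054C come from this thread.

```python
def essentiality_summary(experimental_viable, model_viable):
    """Both arguments map gene ID -> True if the deletion strain is viable."""
    genes = sorted(experimental_viable.keys() & model_viable.keys())
    agree = sum(experimental_viable[g] == model_viable[g] for g in genes)
    accuracy = agree / len(genes)
    # False negative as used in this thread:
    # viable in the experiment, inviable in the model.
    false_negatives = [
        g for g in genes if experimental_viable[g] and not model_viable[g]
    ]
    return accuracy, false_negatives


# Toy data: GENE_X and GENE_Y are placeholders, not real results.
experimental = {"YKR072C": True, "YOR054C": True, "GENE_X": False, "GENE_Y": True}
model = {"YKR072C": False, "YOR054C": False, "GENE_X": False, "GENE_Y": True}
accuracy, false_negatives = essentiality_summary(experimental, model)
```

Listing the false negatives next to the overall accuracy shows changes in individual genes even when the summary number stays the same.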

You mentioned you added 7 new genes, but according to the README file, the gene number has been changed from 1150 to 1161. Please check this.

It would be better to have a literature or database reference for every change so that we can trace each change back to its annotation. This could either be an extra column in "databasenewGPR.tsv" or be summarized as a table here (see below for an example). It would improve the transparency of the model curation. @edkerk @hongzhonglu, what do you think?

For example:

List of genes removed in this version:

Genes	Related reactions	Reference
YGL119W	fill this	fill this
YGR147C	fill this	fill this

List of genes added in this version:

Genes	Related reactions	Reference
YPR165W
YFR049W
YOR253W
YGR038W
YLR350W
YBR128C
YLR211C
YLR360W
YPL120W
YFR021W
YNL054W
YGR106C
YPR170W-B

List of genes modified in this version:

Genes	Related reactions	Reference
gene
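
The lists above name 13 added and 2 removed genes, a net change of 11, which is consistent with the gene count going from 1150 to 1161. A small Python sketch of such a comparison between two gene lists is given here. The helper `diff_genes` and the `UNCHANGED_GENE` placeholder are assumptions for illustration; the gene IDs are copied from the lists above.

```python
def diff_genes(old_genes, new_genes):
    """Return (added, removed) gene IDs between two model versions, sorted."""
    old, new = set(old_genes), set(new_genes)
    return sorted(new - old), sorted(old - new)


# Gene IDs copied from the lists in this comment.
REMOVED = ["YGL119W", "YGR147C"]
ADDED = [
    "YPR165W", "YFR049W", "YOR253W", "YGR038W", "YLR350W", "YBR128C", "YLR211C",
    "YLR360W", "YPL120W", "YFR021W", "YNL054W", "YGR106C", "YPR170W-B",
]

# Hypothetical stand-ins for the full gene lists of both versions:
# only the changed genes plus one unchanged placeholder are included.
old_genes = ["UNCHANGED_GENE"] + REMOVED
new_genes = ["UNCHANGED_GENE"] + ADDED

added, removed = diff_genes(old_genes, new_genes)
net_change = len(added) - len(removed)
print(len(added), len(removed), net_change)  # prints: 13 2 11
```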

edkerk commented 2 years ago

@feiranl There should indeed be an explanation of why these curations were performed. The PR text mentions that these were manually curated by looking at different databases, but which database is then suggesting which change? Do the databases agree? Is there a conflict? Also some genes are removed, how confident are we of this?

I have rebased this PR onto the latest develop branch, so that the model files can be generated. I also refactored the code to use only RAVEN functions, following #301.

Instead of modifying existing files that were used for previous curations (databasenewGPR.tsv), it is better to make a dedicated file for this particular curation. See for instance #300 and #304, where separate folders with those files are made (here just 1 file would be sufficient).

cheng-yu-zhang commented 2 years ago

@edkerk Instead of making a new file "DBnewRxnsGenes.tsv", which details the new genes, could I add another file, maybe named "databasenewGPR_proof.tsv", to explain why these curations were performed? For example:

rxnID_yeast_model	genes_yeast_model	final_GPR	reference
r_0005	YGR032W or YMR306W	YMR306W	web link or paper
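
A file in this layout could be checked automatically so that no curated reaction is left without a reference. The following Python sketch assumes the four tab-separated columns shown above. The function name and the second example row (`r_9999` with placeholder genes) are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["rxnID_yeast_model", "genes_yeast_model", "final_GPR", "reference"]


def rows_missing_reference(tsv_text):
    """Return the reaction IDs of rows whose reference column is empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    # Short rows give None for the absent fields, hence the `or ""`.
    return [
        row["rxnID_yeast_model"]
        for row in reader
        if not (row["reference"] or "").strip()
    ]


example = "\n".join([
    "\t".join(REQUIRED_COLUMNS),
    "\t".join(["r_0005", "YGR032W or YMR306W", "YMR306W", "web link or paper"]),
    # Hypothetical row without a reference, to show what gets flagged.
    "\t".join(["r_9999", "GENE_A", "GENE_A and GENE_B", ""]),
])
unreferenced = rows_missing_reference(example)
```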

edkerk commented 2 years ago

@cheng-yu-zhang

I have done some refactoring of your data and code, as two of the files were mostly duplicates.
While you give detailed links to the webpages where each complex is described, it would be good to still have a higher-level explanation of what curation was done, similar to what one would write for a paper or report. This can be put in the PR message.
Please see https://github.com/SysBioChalmers/yeast-GEM/pull/306#issuecomment-1135181716 for a comment on what should be in the refseq column of the gene metadata: not the nucleotide sequence, but a nucleotide NCBI identifier.
Can you show results from the gene essentiality analysis? There is a function for this: code/modelTests/essentialGenes.m

feiranl commented 2 years ago

Could you also run the growth tests? These normally run successfully, but it is good to make sure that we have a functional model. @hongzhonglu @edkerk @cheng-yu-zhang I think maybe it is time to have some more tests after each update to ensure the quality. Now we have essentialGenes and growth, but maybe we can add a separate flux check based on C13 data? In that case, we would know whether we are making the flux predictions better, or at least not worse. What do you think?
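
One way such a flux check might look is sketched here in Python, purely as an illustration. The error metric (mean relative error over the measured reactions), the function names, the tolerance and all flux values are assumptions. No such test exists in the repository according to this thread.

```python
def mean_relative_error(measured, predicted):
    """Mean of |predicted - measured| / |measured| over shared, non-zero fluxes."""
    shared = [r for r in measured if r in predicted and measured[r] != 0]
    if not shared:
        raise ValueError("no comparable reactions")
    total = sum(abs(predicted[r] - measured[r]) / abs(measured[r]) for r in shared)
    return total / len(shared)


def flux_check(measured, old_predicted, new_predicted, tol=1e-9):
    """Pass if the new model fits the measured fluxes at least as well as the old."""
    old_error = mean_relative_error(measured, old_predicted)
    new_error = mean_relative_error(measured, new_predicted)
    return new_error <= old_error + tol


# Placeholder reaction IDs and flux values, not real C13 data.
measured = {"rxn_a": 10.0, "rxn_b": 5.0}
old_predicted = {"rxn_a": 8.0, "rxn_b": 5.0}
new_predicted = {"rxn_a": 9.0, "rxn_b": 5.0}
passed = flux_check(measured, old_predicted, new_predicted)
```

Comparing against the previous model version rather than a fixed threshold matches the goal stated above: predictions should get better, or at least not worse.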

hongzhonglu commented 2 years ago

It is a very nice suggestion. More tests will help make sure the model's prediction quality improves consistently. @cheng-yu-zhang @feiranl

SysBioChalmers / yeast-GEM

Complex annotation #305