ERT1 not in network database

dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.

http://dondi.github.io/GRNsight

BSD 3-Clause "New" or "Revised" License

ERT1 not in network database #1106

Open kdahlquist opened 3 months ago

kdahlquist commented 3 months ago

We are going to start using GRNsight in BIOL 367 this week. I was writing the protocol and had occasion to look up "ERT1" in the "Load from database" GRN. It was not found. Can we check to see that it is not in the network database? Also, will it be there in the 2024 update?

I then tried to look it up by its systematic name "YBR239C" and got an error saying that it did not conform to the naming convention it was expecting. But I was unable to reproduce this.

dondi commented 3 months ago

Let’s audit this—we can look at both the database directly and also the downloaded original files; the latter can be supplied by @ntran18

dondi commented 3 months ago

A quick look at our archived data indicates that ERT1 was in the original database load from fall 2021 but appears to have been dropped in spring 2022:

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=fall2021;
SET
postgres=> \dt
                 List of relations
  Schema  |        Name         | Type  |  Owner   
----------+---------------------+-------+----------
 fall2021 | degradation_rate    | table | postgres
 fall2021 | expression          | table | postgres
 fall2021 | expression_metadata | table | postgres
 fall2021 | gene                | table | postgres
 fall2021 | production_rate     | table | postgres
 fall2021 | ref                 | table | postgres
(6 rows)

postgres=> \d gene
                        Table "fall2021.gene"
     Column      |       Type        | Collation | Nullable | Default 
-----------------+-------------------+-----------+----------+---------
 gene_id         | character varying |           | not null | 
 display_gene_id | character varying |           |          | 
 species         | character varying |           |          | 
 taxon_id        | character varying |           | not null | 
Indexes:
    "gene_pkey" PRIMARY KEY, btree (gene_id, taxon_id)
Referenced by:
    TABLE "degradation_rate" CONSTRAINT "degradation_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "expression" CONSTRAINT "expression_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)
    TABLE "production_rate" CONSTRAINT "production_rate_gene_id_fkey" FOREIGN KEY (gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
 YBR239C | ERT1
(1 row)

postgres=> \dn
             List of schemas
             Name             |  Owner   
------------------------------+----------
 fall2021                     | postgres
 gene_expression              | postgres
 gene_regulatory_network      | postgres
 protein_protein_interactions | postgres
 public                       | postgres
 settings                     | postgres
 spring2022_network           | postgres
(7 rows)

postgres=> set search_path=spring2022_network;
SET
postgres=> \dt
                List of relations
       Schema       |  Name   | Type  |  Owner   
--------------------+---------+-------+----------
 spring2022_network | gene    | table | postgres
 spring2022_network | network | table | postgres
 spring2022_network | source  | table | postgres
(3 rows)

postgres=> select gene_id, display_gene_id from gene where gene_id='YBR239C' or display_gene_id='ERT1';
 gene_id | display_gene_id 
---------+-----------------
(0 rows)

dondi commented 3 months ago

We should check the original YeastMine downloads for the presence of ERT1 and proceed based on what we find. If ERT1 is in the downloads, then we have a lurking bug in our database scripts that prevents this gene from being included in the database; if we do not find ERT1 in the downloads, then this might be a YeastMine issue.

dondi commented 3 months ago

@ntran18 found an off-by-one column issue while investigating this one. It may or may not be related, but it should be fixed regardless. @kdahlquist will look at the YeastMine downloads to track down what might have happened with ERT1.

ntran18 commented 3 months ago

all_gene.csv: all the genes in the network table
protein.csv: all the proteins in the PPI table
gene.csv: all the genes in the PPI table

dondi commented 3 months ago

I searched the files, and ERT1 is present in gene.csv but not in all_gene.csv. Might this be a lead? Based on line count, all_gene.csv has 6514 records whereas gene.csv has 6715, which is 201 more.

If the database scripts are only loading genes from all_gene.csv, then this would account for why ERT1 (and potentially 200 more genes) is missing. Should we instead be loading the union of the gene files into the database?
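The comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes both files have a header row with a `gene_id` column (the real column names are not shown in this thread), and the inline rows are a tiny stand-in for the real files, built from gene IDs mentioned elsewhere in this issue.

```python
import csv
import io

# Tiny stand-ins for the network gene file and the PPI gene.csv.
# The real files have thousands of rows and may use other column names.
NETWORK_GENES_CSV = """gene_id,display_gene_id
YER087C-B,SBH1
YMR295C,GSR1
"""
PPI_GENES_CSV = """gene_id,display_gene_id
YER087C-B,SBH1
YBR239C,ERT1
"""


def read_gene_ids(text, id_column="gene_id"):
    """Return the set of gene IDs found in CSV text that has a header row."""
    return {row[id_column] for row in csv.DictReader(io.StringIO(text))}


network_ids = read_gene_ids(NETWORK_GENES_CSV)
ppi_ids = read_gene_ids(PPI_GENES_CSV)

# Genes that a load from the network file alone would miss.
only_in_ppi = ppi_ids - network_ids
# What a load of the union of both files would contain.
union_ids = network_ids | ppi_ids

print(sorted(only_in_ppi))
print(len(union_ids))
```

Running the same set difference on the real files would list every gene that is in gene.csv but absent from the network gene file, which is the group that ERT1 appears to belong to.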

dondi commented 3 months ago

A few questions are emerging based on this discovery; some may be answerable with a review of the code while others may need investigation of the data. The current BioDB class can do some of this as part of their final assignment:

How exactly are all_gene.csv and gene.csv derived and used by our database scripts? This can be answered via code inspection. More descriptive filenames would also help.
How feasible is it to revise our gene-loading code so that it either loads the union of these files or runs a fresh query that unconditionally downloads all genes? The latter is what we are actually after when loading genes into our database.
How does our app behave when node coloring is activated and the network contains a gene that has no expression data? This will require some database querying to identify such genes; then we can test them in the app.
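For the third question, one possible shape of the database query is an anti-join between the network gene table and the expression table. The sketch below uses an in-memory SQLite database purely so that it is self-contained; the production database is PostgreSQL, and the table names `network_gene` and `expression`, the `expression` value column, and the sample rows are assumptions for illustration. Only the `gene_id` and `taxon_id` key columns come from the schema shown earlier in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the network gene table and expression table.
    CREATE TABLE network_gene (gene_id TEXT, taxon_id TEXT);
    CREATE TABLE expression (gene_id TEXT, taxon_id TEXT, expression REAL);
    INSERT INTO network_gene VALUES ('YER087C-B', '559292'), ('YMR295C', '559292');
    INSERT INTO expression VALUES ('YER087C-B', '559292', 0.42);
""")

# LEFT JOIN keeps every network gene; rows with no expression match have
# NULLs on the expression side, which the WHERE clause selects.
rows = conn.execute("""
    SELECT g.gene_id
    FROM network_gene AS g
    LEFT JOIN expression AS e
      ON e.gene_id = g.gene_id AND e.taxon_id = g.taxon_id
    WHERE e.gene_id IS NULL
    ORDER BY g.gene_id
""").fetchall()

missing = [gene_id for (gene_id,) in rows]
print(missing)
conn.close()
```

The genes returned by such a query would be the candidates to load into the app and test with node coloring turned on.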

dondi commented 3 months ago

@ntran18 observes that the gene tables for GRNs and PPIs are distinct as of now. The GRN gene table appears to have an additional column over the PPI gene table. Ideally, we have a unified gene table that contains all of the genes in YeastMine, but in order to get there, we will need a better understanding of the current database code and content

It was also observed that the PPI database dropdown includes the 2024 import, which was not expected. This is another side issue to track down

ntran18 commented 2 months ago

I met with Dondi on Wednesday to discuss a solution to this problem. We decided to build a single union gene table and load a copy of it into each of the three databases (expression, network, and protein-protein interactions).

Network, expression, and protein-protein interactions might have different gene tables because they use different queries against the YeastMine API. To understand the cause, I need to research how the YeastMine API works. However, I was sick last week, so I couldn't make any updates this week except for fixing the off-by-one column issue.

dondi commented 2 months ago

With PR #1111 merged, we will need a database reload to test the off-by-one fix. @ntran18 can choose whether to try this first before moving on to the unioned gene tables or—because this will incur a reload of the 2024 data sets—whether to try a full load after the union itself is done

dondi commented 2 months ago

Intermediate plan: before going into the InterMine API logic, the gene tables can be unioned after the fact for now. Further analysis can be the next step.

ntran18 commented 2 months ago

The current script is able to load everything from a fresh start but not update the database. combine_all_genes.csv contains the logic to combine all genes from expression, network, and protein-protein interactions.

Currently, the pipeline to load the database from a fresh start is:
1/ Create schemas
2/ Load schema structures to the database
3/ Run preprocessing.py in expression-database/scripts
4/ Run generate_network.py in network-database/scripts
5/ Run generate_network.py in protein-protein-database/scripts
6/ After getting all genes from expression, network, and protein-protein interactions, run combine_all_genes.csv to create union-genes.csv
7/ Run loader.py to load everything into the database except settings and public

ntran18 commented 2 months ago

Here is the union gene table. union_genes.csv

kdahlquist commented 2 months ago

@ntran18 will post the union table and @kdahlquist will double-check it against the YeastMine Feature Type-->Genes Query to make sure that all 6511 genes are in our database. Our database has 7098 records in the union table. It's OK to have more genes, but we want to make sure that we have all 6511.
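The check described here is a one-directional comparison: every reference gene must be present, while extra genes are acceptable. A minimal sketch, assuming both sides have already been reduced to lists of systematic names (the function name and the sample IDs, including the `EXTRA-1` placeholder, are illustrative only):

```python
def compare_gene_sets(reference_ids, union_ids):
    """Report reference genes missing from the union table.

    Extra genes in the union table are allowed, so they are only counted.
    """
    reference, union = set(reference_ids), set(union_ids)
    return {
        "missing": sorted(reference - union),
        "extra_count": len(union - reference),
    }


report = compare_gene_sets(
    reference_ids=["YBR239C", "YER087C-B", "YMR295C"],
    union_ids=["YER087C-B", "YMR295C", "EXTRA-1"],
)
print(report)
```

An empty `missing` list would mean that the union table covers the whole reference gene list.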

ntran18 commented 2 months ago

I have to write scripts for updating the network and protein-protein databases, create union missing-genes and updating-gene tables, and create a file for loader-update.py that will update both the protein and network databases. I also have to update readME.md to document how to run it.

kdahlquist commented 2 months ago

I haven't yet done a comparison with the YeastMine data, but I did a visual inspection of the union table and found a number of issues:

There are actually 7097 records because field names are the first row.
There are 329 records that have "none" as the display name. They should all have display names. If they don't have their own standard name (display name), then the systematic name (gene ID) should be used as the display name.
There are 22 records that have an issue with the Gene ID (systematic name). I put notes in a notes field. Many of these have a "/" ID that needs to be either removed or separated. Others have other comments. I've attached the file. union_genes_2024-04-16_with-notes.xlsx
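Two of the issues listed above lend themselves to an automated pass: filling in "none" display names with the gene ID, and flagging gene IDs that contain a "/". The following is a sketch under those assumptions only. The function name and the note text are made up for illustration, and the fallback cannot tell whether SGD actually has a standard name for a gene, so it complements rather than replaces a comparison against YeastMine.

```python
def normalize_record(gene_id, display_gene_id):
    """Return (gene_id, display_gene_id, note) with the display-name fallback applied."""
    gene_id = gene_id.strip()
    display = (display_gene_id or "").strip()

    # A missing or "none" display name falls back to the systematic name.
    if display.lower() in ("", "none"):
        display = gene_id

    # IDs containing "/" need a human decision (remove or split), so only flag them.
    note = "gene ID contains '/'; needs manual review" if "/" in gene_id else ""
    return gene_id, display, note


print(normalize_record("YER087C-B", "SBH1"))
print(normalize_record("YMR295C", "None"))
print(normalize_record("ID1/ID2", "none"))  # placeholder ID, not a real gene
```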

I'll try to find time to do the YeastMine comparison later.

ntran18 commented 2 months ago

I did notice the "None" display names in both the network and protein-protein interactions databases. There is a chance that our production database also has None values for some display names.

ntran18 commented 2 months ago

I ran into some problems when creating the union gene table.

Many genes in the protein table have a None display name, while the same gene in the network table has a display name (e.g., YMR295C). In other cases, the protein table has a gene ID and display name that differ from each other, while the network table has a display name identical to the gene ID (e.g., SBH1). I don't know which one is correct.

kdahlquist commented 2 months ago

The gene id is equivalent to the systematic name in SGD, and the display name id is equivalent to the standard name in SGD. So for the example of SBH1:

YER087C-B is the gene id (systematic name)
SBH1 is the display gene id (standard name)

In the history of SGD, all genes were given a systematic name because it literally encodes the position on the chromosome:

"Y" stands for "yeast"
"A-P" refers to the chromosome, where A is chromosome 1, B is chromosome 2, and so on through P for chromosome 16.
"L" or "R" refers to whether the gene location is to the left (short arm) or right (long arm) of the centromere.
"###" refers to the order the gene appears counting from the centromere outward.
"W" or "C" refers to which strand the gene is encoded on (stands for "Watson" or "Crick")
"-A", "-B", "-C" is optional. This occurs when they found a new gene in between two other genes that were previously annotated. They didn't want to renumber the genes, so they found a way to create a systematic name that would indicate it is in between two other genes.

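The naming scheme above maps naturally onto a regular expression. The sketch below is an approximation written for illustration, not the validation that GRNsight itself performs: it assumes 16 nuclear chromosomes lettered A through P, a three-digit position, and an optional single-letter suffix, and it deliberately does not cover features that fall outside this scheme (for example, mitochondrial ORFs use a different pattern).

```python
import re

# Y + chromosome letter + arm + 3-digit position + strand + optional suffix.
SYSTEMATIC_NAME = re.compile(
    r"Y(?P<chromosome>[A-P])(?P<arm>[LR])(?P<index>\d{3})"
    r"(?P<strand>[WC])(?P<suffix>-[A-Z])?"
)


def parse_systematic_name(name):
    """Split a systematic name into its parts, or return None if it does not match."""
    match = SYSTEMATIC_NAME.fullmatch(name)
    if match is None:
        return None
    return {
        "chromosome": ord(match["chromosome"]) - ord("A") + 1,  # A -> 1, B -> 2, ...
        "arm": match["arm"],
        "index": int(match["index"]),
        "strand": "Watson" if match["strand"] == "W" else "Crick",
        "suffix": match["suffix"],  # None when there is no "-A"/"-B" style suffix
    }


print(parse_systematic_name("YBR239C"))
print(parse_systematic_name("YER087C-B"))
print(parse_systematic_name("ERT1"))
```

Here `YBR239C` parses as chromosome 2, right arm, position 239, Crick strand, while the standard name `ERT1` is rejected because it does not follow the positional scheme.
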
Not all genes have a separate standard name (display gene id). Standard names are assigned by a committee to be (somewhat) meaningful names. They take the form of three letters and a number. There are some rare examples that do not follow this rule. For example, one standard name contains an apostrophe (') and another contains a comma (,).

If a gene does not have a standard name, then the systematic name becomes the standard name and they will be the same.

If you find an example where in one case both the gene id and display gene id are the same, they should both be systematic names. If one dataset was later than the other, the gene could have been assigned a standard name in the newer dataset.
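The rule of thumb in the last two paragraphs can be expressed as a small reconciliation helper for the case where two tables disagree about a gene's display name. This is a sketch with a hypothetical function name, and it only automates the heuristic: when the sources give two different standard names it flags the record for review instead of guessing.

```python
def reconcile_display_name(gene_id, candidates):
    """Pick a display name from several sources.

    Returns (display_name, needs_review). A value that differs from the gene ID
    is treated as a standard name and preferred over None or the gene ID itself.
    """
    cleaned = [c.strip() for c in candidates if c and c.strip().lower() != "none"]
    standard_names = {c for c in cleaned if c != gene_id}

    if len(standard_names) == 1:
        return standard_names.pop(), False
    if not standard_names:
        # No source has a standard name, so the systematic name doubles as one.
        return gene_id, False
    # Sources disagree on the standard name: keep the gene ID for now and flag it.
    return gene_id, True


print(reconcile_display_name("YER087C-B", ["SBH1", "YER087C-B"]))
print(reconcile_display_name("YMR295C", [None, "GSR1"]))
```

Any record returned with `needs_review` set would still have to be checked by hand against the authoritative source.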

In all cases, SGD should be the final authority on which is correct. Individual genes can be looked up at www.yeastgenome.org. Alternately, we could pull down the entire list of genes from YeastMine and compare to what we have: https://yeastmine.yeastgenome.org/yeastmine/begin.do

When I look up YMR295C, SGD says it should have a display name of GSR1.

There should be no case where the display gene id is "none". I think the best thing to do to populate that list is to refer to a gene list from YeastMine to correct that.

dondi commented 2 months ago

We will table the full union work for after this semester due to what’s involved; we’ll explore the SQL UNION command in order to make the database do the heavy lifting and also remove duplicates automatically. The prerequisite for doing that, though, is to make sure that the GRN and PPI gene tables have been normalized to hold corresponding values (e.g., correct IDs).
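To illustrate why normalization has to come first, here is a small demonstration of `UNION` removing exact duplicates but keeping rows that differ in any column. It uses an in-memory SQLite database only to stay self-contained (the project database is PostgreSQL, where `UNION` deduplicates in the same way), and the table names `grn_gene` and `ppi_gene` plus the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grn_gene (gene_id TEXT, display_gene_id TEXT);
    CREATE TABLE ppi_gene (gene_id TEXT, display_gene_id TEXT);
    INSERT INTO grn_gene VALUES ('YER087C-B', 'SBH1'), ('YMR295C', 'GSR1');
    INSERT INTO ppi_gene VALUES ('YER087C-B', 'SBH1'), ('YBR239C', 'ERT1');
""")

UNION_QUERY = """
    SELECT gene_id, display_gene_id FROM grn_gene
    UNION
    SELECT gene_id, display_gene_id FROM ppi_gene
    ORDER BY gene_id, display_gene_id
"""

# Identical rows in both tables collapse into one: 4 input rows, 3 output rows.
normalized_rows = conn.execute(UNION_QUERY).fetchall()
print(normalized_rows)

# A row that is not normalized (missing display name) is NOT merged with its
# counterpart, so the same gene now appears twice in the union.
conn.execute("INSERT INTO ppi_gene VALUES ('YMR295C', NULL)")
unnormalized_rows = conn.execute(UNION_QUERY).fetchall()
print(len(unnormalized_rows))
conn.close()
```

The second query returns two rows for YMR295C, which is the kind of duplicate that normalizing the gene tables beforehand is meant to prevent.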

We also checked to see if the off-by-one fix needs to be deployed and it looks like it doesn’t, but that isn’t consistent with the commit history that we looked at— @dondi will look into how this file is used in order to get a conclusive picture of the bug’s impact

dondi commented 2 months ago

So the immediate goal for now is to ensure that @ntran18’s code refactor is indeed functional and we can close the semester with that as the final accomplishment