dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.
http://dondi.github.io/GRNsight
BSD 3-Clause "New" or "Revised" License

yeast expression data for database #937

Closed kdahlquist closed 12 months ago

kdahlquist commented 2 years ago

Opening this issue for @ahmad00m to record tasks for preparing a new expression dataset for the back-end database.

We are going to use data from this paper: Apweiler, E., Sameith, K., Margaritis, T., Brabers, N., van de Pasch, L., Bakker, L. V., ... & Kemmeren, P. (2012). Yeast glucose pathways converge on the transcriptional regulation of trehalose biosynthesis. BMC Genomics, 13(1), 1-14. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-239

We will focus on the wild type data because that is the one for which they did the timecourse. @ahmad00m should begin by reading the paper. We will then work on analyzing the data and preparing it for the database insertion.

We are roughly going to follow the project outline from the Fall 2019 Biological Databases course. Of particular interest are: https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis and https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Quality_Assurance

kdahlquist commented 2 years ago

This is the spreadsheet we will work from: GSE33097_s257_final.xlsx

dondi commented 2 years ago

@ahmad00m has read the paper and had some clarification questions; next step is to look at the data to see whether this can be mapped to GRNsight.

dondi commented 2 years ago

@ahmad00m will need to do some clustering with stem as the next step; he will also need to work out how to structure the input file for stem.

ahmad00m commented 2 years ago

@ahmad00m installed stem and got the interface window running, but when trying to browse for data in stem, no files were found. I also typed in the file path manually to check whether that would fix the problem, but that didn't work either. The file was saved as tab-delimited text (.txt), but stem still did not find it.

[Image: Screenshot 2021-10-20 at 9 10 24 pm]

kdahlquist commented 2 years ago

@ahmad00m, is the file somewhere I can grab? I'll try it on my machine before the meeting today.

ahmad00m commented 2 years ago

I attached the file here. The formatting is a bit different because we couldn't do the ANOVA test, but I think it should be fine for stem. (WT1)_stem.txt

kdahlquist commented 2 years ago

I got it to work on my machine. We can troubleshoot during the meeting. However, stem doesn't work with replicates. You need to take the average of the replicate data for each time point and just load the average into stem, not the replicates.

Also note that since the data has a 0 timepoint, we can leave the default setting for normalization.
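The replicate averaging described above could be sketched in Python roughly as follows. This is only an illustration, not the procedure actually used for this dataset: the column naming scheme (`<timepoint>_rep<n>`), the `ID` column name, and the function name `average_replicates` are all assumptions, and the example values are made up.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_replicates(rows, id_column="ID"):
    """Collapse replicate columns into one averaged column per time point.

    Assumes (hypothetically) that replicate columns are named like
    "t30_rep1", "t30_rep2": everything before "_rep" is the time point.
    """
    averaged = []
    for row in rows:
        by_timepoint = defaultdict(list)
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # skip the ID column and missing measurements
            timepoint = column.split("_rep")[0]
            by_timepoint[timepoint].append(float(value))
        out = {id_column: row[id_column]}
        for timepoint, values in by_timepoint.items():
            out[timepoint] = mean(values)
        averaged.append(out)
    return averaged


# Tiny made-up example in tab-delimited form (not real expression data).
example = "ID\tt0_rep1\tt0_rep2\tt30_rep1\tt30_rep2\nYAL001C\t0.0\t0.0\t1.0\t2.0\n"
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
result = average_replicates(rows)
```

The averaged rows would then be written back out as tab-delimited text with one column per time point before loading into stem.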

kdahlquist commented 2 years ago

We are also going to need to standardize the IDs in this file. There is a mix of standard names, systematic names, and internal SGD IDs in the file. Yeastract has a tool here: http://www.yeastract.com/formorftogene.php, although I'm not sure it will do the SGD IDs.

kdahlquist commented 2 years ago

We have discovered that there are duplicate and triplicate rows in the data that need to be removed. Some rows are unique, some are duplicated, and some are triplicated. We need to get rid of the redundant rows, and then standardize the IDs.

Also, when I opened the .txt file in Excel, it converted some of the IDs to dates.

dondi commented 2 years ago

@ahmad00m has put together a Python script that flags duplicates and we have decided to keep the duplicate row that has the systematic ID. All other rows can be discarded.
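One way to express the "keep the row that has the systematic ID" rule is sketched below. It is not the script @ahmad00m wrote. It assumes that redundant rows are rows whose expression values are identical and differ only in the ID, the regular expression is a rough stand-in for "looks like a systematic name", and the function name `drop_redundant_rows` and the example rows are hypothetical.

```python
import re
from collections import defaultdict

# Rough stand-in for "looks like a systematic name" (an assumption for this sketch).
SYSTEMATIC = re.compile(r"Y[A-P][LR][0-9]{3}[WC](-[A-Z])?")


def drop_redundant_rows(rows):
    """rows: list of (gene_id, expression_values) tuples.

    Rows sharing identical expression values are treated as copies of one
    measurement. Within each group, keep every distinct systematic ID if
    there is one; otherwise keep the first ID seen.
    """
    groups = defaultdict(list)
    for gene_id, values in rows:
        groups[tuple(values)].append(gene_id)

    kept = []
    for values, ids in groups.items():
        unique_ids = list(dict.fromkeys(ids))  # drop exact repeats, keep order
        systematic = [i for i in unique_ids if SYSTEMATIC.fullmatch(i)]
        for gene_id in systematic or unique_ids[:1]:
            kept.append((gene_id, list(values)))
    return kept


# Made-up rows: the first two are the same measurement under different IDs.
rows = [
    ("TFC3", [0.1, 0.5]),
    ("YAL001C", [0.1, 0.5]),
    ("S000000001", [0.3, -0.2]),
]
deduplicated = drop_redundant_rows(rows)
```

Groups with no systematic ID at all are kept under their first ID rather than dropped, so that they can be mapped to a systematic name in a later step.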

ahmad00m commented 2 years ago

@ahmad00m has finished tidying the file, but when I try to find the standard name of each gene, I get a smaller number than for the systematic names, as if some genes are lost in the process. I have 4,926 genes for systematic names and I get 4,856 genes for the standard name. I was wondering what I would need to do to resolve this problem.

kdahlquist commented 2 years ago

If a standard name does not exist for a given systematic name, then use the systematic name for the standard name in that instance. I would spot check a few of these in SGD just to make sure that this is what's happening.
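The fallback rule above can be sketched as a small lookup. This is a minimal sketch under assumptions: the function name `standard_name_for` and the lookup table are hypothetical, and the table entries are for demonstration only.

```python
def standard_name_for(systematic_name, standard_names):
    """Return the standard name, or the systematic name when none exists."""
    name = standard_names.get(systematic_name, "")
    return name.strip() or systematic_name


# Illustrative lookup table: one named gene, one blank entry, one gene absent.
lookup = {"YAL001C": "TFC3", "YAL064W-B": ""}
genes = ["YAL001C", "YAL064W-B", "YAL069W"]
names = [standard_name_for(g, lookup) for g in genes]

# With the fallback in place, every systematic name yields exactly one name,
# so the two counts can no longer drift apart.
assert len(names) == len(genes)
```

Treating a blank entry and a missing entry the same way means the output always has one name per input gene, which is the property the count comparison is checking.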

ahmad00m commented 2 years ago

I have fixed the problem. I believe there was an issue with my code. Now everything works and the numbers for both the standard name and the systematic name match. Also, I have attached the Excel file and the final version of the clean data in .txt format. GSE33097_s257_final-4-original.xlsx FINALfile.txt

dondi commented 2 years ago

@ahmad00m will start committing the duplication removal script to https://github.com/dondi/GRNsight-archive under a folder within a scripts folder. In addition to the script, @ahmad00m can also commit a README.md that describes what the script does. Recommended structure is as follows:

dondi commented 2 years ago

Systematic name regex (so far): Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?

(when using grep, the parentheses and question mark need to be escaped: Y[A-P][LR][0-9][0-9][0-9][WwCc]\(-[A-Z]\)\?)
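For comparison, the same pattern can be used from Python's standard `re` module, where the parentheses and question mark are written unescaped. The sketch below is illustrative only: the helper name `is_systematic_name` and the candidate list are assumptions, and `fullmatch` is used so that an ID matches only when the whole string fits the pattern.

```python
import re

# Pattern copied from the comment above; in Python's `re` syntax the
# parentheses and question mark are used unescaped.
SYSTEMATIC_NAME = re.compile(r"Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?")


def is_systematic_name(gene_id):
    """True only if the whole ID matches the pattern (not just a substring)."""
    return SYSTEMATIC_NAME.fullmatch(gene_id) is not None


# Example IDs of several shapes, chosen for illustration.
candidates = ["YLR391W", "YLR391W-A", "TFC3", "S000000001", "Q0045"]
matches = [g for g in candidates if is_systematic_name(g)]
```

IDs of any other shape (here `TFC3`, `S000000001`, and `Q0045`) are not matched by the pattern as posted, so filtering on it alone will drop those rows.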

ahmad00m commented 2 years ago

@ahmad00m saved the duplication scripts in the GRNsight-archive repository. I tried to clean the file using grep and then remove duplicates, which resulted in 4,848 genes. My guess for the low number of genes is that some of the gene expression data might use names other than the systematic names. I checked for that by running the duplicate_remover code on the original file, and there were 5,569 genes, which suggests some data has been lost by only selecting for systematic names. I input the standard names for those genes using the http://www.yeastract.com/formorftogene.php website and then ran stem. I believe the next step for me would be to figure out a way to find the systematic names for the entries listed under other names in the file, and to write more sophisticated code for selecting systematic names from the file.

Here is the file for systematic names and no duplicates standard_system_clean_file.txt

Here is the file for no duplicate genes testfordupgene.txt

Also here is a picture of the stem result.

[Image: Screenshot 2021-11-11 at 7 34 38 am]

dondi commented 2 years ago

The next step here is to get more specific information on the duplicated expression data rows:

Possible approaches:

ahmad00m commented 2 years ago

Just a quick question. If there are 2 duplicates, both with systematic names, e.g. YLR391W and YLR391W-A, does it matter which one to keep?

kdahlquist commented 2 years ago

Those are two different genes, keep both.

ahmad00m commented 2 years ago

@ahmad00m wrote code to keep the preferred IDs, but there is an issue with returning them. I'm hoping to resolve this issue in the meeting. Also, I determined that there are 664 unique genes with SGD IDs which need to be changed to the systematic names (they are unique values with only an SGD ID). So, I need to find a way to change these IDs to systematic IDs.

ahmad00m commented 2 years ago

@ahmad00m contacted the SGD help desk and was pointed to a website that provides all the information (SGD ID, systematic name, standard name) in tab-delimited text format. Here is the link to the website: YeastMine. Here is the file that contains all the IDs: results.txt. However, I found some gaps for standard names in the file. For now, I can use this file to replace the SGD IDs with their systematic names by modifying the script. If there is anything else I need to do, please let me know.

kdahlquist commented 2 years ago

@ahmad00m, please make a list of IDs that are the exceptions you mentioned. I'd like to investigate what the problems are for those IDs that would result in gaps.

The "results.txt" file should be committed to the GRNsight archive. It should be given a more descriptive name. Put it into its own directory named something like "source data" (I don't remember our naming conventions off the top of my head), and then make a README.md file in that directory that describes how the data were obtained, the date, and what is in the file for future reference. The data are a snapshot in time, so if we need to do this again, we should have instructions on what to do.

ahmad00m commented 2 years ago

The attached file contains the IDs with gaps for their standard names. There are 1357 of them. filewithnoGaps.txt

kdahlquist commented 2 years ago

Wow, 1357 seems like a lot. Is this the total from YeastMine or the total from your dataset (or both)? I did a spot check on the first 10 IDs in the list by looking them up directly on the SGD webpage. Some of them are designated "dubious", as in unlikely to code for a real protein, but some of them were simply "uncharacterized", meaning that no one has studied them yet. A couple had "reserved" names, which is on the way to getting a real "standard" name.

I think we can safely copy over the systematic name to be the standard name for these.

I'm not going to be able to make the meeting tomorrow. Would @ahmad00m make a summary of what he has done to the dataset? I'm looking for something like:

ahmad00m commented 2 years ago

This is the total from YeastMine. I did not use my dataset yet. I believe I can convert all the SGD IDs to systematic names and then use the systematic names to find standard names on the YeastMine web page. Then, if the standard name doesn't exist, I can use the systematic name as the standard name, as you suggested.

@ahmad00m will make a summary of what processes I have done on the dataset.

dondi commented 2 years ago

@ahmad00m and @dondi reviewed the status of this issue at the meeting and first resolved a few bugs and technical questions in his current code. @dondi also sketched out how @ahmad00m can use the ID-mapping file that he acquired from SGD to identify the systematic ID and/or standard name given an SGD ID. (this file has also been uploaded to GRNsight-archive)

@ahmad00m will work on these bug fixes and post a follow-up message with the summary requested by @kdahlquist
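A lookup against the ID-mapping file might look something like the sketch below. It is not the code discussed at the meeting. It assumes the file is tab-delimited with columns in the order SGD ID, systematic name, standard name, and that SGD IDs can be recognized by an `S0` prefix; the function names and the example rows are hypothetical.

```python
import csv
import io


def load_id_map(handle):
    """Read a tab-delimited file of (SGD ID, systematic name, standard name)."""
    id_map = {}
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        sgd_id, systematic = record[0].strip(), record[1].strip()
        if sgd_id and systematic:
            id_map[sgd_id] = systematic
    return id_map


def replace_sgd_ids(gene_ids, id_map):
    """Swap SGD IDs for systematic names; collect the ones with no mapping."""
    replaced, unmapped = [], []
    for gene_id in gene_ids:
        if gene_id.startswith("S0"):  # assumed shape of an SGD ID
            if gene_id in id_map:
                replaced.append(id_map[gene_id])
            else:
                replaced.append(gene_id)  # leave as-is for manual lookup
                unmapped.append(gene_id)
        else:
            replaced.append(gene_id)
    return replaced, unmapped


# Made-up mapping rows: the second SGD ID has no systematic name.
mapping_text = "S000000001\tYAL001C\tTFC3\nS000000002\t\t\n"
id_map = load_id_map(io.StringIO(mapping_text))
replaced, unmapped = replace_sgd_ids(["S000000001", "S000000002", "YLR391W"], id_map)
```

Returning the unmapped IDs separately gives a list that can be checked by hand in SGD instead of silently losing those rows.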

ahmad00m commented 2 years ago

@ahmad00m finished writing the code to replace SGD IDs with systematic names. However, I found that 45 of the 696 SGD IDs have no equivalent systematic name in the file obtained from the SGD help desk. So, I can look up these IDs and change them manually. After looking up these 45 SGD IDs, the file will be cleaned and ready to be used in stem.

The summary is as follows:

ahmad00m commented 2 years ago

@ahmad00m finally cleaned the original file and found 5543 genes. Then, I found the standard names using Yeastract and used the systematic names for those that didn't have standard names. I have attached the final dataset below. Also, I tried running stem using the instructions from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_9, but I got an error saying "All genes filtered". Hopefully we can troubleshoot this during the meeting.

FINALUNIQUEIDS.txt

[Image: Screenshot 2022-01-18 at 11 53 20 am]

kdahlquist commented 2 years ago

@ahmad00m, there was a problem with the way you formatted the file.

Before we move onto the next step, you need to write up a protocol for all the steps you carried out to go from the original file to this one. I want to review that and follow the steps myself to make sure I can replicate your results.

After that, the next step would be to generate candidate gene regulatory networks using Yeastract. It looks like out of the 8 significant patterns, 4 are generally up before returning to baseline and 4 are generally down before returning to baseline. In terms of looking for networks, it might work to group the genes from the 4 up and 4 down clusters.

kdahlquist commented 2 years ago

Even though we don't need the standard names to run stem, we will want them. I noticed some odd standard names in the file:

ahmad00m commented 2 years ago

Here is the final summary of cleaning the data, including the documentation of the steps taken.

Here is the documentation of the steps taken to clean the expression data. Documentation_of_gene_expression_data.docx

Also, the code is ready to be pushed to GitHub. Should @ahmad00m upload the documentation and the code to GRNsight-archive?

Moreover, if @kdahlquist wants to confirm the steps, I can email the code before pushing it to GRNsight-archive.

Here is the final version of the file. (It is a bit different from the last one because this one includes the expression data for mitochondrial genes, which we decided to include.) Unique_systematic_ID.txt

kdahlquist commented 2 years ago

@ahmad00m , you can upload the code to the GRNsight-archive. If it needs to be modified in the future, that's OK. It is preferable to keep it in the repository. Since GitHub keeps track of all versions, it's better to keep it there as opposed to having the only copy be on your computer.

ahmad00m commented 2 years ago

@ahmad00m pushed the code to GRNsight-archive, so it can be accessed for testing.

ahmad00m commented 2 years ago

@ahmad00m ran STEM and saved the results. I also tried analysing the results and continued up until generating the regulation matrix in YEASTRACT; however, no matrix was created after a while, and I did not get any errors either. I hope to troubleshoot this during the meeting so I can continue with visualising the model with GRNsight and determine which one would be appropriate to pursue further for modeling.

ahmad00m commented 2 years ago

@ahmad00m updated the code and the documentation for replacing the IDs. I also added the original expression data to GRNsight-archive. Moreover, I tried to create the regulation matrix, but the website doesn't return any matrices, so I'm hoping to troubleshoot that during the meeting later today.

kdahlquist commented 2 years ago

Some notes from the 2/14/22 meeting:

ahmad00m commented 2 years ago

@ahmad00m finished creating the adjacency matrices for the first four significant profiles using YEASTRACT database. Also, the new documentation containing all the steps up until visualizing the GRN on GRNsight will soon be pushed to GRNsight-archive for review.

ahmad00m commented 2 years ago

Here is the link to the complete Documentation

dondi commented 2 years ago

Follow-up wrap-up comments:

dondi commented 2 years ago

Initial review of the documentation looks good; it will need a “validation test” where someone who is unfamiliar with the process tries to follow the instructions to reach the same result. Tentatively this looks like a good match for @ahmad00m to go over with @Sarronnn, minimizing intervention until they discover something that needs to be clarified in the documentation.

kdahlquist commented 12 months ago

Closing because it is complete and live in v6.0.7