yeast expression data for database

kdahlquist commented 2 years ago

Opening this issue for @ahmad00m to record tasks for preparing a new expression dataset for the back-end database.

We are going to use data from this paper: Apweiler, E., Sameith, K., Margaritis, T., Brabers, N., van de Pasch, L., Bakker, L. V., ... & Kemmeren, P. (2012). Yeast glucose pathways converge on the transcriptional regulation of trehalose biosynthesis. BMC genomics, 13(1), 1-14. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-239

We will focus on the wild type data because that is the one for which they did the timecourse. @ahmad00m should begin by reading the paper. We will then work on analyzing the data and preparing it for the database insertion.

We are roughly going to follow the project outline from the Fall 2019 Biological Databases course. Of particular interest are: https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis and https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Quality_Assurance

kdahlquist commented 2 years ago

This is the spreadsheet we will work from GSE33097_s257_final.xlsx

dondi commented 2 years ago

@ahmad00m has read the paper and had some clarification questions; next step is to look at the data to see whether this can be mapped to GRNsight.

dondi commented 2 years ago

@ahmad00m will need to do some clustering with stem as the next step; he will also look up how the file needs to be structured for stem.

ahmad00m commented 2 years ago

@ahmad00m installed stem and got the interface window running, but when trying to browse for the data file in stem, no files were found. Entering the file path manually did not fix the problem either. The file was saved as Tab-delimited text (.txt) but was still not found.

kdahlquist commented 2 years ago

@ahmad00m, is the file somewhere I can grab? I'll try it on my machine before the meeting today.

ahmad00m commented 2 years ago

I attached the file here. The formatting is a bit different because we couldn't do the ANOVA test, but I think it should be fine for stem. (WT1)_stem.txt

kdahlquist commented 2 years ago

I got it to work on my machine. We can troubleshoot during the meeting. However, stem doesn't work with replicates. You need to take the average of the replicate data for each time point and just load the average into stem, not the replicates.

Also note that since the data has a 0 timepoint, we can leave the default setting for normalization.
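A minimal sketch of the averaging step, in case it helps. It assumes (hypothetically) that the header labels look like `t0_rep1`, `t0_rep2`, and so on, with the text before the first underscore naming the time point; the real column labels in the spreadsheet may differ, so the grouping rule would need to be adapted.

```python
from collections import defaultdict
from statistics import mean


def average_replicates(header, rows):
    """Collapse replicate columns into one mean column per time point.

    Assumption: header[0] is the ID column and every other label has the
    hypothetical form "<timepoint>_<replicate>", e.g. "t0_rep1".
    """
    groups = defaultdict(list)  # time point -> list of column indexes
    for index, label in enumerate(header[1:], start=1):
        groups[label.split("_")[0]].append(index)

    timepoints = list(groups)  # dicts preserve first-seen order
    new_header = [header[0]] + timepoints

    new_rows = []
    for row in rows:
        averaged = []
        for timepoint in timepoints:
            # Skip empty cells so a missing replicate does not break the mean.
            values = [float(row[i]) for i in groups[timepoint] if row[i] != ""]
            averaged.append(mean(values) if values else "")
        new_rows.append([row[0]] + averaged)
    return new_header, new_rows
```

Only the averaged columns would then be written out for stem.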

kdahlquist commented 2 years ago

We are also going to need to standardize the IDs in this file. There is a mix of standard names, systematic names, and internal SGD IDs in the file. Yeastract has a tool here: http://www.yeastract.com/formorftogene.php, although I'm not sure it will do the SGD IDs.

kdahlquist commented 2 years ago

We have discovered that there are duplicate and triplicate rows in the data that need to be removed. Some rows are unique, some are duplicated, and some are triplicated. We need to get rid of the redundant rows, and then standardize the IDs.

Also, when I opened the .txt file in Excel, it converted some of the IDs to dates.

dondi commented 2 years ago

@ahmad00m has put together a Python script that flags duplicates and we have decided to keep the duplicate row that has the systematic ID. All other rows can be discarded.

ahmad00m commented 2 years ago

@ahmad00m has finished tidying the file, but when I try to find the standard name of each gene, I get fewer standard names than systematic names, as if some genes are lost in the process. I have 4,926 genes with systematic names but only 4,856 genes with standard names. What would I need to do to resolve this problem?

kdahlquist commented 2 years ago

If a standard name does not exist for a given systematic name, then use the systematic name for the standard name in that instance. I would spot check a few of these in SGD just to make sure that this is what's happening.
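Something like the following sketch would implement that fallback. The function name and the `standard_lookup` dictionary (systematic name to standard name, however it was obtained) are illustrative assumptions, not part of the existing script.

```python
def fill_standard_names(systematic_names, standard_lookup):
    """Pair each systematic name with a standard name.

    If the lookup has no entry (or a blank entry) for a gene, the
    systematic name is reused as the standard name, so the two lists
    always end up the same length.
    """
    pairs = []
    for systematic in systematic_names:
        standard = standard_lookup.get(systematic, "").strip()
        pairs.append((systematic, standard or systematic))
    return pairs
```

With this approach the count of standard names always matches the count of systematic names, which makes a mismatch easy to spot as a bug.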

ahmad00m commented 2 years ago

I have fixed the problem. I believe there was an issue with my code. Now everything works and the numbers for both the standard name and the systematic name match. Also, I have attached the excel file and the final version of the clean data in .txt format. GSE33097_s257_final-4-original.xlsx FINALfile.txt

dondi commented 2 years ago

@ahmad00m will start committing the duplication removal script to https://github.com/dondi/GRNsight-archive under a folder within a scripts folder. In addition to the script, @ahmad00m can also commit a README.md that describes what the script does. Recommended structure is as follows:

GRNsight-archive
- documents
- scripts
  - duplicate_expression_remover
    - wt_stemtest.py
    - (other files)
    - README.md

dondi commented 2 years ago

Systematic name regex (so far): Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?

(when using grep, the parentheses and question mark need to be escaped: Y[A-P][LR][0-9][0-9][0-9][WwCc]\(-[A-Z]\)\?)
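For use inside a Python script rather than grep, the same pattern can be applied with the standard `re` module. This is only a sketch; the helper name `is_systematic` is made up for illustration, and `fullmatch` is used so that longer strings that merely contain a systematic name are not accepted.

```python
import re

# Same pattern as above; [0-9]{3} is shorthand for three digit classes.
SYSTEMATIC_NAME = re.compile(r"Y[A-P][LR][0-9]{3}[WwCc](-[A-Z])?")


def is_systematic(identifier):
    """Return True if the whole identifier looks like a systematic name."""
    return SYSTEMATIC_NAME.fullmatch(identifier.strip()) is not None
```

Note that in Python the parentheses and question mark are not escaped, unlike in basic grep syntax.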

ahmad00m commented 2 years ago

@ahmad00m saved the duplicate-removal scripts to the GRNsight-archive repository. I tried cleaning the file with grep and then removing duplicates, which resulted in 4,848 genes. My guess for the low number of genes is that some of the gene expression data might use names other than the systematic names. I checked this by running the duplicate_remover code on the original file, which gave 5,569 genes; this suggests that some data is lost by selecting only for systematic names. I input the standard names for those genes using the http://www.yeastract.com/formorftogene.php website and then ran stem. I believe my next step is to figure out a way to find the systematic names for the rows that use other names, and to write more sophisticated code for selecting systematic names from the file.

Here is the file for systematic names and no duplicates standard_system_clean_file.txt

Here is the file for no duplicate genes testfordupgene.txt

Also here is a picture of the stem result.

dondi commented 2 years ago

The next step here is to get more specific information on the duplicated expression data rows:

Modify the duplicate remover so that it chooses to keep, preferentially:
- Rows with systematic ID
- Rows with standard name
- Rows with SGD ID only
Counts of the latter two rows should be determined, so that we know the amount of work involved in mapping the SGD ID-only rows to the systematic ID
Once this is done, we should then have a full non-duplicated file, all of which are keyed by systematic ID

Possible approaches:

Build a new list while preferentially tracking the IDs found
Build multiple lists depending on the matching ID
Build a dictionary using a tuple-ized version of the expression data as key where the value is the list of IDs under which that expression data was found
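A rough sketch of the dictionary approach, to make it concrete. The SGD ID pattern (`S` followed by nine digits) and all function names are assumptions for illustration; anything that is neither a systematic name nor an SGD ID is treated as a standard name.

```python
import re
from collections import defaultdict

SYSTEMATIC = re.compile(r"Y[A-P][LR][0-9]{3}[WwCc](-[A-Z])?")
SGD_ID = re.compile(r"S[0-9]{9}")  # assumed shape of an internal SGD ID


def id_rank(identifier):
    """Lower rank = more preferred: 0 systematic, 1 standard, 2 SGD ID."""
    if SYSTEMATIC.fullmatch(identifier):
        return 0
    if SGD_ID.fullmatch(identifier):
        return 2
    return 1


def group_by_expression(rows):
    """Map each tuple of expression values to the IDs that carry it."""
    groups = defaultdict(list)
    for identifier, *values in rows:
        groups[tuple(values)].append(identifier)
    return groups


def choose_ids(groups):
    """Keep the best-ranked ID(s) per expression tuple and count by rank."""
    kept = {}
    counts = {0: 0, 1: 0, 2: 0}
    for values, identifiers in groups.items():
        best = min(id_rank(i) for i in identifiers)
        counts[best] += 1
        for identifier in identifiers:
            if id_rank(identifier) == best:
                kept[identifier] = values
    return kept, counts
```

The `counts` dictionary gives the number of expression rows that end up keyed by standard name only (rank 1) or SGD ID only (rank 2). When several IDs tie at the best rank, this sketch keeps all of them rather than guessing which one to drop.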

ahmad00m commented 2 years ago

Just a quick question: if there are two duplicates that both have systematic names, e.g. YLR391W and YLR391W-A, does it matter which one I keep?

kdahlquist commented 2 years ago

Those are two different genes, keep both.

ahmad00m commented 2 years ago

@ahmad00m wrote code to keep the preferred IDs, but there is an issue with returning them. I'm hoping to resolve this issue in the meeting. Also, I determined that there are 664 unique genes with SGD IDs that need to be changed to systematic names (they are unique values with only an SGD ID). So, I need to find a way to change these IDs to systematic IDs.

ahmad00m commented 2 years ago

The script needs to be debugged to reduce the ID to a single one
Look at the 58 systematic names to check whether they are duplicates or the same gene and then write the correct ID
Ask the SGD help desk for a tool to convert the SGD ID to the systematic name

ahmad00m commented 2 years ago

@ahmad00m contacted SGD website help desk and I got a website that contains all the information (SGD ID, Systematic name, Standard name) in tab delimited text format. Here is the link to the website YeastMine. Here is the file that contains all the ID's: results.txt However, I found some gaps for standard names in the file. For now, I can use this file to replace the SGD ID's with their systematic names by modifying the script. If there is anything else I need to do please let me know.

kdahlquist commented 2 years ago

@ahmad00m, please make a list of IDs that are the exceptions you mentioned. I'd like to investigate what the problems are for those IDs that would result in gaps.

The "results.txt" file should be committed to the GRNsight archive. It should be given a more descriptive name. Put it into its own directory named something like "source data" (I don't remember our naming conventions off the top of my head), and then make a README.md file in that directory that describes how the data were obtained, the date, and what is in the file for future reference. The data are a snapshot in time, so if we need to do this again, we should have instructions on what to do.

ahmad00m commented 2 years ago

The file attached contains the ID's with gaps for their standard names. There are 1357 of them. filewithnoGaps.txt

kdahlquist commented 2 years ago

Wow, 1357 seems like a lot, is this the total from Yeastmine or the total from your dataset (or both)? I did a spot check on the first 10 IDs in the list by looking them up directly on the SGD webpage. Some of them are designated "dubious" as in unlikely to code for a real protein, but some of them were simply "uncharacterized" meaning that no one has studied them yet. A couple had "reserved" names which is on the way to getting a real "standard" name.

I think we can safely copy over the systematic name to be the standard name for these.

I'm not going to be able to make the meeting tomorrow. Would @ahmad00m make a summary of what he has done to the dataset? I'm looking for something like:

total records in the unprocessed dataset
total of unique records in processed dataset
total in processed dataset that did not have standard name and had to use the systematic name for that.

ahmad00m commented 2 years ago

This is the total from Yeastmine. I have not used my dataset yet. I believe I can convert all the SGD IDs to systematic names and then use the systematic names to find standard names on the Yeastmine web page. Then, if the standard name doesn't exist, I can use the systematic name as the standard name, as you suggested.

@ahmad00m will make a summary of what processes I have done on the dataset.

dondi commented 2 years ago

@ahmad00m and @dondi reviewed the status of this issue at the meeting and first resolved a few bugs and technical questions in his current code. @dondi also sketched out how @ahmad00m can use the ID-mapping file that he acquired from SGD to identify the systematic ID and/or standard name given an SGD ID. (this file has also been uploaded to GRNsight-archive)

@ahmad00m will work on these bug fixes and post a follow-up message with the summary requested by @kdahlquist
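For reference, the ID-mapping idea might look roughly like the sketch below. It assumes the tab-delimited file has the columns SGD ID, systematic name, standard name in that order (as described earlier in this thread) and that SGD IDs look like `S` plus nine digits; the function names are hypothetical and the actual script may differ.

```python
import csv
import re

SGD_ID = re.compile(r"S[0-9]{9}")  # assumed shape of an internal SGD ID


def load_id_map(lines):
    """Build {SGD ID: (systematic name, standard name)} from tab-delimited lines.

    A blank standard name falls back to the systematic name.
    """
    mapping = {}
    for record in csv.reader(lines, delimiter="\t"):
        if len(record) < 2:
            continue
        sgd_id, systematic = record[0].strip(), record[1].strip()
        standard = record[2].strip() if len(record) > 2 else ""
        if sgd_id and systematic:
            mapping[sgd_id] = (systematic, standard or systematic)
    return mapping


def replace_sgd_ids(identifiers, mapping):
    """Swap SGD IDs for systematic names; report SGD IDs with no mapping."""
    resolved, unmapped = [], []
    for identifier in identifiers:
        if identifier in mapping:
            resolved.append(mapping[identifier][0])
        else:
            resolved.append(identifier)  # leave untouched
            if SGD_ID.fullmatch(identifier):
                unmapped.append(identifier)
    return resolved, unmapped
```

`load_id_map` accepts any iterable of lines, so an open file handle can be passed directly. The `unmapped` list is the set of IDs that would need to be looked up by hand.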

ahmad00m commented 2 years ago

@ahmad00m finished writing the code to replace SGD IDs with systematic names. However, I found that 45 of the 696 SGD IDs have no equivalent systematic name in the file obtained from the SGD help desk. I can look up these IDs and change them manually. Once these 45 SGD IDs have been looked up, the file will be clean and ready to use in stem.

The summary is as follows:

total records in the unprocessed dataset: 12,983
total of unique records in processed dataset: Around 5,599
total in processed dataset that did not have standard name and had to use the systematic name for that: I believe I can find all the standard names from the Yeastract website. (To avoid further confusion, I will not use the file obtained from the SGD help desk to convert systematic names to standard names, but will instead obtain the IDs directly from the Yeastract website.)

ahmad00m commented 2 years ago

@ahmad00m finally cleaned the original file and found 5543 genes. Then, I found the standard names using Yeastract and used the systematic names for those that didn't have standard names. I have attached the final dataset below. Also, I tried running stem using the instructions from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_9, but I got an error saying "All genes filtered". Hopefully we can troubleshoot this during the meeting.

FINALUNIQUEIDS.txt

kdahlquist commented 2 years ago

@ahmad00m, there was a problem with the way you formatted the file.

"SPOT" needs to just be an index of 1 to 5544
"Gene Symbol" is actually the systematic name for yeast, e.g., YAL001C
- Note that nomenclature can vary widely and while we try to use the correct names for things, not everyone does
- Note that the actual standard name does not need to be included in the file you use for stem
I made these changes and was able to run the file.
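The two formatting fixes could be scripted along these lines. This is a sketch only: the function name is made up, and the time point labels (such as `0m`) are placeholders for whatever the averaged columns are called.

```python
def to_stem_lines(timepoints, rows):
    """Format rows as tab-delimited lines for stem.

    "SPOT" is a running index starting at 1 and "Gene Symbol" holds the
    systematic name. `rows` is a list of (systematic_name, values) pairs.
    """
    lines = ["\t".join(["SPOT", "Gene Symbol", *timepoints])]
    for spot, (systematic, values) in enumerate(rows, start=1):
        lines.append("\t".join([str(spot), systematic, *map(str, values)]))
    return lines
```

Writing the result with `"\n".join(lines)` to a .txt file gives one header line followed by one line per gene.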

Before we move onto the next step, you need to write up a protocol for all the steps you carried out to go from the original file to this one. I want to review that and follow the steps myself to make sure I can replicate your results.

After that, the next step would be to generate candidate gene regulatory networks using Yeastract. It looks like out of the 8 significant patterns, 4 are generally up before returning to baseline and 4 are generally down before returning to baseline. In terms of looking for networks, it might work to group the genes from the 4 up and 4 down clusters.

kdahlquist commented 2 years ago

Even though we don't need the standard names to run stem, we will want them. I noticed some odd standard names in the file:

e.g., ATSÊ1.00
You should always do a "visual inspection" of the file to see if there are any obvious problems. Just open and scroll down. You can see the issue on row 18.

ahmad00m commented 2 years ago

Here is the final summary of cleaning the data, including the documentation of the steps taken.

The question of whether to include the mitochondrial gene expression data was raised during the meeting, and we decided to include it in the final version of the cleaned file.
Also, we decided to use the 0m expression data and the normalize data option in stem for clustering the gene expression data.
A final confirmation test was run on the original file, which contains around 13,000 records, to determine the unique expression data regardless of ID; it confirmed a total of 5,569 genes with unique expression data.
There is one gene that could not be found on the YEASTRACT website; it represents a small nucleolar RNA that is required for pre-mRNA processing. So my question is whether to include this data in the final version of the expression data. YNCG0013W is the name of this particular gene.
Other than that, the IDs in the expression data are sorted and ready for phase two.

Here is the documentation of the steps taken to clean the expression data. Documentation_of_gene_expression_data.docx

Also, the code is ready to be pushed to GitHub. Should @ahmad00m upload the documentation and the codes to GRNsight-archive?

Moreover, if @kdahlquist wants to confirm the steps I can email the codes before pushing them to GRNsight-archive.

Here is the final version of the file. (It differs slightly from the last one because it includes the expression data for mitochondrial genes, which we decided to include.) Unique_systematic_ID.txt

kdahlquist commented 2 years ago

@ahmad00m , you can upload the code to the GRNsight-archive. If it needs to be modified in the future, that's OK. It is preferable to keep it in the repository. Since GitHub keeps track of all versions, it's better to keep it there as opposed to having the only copy be on your computer.

ahmad00m commented 2 years ago

@ahmad00m pushed the codes to GRNsight-archive. So, they can be accessed for testing.

ahmad00m commented 2 years ago

@ahmad00m ran STEM and saved the results. I also tried analysing the results and got as far as generating the regulation matrix in YEASTRACT; however, no matrix was created after a while, and I did not get any errors either. I hope to troubleshoot this during the meeting so I can continue with visualising the model with GRNsight and determine which one would be appropriate to pursue further for modeling.

ahmad00m commented 2 years ago

@ahmad00m updated the codes and the documentation for replacing the ID's. I also added the original expression data to GRNsight-archive. Moreover, I tried to create the regulation matrix but the website doesn't return any matrices, so I'm hoping to troubleshoot that during the meeting later today.

kdahlquist commented 2 years ago

Some notes from the 2/14/22 meeting:

the problem with YEASTRACT was that there were leading spaces on the gene lists that @ahmad00m input into the regulation matrix tool. Deleting the spaces fixed the problem.
@ahmad00m will generate a total of four networks from YEASTRACT from the first four significant profiles found in stem.
he will use the setting DNA binding evidence AND expression evidence to increase the stringency and decrease the overall number of edges.
The target number of genes in the network is 15. He should take the top 15 transcription factor hits to generate the network and check to see if they are all connected. If they are, then he's done. If not, he can add/subtract genes to get a connected network of ~15
The output from YEASTRACT needs to be formatted to be compatible with GRNsight. The data needs to be transposed, alphabetized right to left and top to bottom, and the "p" needs to be removed from the gene names. Cell A1 needs to say "cols regulators/rows targets".
@ahmad00m will work with @Onariaginosa to generate 4 alternate networks with the same genes from the SGD database she is making. We expect there to potentially be fewer edges from the SGD data, based on work done a couple years ago.
@kdahlquist will review the documentation for data processing (probably at the end of the week.)
@ahmad00m needs to go back and make sure that he's got all the screenshots and stem data from the run that we are using.
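The leading-space fix and the GRNsight formatting from these notes could eventually be scripted instead of done by hand. The sketch below makes several assumptions that should be checked against real YEASTRACT output: the input matrix has regulators as rows and targets as columns, regulator names carry a trailing lowercase "p" (e.g. `Gcn4p`), names are upper-cased so that row and column labels match, and sorting is plain ascending A-to-Z. All function names are hypothetical.

```python
def clean_gene_list(text):
    """Strip leading/trailing spaces from each gene and drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def strip_p(name):
    """Remove a trailing lowercase 'p' (protein suffix) and upper-case the name."""
    name = name.strip()
    return (name[:-1] if name.endswith("p") else name).upper()


def to_grnsight_rows(regulators, targets, matrix):
    """Transpose and alphabetize a regulator-by-target matrix.

    Input:  matrix[i][j] is 1 if regulators[i] regulates targets[j].
    Output: rows are targets, columns are regulators, both sorted, with
            the required label in the top-left cell.
    """
    reg_order = sorted(range(len(regulators)), key=lambda i: strip_p(regulators[i]))
    tgt_order = sorted(range(len(targets)), key=lambda j: targets[j])

    header = ["cols regulators/rows targets"]
    header += [strip_p(regulators[i]) for i in reg_order]
    out = [header]
    for j in tgt_order:
        out.append([targets[j]] + [matrix[i][j] for i in reg_order])
    return out
```

Each inner list is one spreadsheet row, so the result can be written out with the `csv` module or pasted into the workbook.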

ahmad00m commented 2 years ago

@ahmad00m finished creating the adjacency matrices for the first four significant profiles using YEASTRACT database. Also, the new documentation containing all the steps up until visualizing the GRN on GRNsight will soon be pushed to GRNsight-archive for review.

ahmad00m commented 2 years ago

Here is the link to the complete Documentation

dondi commented 2 years ago

Follow-up wrap-up comments:

Seek to port the .docx to Markdown (.md) for easier viewing and editing
Rearrange scripts folder to reflect the dataset targeted by a particular set of scripts as, e.g., authorname-year-data

ahmad00m commented 2 years ago

The documentation is now available in Markdown format
The Scripts folder has now been rearranged and placed in a more descriptive directory

dondi commented 2 years ago

Initial review of the documentation looks good; it will need a “validation test” where someone who is unfamiliar with the process seeks to follow the instructions in order to accomplish the same result. Tentatively this looks like a good match for @ahmad00m to go over with @Sarronnn, minimizing intervention until they discover something that needs to be clarified in the documentation

kdahlquist commented 12 months ago

Closing because it is complete and live in v6.0.7

dondi / GRNsight

yeast expression data for database #937