MoseleyBioinformaticsLab / GOcats

A tool for categorizing Gene Ontology into subgraphs of user-defined emergent concepts

Using gocats to categorize a list of genes #10

Closed radusuciu closed 5 years ago

radusuciu commented 5 years ago

I've followed the tutorial and, after defining some custom categories and using goa_human.gaf obtained from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/, I have generated a mapped_GAF file. However, it is unclear to me how to translate this back to my custom categories.

My end goal is to take a gene list (subset of genes found in goa_human.gaf) and see how the various entries distribute across my custom categories.

Thanks for working on this!

ehinderer commented 5 years ago

Hi radusuciu,

The mapped_GAF file that is produced contains the same lines as the original goa_human.gaf file, except that on each line (one gene annotation), the directly annotated, specific GO term in column 5 is replaced by the GO term associated with one of your chosen categories (subgraphs). Refer to subgraph_report.txt to see which GO terms are associated with each one of your categories. If one of the GO terms associated with a gene maps to multiple categories that you created, multiple lines will have been added to mapped_GAF, each one representing a mapping to one of your categories. NOTE: Some genes may not have been mapped to any of your categories, and these will be listed in the unmapped_genes file.

I believe your use case is common enough to warrant a function to be added to GOcats as part of the reporting of categorize_dataset and this will likely be added in the future.

For the time being, if you're familiar enough with Python scripting, I recommend reading in the mapped_GAF file with Python's csv module, using a tab delimiter (this is what the GAF format uses). From there, you can use a defaultdict to count the lines for each unique value in column 5 (the mapped GO term, i.e. your category). This will tell you how many genes each mapped GO term (your chosen category) is associated with.
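As a rough sketch of that counting step (not part of GOcats itself): the function names below are made up for illustration, and the code assumes the mapped file keeps the standard GAF layout, with the gene identifier in column 2 and the mapped GO term in column 5. Because the same gene can appear on several lines for one category, this version collects unique gene identifiers per GO term rather than counting raw lines.

```python
import csv
from collections import defaultdict

GENE_ID_COL = 1  # 2nd GAF column (DB Object ID), assumed to hold the gene/protein ID
GO_ID_COL = 4    # 5th GAF column (GO ID), i.e. list index 4


def count_genes_per_category(gaf_lines):
    """Return {mapped GO term: number of unique gene IDs annotated to it}."""
    genes_by_term = defaultdict(set)
    reader = csv.reader(gaf_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, GAF header/comment lines ("!"), and short rows.
        if not row or row[0].startswith("!") or len(row) <= GO_ID_COL:
            continue
        genes_by_term[row[GO_ID_COL]].add(row[GENE_ID_COL])
    return {term: len(genes) for term, genes in genes_by_term.items()}


def count_genes_per_category_in_file(path):
    """Convenience wrapper; `path` is wherever your mapped GAF was written."""
    with open(path, newline="") as handle:
        return count_genes_per_category(handle)
```

Calling `count_genes_per_category_in_file` on the mapped GAF should give one entry per category GO term; if you want raw annotation counts instead, replace the set with an integer counter.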

If you're not sure how to go about this, I can generate a script that I think will do what you're wanting.

Please let me know if there's anything else you need help with!

radusuciu commented 5 years ago

@ehinderer Thank you, the part I was missing was mapping back GO terms to my custom categories using subgraph_report.txt. I think I understand now though. For example, I made a kinase category which is listed in subgraph_report.txt as follows:

-------------------------
kinase
Subgraph relationships: {'is_a': 3000, 'regulates': 56, 'has_part': 68, 'positively_regulates': 24, 'negatively_regulates': 40}
Seeded size: 332
Representative node: ['kinase activity']
Nodes added: 154
Non-subgraph hits (orphans): 79
Total nodes: 407

So, any entry in my mapped output that GOcats would mark as a kinase should have the GO ID associated with "kinase activity" in column 4: GO:0016301. Is this correct? If my understanding is sound, it seems trivial to translate these to category counts as you described.

I do have another question (I can make a separate issue if you desire) which pertains to input data for mapping. My use case, which is probably relatively common (though maybe not interesting), is to place a gene list into a number of broad categories, for example: enzyme, kinase, receptor. What I have on hand is a list of UniProt IDs which I can convert to any other identifier. Is downloading the GO Annotation file from the EBI (linked in my original post), and filtering it to only contain entries corresponding to my list of UniProt IDs a sensible approach?

ehinderer commented 5 years ago

Yes, that is correct. Your "kinase" category is represented by the GO term "kinase activity." Gene annotations in the mapped_GAF that are related to "kinase" will be listed as GO:0016301. Take care in specifying that column though: it is the 5th column (list index 4). The GAF format has a column that is rarely used, and it can throw you off if you're just eyeballing it.

For your second question, I would say that is a valid approach. I've used that same GAF (though a different version) from the EBI for many collaborations that require looking up GO terms for lists of UniProt IDs, as well as for my proof-of-concept experiments for the manuscript I wrote associated with this software. I'm not sure why you would need to filter the GAF down to those genes in your gene list, though. I'm assuming you just need to look up GO terms relevant to the genes in your list, and then report which categories they fall under, right? Regardless, you'll want to make note of the exact version of that GAF, including release date. Gene annotations are subject to being updated, and that means your results can change over time!
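A minimal sketch of that lookup, done on the mapped GAF without filtering the input first. The function name and the `category_names` dictionary are hypothetical: the dictionary is something you would build by hand from subgraph_report.txt (for example, GO:0016301 for "kinase"), and the code assumes UniProt accessions sit in column 2 of the GAF.

```python
import csv
from collections import defaultdict


def categories_for_genes(mapped_gaf_lines, gene_ids, category_names):
    """Group the genes of interest by the custom category they were mapped to.

    mapped_gaf_lines: iterable of tab-delimited GAF lines (e.g. an open file)
    gene_ids:         identifiers from your gene list (assumed: UniProt accessions)
    category_names:   hand-built dict of {category GO ID: category name}
    """
    wanted = set(gene_ids)
    genes_by_category = defaultdict(set)
    reader = csv.reader(mapped_gaf_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, header/comment lines ("!"), and truncated rows.
        if len(row) < 5 or row[0].startswith("!"):
            continue
        gene_id, go_id = row[1], row[4]  # columns 2 and 5 of the GAF
        if gene_id in wanted and go_id in category_names:
            genes_by_category[category_names[go_id]].add(gene_id)
    return dict(genes_by_category)
```

Genes from your list that never show up in the result either had no annotation that mapped to one of your categories or are absent from that GAF release, so it is worth checking them against the unmapped_genes file.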

radusuciu commented 5 years ago

Understood, thank you!

ehinderer commented 5 years ago

No problem, glad I could help!