Defining gene-category annotations from raw data in .obo instead of `termdb` form (no longer provided)

alyssadai commented 2 years ago

Hi Ben,

Thanks for your work on this toolbox!

Summary

I am trying to use the DataProcessing/ scripts to generate gene-to-category annotations from the latest GO release (2022-01-13), but I am a little stuck on how to proceed from the raw go-basic.obo file, since GO no longer provides its ontology in MySQL database form.

Details

As a first step, I've used the OboToTerm package to obtain term.txt, term2term.txt, and graph_path.txt files from my go-basic.obo file. Based on DataProcessing/GetGOTerms.m and the corresponding output GOTerms_BP.mat in your data repo, I've gathered that I can replicate this .mat file for my data fairly easily without SQL operations, by using bash text processing commands to look for biological_process terms in my term.txt file. It looks like processing GO term annotations with ReadDirectAnnotationFile.m does not require connecting to a MySQL database, so this second step seems feasible with the data I have as well.
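The term.txt filtering step described above could also be done in Python rather than bash. The following is only a minimal sketch: it assumes term.txt is tab-delimited with the column order of the legacy GO `term` table (id, name, term_type, acc, ...), which may not match what OboToTerm actually writes, so the column indices are exposed as parameters. The function name `read_bp_terms` and the sample rows are made up for illustration.

```python
import csv
import io


def read_bp_terms(handle, name_col=1, type_col=2, acc_col=3):
    """Return {GO accession: term name} for biological_process rows.

    The default column indices are an ASSUMPTION based on the legacy
    GO `term` table layout; adjust them to match your term.txt.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    needed = max(name_col, type_col, acc_col)
    terms = {}
    for row in reader:
        if len(row) <= needed:
            continue  # skip blank or malformed lines
        if row[type_col] == "biological_process":
            terms[row[acc_col]] = row[name_col]
    return terms


# Toy rows in the assumed layout (not taken from the real file).
sample = (
    "1\tmitochondrion inheritance\tbiological_process\tGO:0000001\t0\t0\t0\n"
    "2\treproduction\tbiological_process\tGO:0000003\t0\t0\t0\n"
    "3\tmolecular_function\tmolecular_function\tGO:0003674\t0\t1\t0\n"
)
bp_terms = read_bp_terms(io.StringIO(sample))
print(bp_terms)
```

For a real file, `open("term.txt", newline="")` would take the place of the `io.StringIO` object.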

The step I'm stuck on is running propagateHierarchy.m using just my term.txt and term2term.txt files - this script seems heavily dependent on having the data in a MySQL database/server. Any advice on how I can obtain propagated annotations (i.e. for use in enrichment analysis) using the plain table .txt files, or maybe coerce the tables into termdb format to work with the DataProcessing/ code?

Apologies if I've misunderstood any of the code - I'm still not very familiar with SQL!

For your reference, here are my OboToTerm-converted tables from go-basic.obo (processed 2022-01-13): term.txt, term2term.txt

benfulcher commented 2 years ago

Hi @alyssadai. Thanks for your interest in this and for such a detailed explanation of where you've gotten to. Apologies that I'm no longer working on this, but I know work is underway on a new and improved version (by someone else who has taken this repository as a starting point, developed it a lot further, and will release their work soon).

There are certainly ways that you might hack this together without MySQL, as propagating annotations up the hierarchy is a common operation on these datasets (e.g., I'm sure there are bioinformatics packages in Python and R that do this)…
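As a rough illustration of what a database-free propagation could look like, here is a minimal Python sketch. It is not the logic of propagateHierarchy.m; it just gives every ancestor of a term the genes annotated directly to that term. It assumes the hierarchy is available as (parent, child) pairs, for example read from term2term.txt, and that it is acyclic, as go-basic is intended to be. Which relationship types to follow (such as is_a and part_of) is left to whoever builds the edge list. All names and the toy IDs below are hypothetical.

```python
from collections import defaultdict


def build_parent_map(edges):
    """Map each child term to the set of its direct parents."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def ancestors(term, parents, cache):
    """Return all ancestors of `term` (excluding itself), memoized in `cache`."""
    if term in cache:
        return cache[term]
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    cache[term] = found
    return found


def propagate(direct, parents):
    """Give each term its own genes plus those of all its descendants."""
    propagated = defaultdict(set)
    cache = {}
    for term, genes in direct.items():
        propagated[term] |= genes
        for anc in ancestors(term, parents, cache):
            propagated[anc] |= genes
    return dict(propagated)


# Toy hierarchy: A is the parent of B and D; B is the parent of C.
edges = [("A", "B"), ("B", "C"), ("A", "D")]
direct = {"C": {"g1"}, "B": {"g3"}, "D": {"g2"}}
full = propagate(direct, build_parent_map(edges))
print(full)
```

In this toy case, term A ends up with all three genes because every other term sits beneath it, while C keeps only its direct annotation.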

In my case, adding to a local database made sense (and didn't take long), but it would be great if someone can think of an alternative approach that is compatible with the latest GO data…

alyssadai commented 2 years ago

Hi @benfulcher (@Melissa1909, also looping you in here since you inquired about this as well)!

Thanks for your answer, and excited to hear about a possible new and improved version of this package in the near future.

In the meantime, I've developed a workaround for this issue, implemented in my pull request #3. It should help users convert recent GO releases into a database format that works with SQLite versions of the commands originally used in the repo. It's certainly not the most elegant solution, but it should work as a temporary fix for anyone who urgently needs the functionality of GeneCategoryEnrichmentAnalysis.
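For readers curious about the general idea, loading tab-delimited tables into SQLite can be sketched with Python's standard `sqlite3` module. This is not the code from the pull request. The column names below follow the legacy GO `term` table and are an assumption, as are the helper name `load_table` and the sample rows.

```python
import csv
import io
import sqlite3


def load_table(conn, table, columns, handle):
    """Create `table` with TEXT columns and fill it from a tab-delimited handle."""
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only rows with the expected number of fields.
    rows = [row for row in reader if len(row) == len(columns)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# ASSUMED column layout, modelled on the legacy GO `term` table.
term_columns = ["id", "name", "term_type", "acc",
                "is_obsolete", "is_root", "is_relation"]
sample = (
    "1\tmitochondrion inheritance\tbiological_process\tGO:0000001\t0\t0\t0\n"
    "2\tmolecular_function\tmolecular_function\tGO:0003674\t0\t1\t0\n"
)

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
n_loaded = load_table(conn, "term", term_columns, io.StringIO(sample))
bp_accs = [r[0] for r in conn.execute(
    "SELECT acc FROM term WHERE term_type = 'biological_process' ORDER BY acc")]
print(n_loaded, bp_accs)
```

Once the tables are in a SQLite file, queries written for the MySQL version would still need checking for dialect differences before they can be reused.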

Hope this helps, and let me know if you have any other thoughts!

benfulcher commented 2 years ago

FYI: This is the new repo: https://github.com/LeonDLotter/ABAnnotate
