Switch to parsing XML files from MSIGDB to get more metadata

biothings / mygeneset.info

Closed ravila4 closed 1 year ago

ravila4 commented 2 years ago

Modified the parser and dumper to use the XML files, since they carry more metadata, including the geneset name and description: https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/7.5.1/msigdb_v7.5.1.xml
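
For reference, a minimal sketch of streaming genesets out of an MSigDB-style XML file with the standard library. This is not the parser in this repo: the function name `iter_genesets` and the inline sample document are made up for illustration, and the attribute names (`STANDARD_NAME`, `ORGANISM`, `DESCRIPTION_BRIEF`, `MEMBERS`) are assumed from the MSigDB XML layout and should be checked against the actual release file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; attribute names are assumptions.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<MSIGDB NAME="msigdb" VERSION="sample">
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"
           DESCRIPTION_BRIEF="First example set" MEMBERS="TP53,BRCA1,TP53"/>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"
           DESCRIPTION_BRIEF="Second example set" MEMBERS="Trp53"/>
</MSIGDB>
"""


def iter_genesets(source):
    """Yield one dict per GENESET element without loading the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "GENESET":
            continue
        yield {
            "name": elem.get("STANDARD_NAME"),
            "organism": elem.get("ORGANISM"),
            "description": elem.get("DESCRIPTION_BRIEF"),
            # Keep duplicates and original order so they can be tracked later.
            "members": [m for m in elem.get("MEMBERS", "").split(",") if m],
        }
        elem.clear()  # free the attributes of elements we are done with


genesets = list(iter_genesets(io.BytesIO(SAMPLE_XML)))
for geneset in genesets:
    print(geneset["name"], geneset["organism"], len(geneset["members"]))
```

Using `iterparse` keeps memory use flat on a large file, at the cost of only seeing one geneset at a time.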

Modify the geneset utilities to allow adding multiple duplicate hits to a geneset.

ravila4 commented 2 years ago

The parser is working, but takes about 14 hours to run. Here are some optimizations we can make:

Sort the genesets in the source XML file by the organism attribute using an XSLT file before parsing. This lets us cache all the queries against mygene.info for genesets with the same species, and thus avoid sending duplicate queries.
Determine the id type in each geneset (e.g. entrez, ensembl.gene, symbol) and submit more specific queries, instead of querying against multiple fields.
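
A rough sketch of both ideas together, under some assumptions. It sorts in Python instead of with XSLT, the id-type rules are simple heuristics (all digits means an Entrez id, an `ENS...G` plus 11 digits pattern means an Ensembl gene id, anything else is treated as a symbol), and `query_fn` is a hypothetical stand-in for whatever actually calls mygene.info. The function names here are illustrative and not taken from the repo.

```python
import re
from itertools import groupby

ENSEMBL_GENE = re.compile(r"^ENS[A-Z]*G\d{11}$")


def guess_id_type(gene_id):
    """Heuristic guess of the query field for one identifier."""
    if gene_id.isdigit():
        return "entrezgene"
    if ENSEMBL_GENE.match(gene_id):
        return "ensembl.gene"
    return "symbol"


def resolve_genesets(genesets, query_fn):
    """query_fn(ids, scope, organism) must return a dict {id: hit}."""
    results = {}
    ordered = sorted(genesets, key=lambda g: g["organism"])
    for organism, group in groupby(ordered, key=lambda g: g["organism"]):
        cache = {}  # id -> hit (or None); only valid within one organism
        for geneset in group:
            by_scope = {}
            for gene_id in dict.fromkeys(geneset["members"]):  # de-duplicate
                if gene_id not in cache:
                    scope = guess_id_type(gene_id)
                    by_scope.setdefault(scope, []).append(gene_id)
            for scope, ids in by_scope.items():
                hits = query_fn(ids, scope, organism)
                for gene_id in ids:
                    cache[gene_id] = hits.get(gene_id)  # remember misses too
            results[geneset["name"]] = [cache[g] for g in geneset["members"]]
    return results


# Demo with a fake query function that records every call it receives.
calls = []


def fake_query(ids, scope, organism):
    calls.append((tuple(ids), scope, organism))
    return {i: {"query": i, "scope": scope} for i in ids}


demo = [
    {"name": "A", "organism": "Homo sapiens",
     "members": ["7157", "ENSG00000141510", "TP53"]},
    {"name": "C", "organism": "Mus musculus", "members": ["TP53"]},
    {"name": "B", "organism": "Homo sapiens", "members": ["TP53", "672"]},
]
resolved = resolve_genesets(demo, fake_query)
print(len(calls), "queries sent")
```

In the demo, `TP53` is queried once for human and once for mouse, since the cache is reset whenever the organism changes.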

ravila4 commented 1 year ago

Both the parser and geneset_utilities code have been reworked to query mygene.info more efficiently.

Improvements to geneset_utilities:

It is now possible to supply more than two lists of genes, so that failed gene queries can be retried.
Duplicate hits are now stored correctly.
The query cache is now working correctly, and should minimize the number of requests needed.
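
To illustrate the behaviour described above, here is a small self-contained sketch. It is not the actual geneset_utilities code: `resolve_with_fallbacks`, `make_cached_lookup`, and the fake lookup table are hypothetical, and it assumes the extra lists are parallel lists of alternative ids for the same genes.

```python
def make_cached_lookup(lookup):
    """Wrap lookup(gene_id, scope) so each distinct query is sent only once."""
    cache = {}

    def cached(gene_id, scope):
        key = (gene_id, scope)
        if key not in cache:
            cache[key] = lookup(gene_id, scope)
        return cache[key]

    return cached


def resolve_with_fallbacks(id_lists, lookup):
    """id_lists is [(ids, scope), ...]; position i in every list is one gene."""
    primary_ids = id_lists[0][0]
    resolved, missing = [], []
    for i, source_id in enumerate(primary_ids):
        for ids, scope in id_lists:
            if not ids[i]:
                continue  # no alternative id of this type for this gene
            hits = lookup(ids[i], scope)
            if hits:
                # Keep every hit, so duplicate matches are not dropped.
                resolved.append({"source_id": source_id,
                                 "matched_scope": scope,
                                 "hits": hits})
                break
        else:
            missing.append(source_id)
    return resolved, missing


# Fake backend standing in for mygene.info; it records each raw request.
TABLE = {
    ("TP53", "symbol"): [{"_id": "7157"}],
    ("672", "entrezgene"): [{"_id": "672"}],
    ("DUP1", "symbol"): [{"_id": "1"}, {"_id": "2"}],
}
raw_calls = []


def fake_lookup(gene_id, scope):
    raw_calls.append((gene_id, scope))
    return TABLE.get((gene_id, scope), [])


lookup = make_cached_lookup(fake_lookup)
id_lists = [
    (["TP53", "OLDNAME", "DUP1", "NOPE"], "symbol"),
    (["7157", "672", "1", "0"], "entrezgene"),
    (["ENSG00000141510", "ENSG00000012048", "", ""], "ensembl.gene"),
]
resolved, missing = resolve_with_fallbacks(id_lists, lookup)
resolve_with_fallbacks(id_lists, lookup)  # second pass is served from cache
print(len(resolved), "resolved;", "missing:", missing)
```

Each resolved entry keeps the original source id alongside its hits, which makes it easy to report duplicates and missing genes afterwards.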

The MSIGDB parser now keeps track of duplicates, missing genes, and the original source ids.