Closed ravila4 closed 1 year ago
The parser works, but takes about 14 hours to run. Here are some optimizations we can make:
Use an XSLT file to preprocess the organism attribute before parsing. This lets us cache all queries against mygene.info for genesets of the same species, and so avoid sending duplicate queries.

Both the parser and the geneset_utilities code have been reworked to query mygene.info more efficiently.
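The caching idea above could look roughly like the sketch below. It is not the code from this repository: the SpeciesQueryCache class and the query_fn callback are hypothetical names, and query_fn stands in for whatever function actually sends the batch query to mygene.info.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical callback: takes a species and a list of gene ids, and returns
# a mapping from each id it could resolve to a result record.
QueryFn = Callable[[str, List[str]], Dict[str, dict]]


class SpeciesQueryCache:
    """Remember results per (species, gene id) so each id is queried once."""

    def __init__(self, query_fn: QueryFn) -> None:
        self._query_fn = query_fn
        self._cache: Dict[Tuple[str, str], Optional[dict]] = {}

    def lookup(self, species: str, gene_ids: Iterable[str]) -> Dict[str, Optional[dict]]:
        # Drop repeated ids within this geneset while keeping their order.
        unique_ids = list(dict.fromkeys(gene_ids))
        # Only ids never seen for this species need a remote query.
        uncached = [g for g in unique_ids if (species, g) not in self._cache]
        if uncached:
            results = self._query_fn(species, uncached)
            for gene_id in uncached:
                # Unresolved ids are cached as None so they are not retried.
                self._cache[(species, gene_id)] = results.get(gene_id)
        return {g: self._cache[(species, g)] for g in unique_ids}
```

Keying the cache on the species as well as the id keeps identical symbols from different organisms apart, which is why knowing the organism up front matters.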
Improvements to geneset_utilities:
The MSigDB parser now keeps track of duplicates, missing genes, and the original source IDs.
Modified the parser and dumper to use the XML files, which carry more metadata, including the geneset name and description: https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/7.5.1/msigdb_v7.5.1.xml
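As an illustration of reading that metadata, here is a minimal streaming sketch. It assumes each geneset is a GENESET element with STANDARD_NAME, DESCRIPTION_BRIEF, ORGANISM and MEMBERS_SYMBOLIZED attributes; check those names against the downloaded file. The iter_genesets function is a hypothetical helper, not the parser in this repository.

```python
import xml.etree.ElementTree as ET


def iter_genesets(source):
    """Yield one dict per GENESET element without loading the whole file.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "GENESET":
            continue
        # Assumed attribute names; a missing attribute yields None or [].
        members = elem.get("MEMBERS_SYMBOLIZED") or ""
        yield {
            "name": elem.get("STANDARD_NAME"),
            "description": elem.get("DESCRIPTION_BRIEF"),
            "organism": elem.get("ORGANISM"),
            "members": [m for m in members.split(",") if m],
        }
        # Free the element once its data has been copied out.
        elem.clear()
```

Streaming with iterparse keeps memory use flat on a large release file, and the organism field is what a per-species cache would key on.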
Modify the geneset utilities to allow adding multiple duplicate hits to a geneset.
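One way to keep duplicates, missing genes, and source ids together is sketched below. It assumes hits shaped like mygene.info batch-query results, where every record has a "query" key and carries either an "_id" or "notfound": True. The summarize_hits function and its output keys are hypothetical, not the actual geneset_utilities interface.

```python
from collections import defaultdict


def summarize_hits(hits):
    """Group query hits by source id, keeping duplicates and misses."""
    ids_by_source = defaultdict(list)
    not_found = []
    for hit in hits:
        source_id = hit["query"]  # the id as it appeared in the geneset
        if hit.get("notfound"):
            not_found.append(source_id)
        else:
            ids_by_source[source_id].append(hit["_id"])

    # Every hit is kept, so a source id with several matches adds several genes.
    genes = [
        {"source_id": source_id, "mygene_id": gene_id}
        for source_id, gene_ids in ids_by_source.items()
        for gene_id in gene_ids
    ]
    # Source ids that resolved to more than one gene.
    duplicates = {s: ids for s, ids in ids_by_source.items() if len(ids) > 1}
    return {"genes": genes, "duplicates": duplicates, "not_found": not_found}
```

Recording the source id on every gene makes it possible to trace an ambiguous hit back to the identifier in the original file.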