biothings / mygeneset.info

Apache License 2.0

Switch to parsing XML files from MSIGDB to get more metadata #48

Closed ravila4 closed 1 year ago

ravila4 commented 2 years ago

Modified the parser and dumper to use the XML files, as they have more metadata, including the geneset name and description: https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/7.5.1/msigdb_v7.5.1.xml

Modify the geneset utilities to allow adding multiple duplicate hits to a geneset.
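For reference, a minimal sketch of how the XML could be streamed, not the actual parser in this repo. It assumes each geneset is a `GENESET` element whose metadata lives in attributes such as `STANDARD_NAME`, `DESCRIPTION_BRIEF`, and `MEMBERS_EZID`; check these names against the downloaded file. The function name `iter_genesets` and the output keys are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_genesets(source):
    """Yield one dict per GENESET element, streaming so the whole file is never held in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "GENESET":
            continue
        attrs = elem.attrib
        yield {
            # Attribute names are assumptions about the MSigDB XML layout.
            "name": attrs.get("STANDARD_NAME"),
            "description": attrs.get("DESCRIPTION_BRIEF"),
            "genes": [g for g in attrs.get("MEMBERS_EZID", "").split(",") if g],
        }
        elem.clear()  # release the element once its data has been copied out


# Toy input standing in for msigdb_v7.5.1.xml (not real MSigDB data).
sample = io.BytesIO(
    b'<MSIGDB>'
    b'<GENESET STANDARD_NAME="EXAMPLE_SET" DESCRIPTION_BRIEF="Toy set" MEMBERS_EZID="1,2,3"/>'
    b'</MSIGDB>'
)
genesets = list(iter_genesets(sample))
```

`iterparse` is used here instead of loading the full tree because the release file is large; the same generator should accept a path to the real file in place of the in-memory sample.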

ravila4 commented 2 years ago

The parser is working, but it takes about 14 hours to run. Here are some optimizations we can do:

ravila4 commented 1 year ago

Both the parser and geneset_utilities code have been reworked to query mygene.info more efficiently.
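One way "more efficiently" could look, as a sketch under my own assumptions rather than the reworked code itself: deduplicate the gene IDs across genesets first, then send them in batches instead of one request per gene. `unique_batches`, `lookup_all`, and the `query_batch` callable are hypothetical names, and the default batch size of 1000 is an assumption to check against the mygene.info batch query limit.

```python
from itertools import islice


def unique_batches(gene_ids, batch_size=1000):
    """Drop repeated IDs (keeping first-seen order) and yield them in fixed-size batches."""
    unique = iter(dict.fromkeys(gene_ids))
    while batch := list(islice(unique, batch_size)):
        yield batch


def lookup_all(gene_ids, query_batch, batch_size=1000):
    """Call query_batch once per batch and merge all hits into one list."""
    hits = []
    for batch in unique_batches(gene_ids, batch_size):
        hits.extend(query_batch(batch))
    return hits
```

With the `mygene` Python client, `query_batch` could be something like `lambda ids: mg.querymany(ids, scopes="entrezgene", fields="symbol,name", species="human")`, where `mg = mygene.MyGeneInfo()`; the scopes and fields shown are placeholders.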

Improvements to geneset_utilities:

The MSIGDB parser now keeps track of duplicates, missing genes, and the original source IDs.
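A rough illustration of that bookkeeping, not the parser's actual implementation. It assumes hits shaped like mygene.info batch query results, where each hit carries the original `query` term and unmatched terms come back with `notfound` set to true. `summarize_hits` is a hypothetical helper, and the hits below are toy data.

```python
from collections import defaultdict


def summarize_hits(hits):
    """Group hits by original source ID and report duplicate and missing IDs."""
    by_query = defaultdict(list)
    missing = []
    for hit in hits:
        if hit.get("notfound"):
            missing.append(hit["query"])
        else:
            by_query[hit["query"]].append(hit)
    # A source ID is a duplicate when it matched more than one gene.
    duplicates = {q: matched for q, matched in by_query.items() if len(matched) > 1}
    return dict(by_query), duplicates, missing


# Toy hits with made-up IDs, for illustration only.
toy_hits = [
    {"query": "GENE_A", "_id": "101"},
    {"query": "GENE_B", "_id": "201"},
    {"query": "GENE_B", "_id": "202"},
    {"query": "GENE_C", "notfound": True},
]
found, duplicates, missing = summarize_hits(toy_hits)
```

Keeping every hit under its source ID is what lets a geneset carry multiple duplicate hits instead of silently keeping only the first match.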