biothings / mygeneset.info

Apache License 2.0

Add mouse genesets to MSigDB #60

Closed ravila4 closed 1 year ago

ravila4 commented 1 year ago

Last month, MSigDB released a new dataset of mouse genesets (http://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp). We should update the MSigDB plugin to fetch and parse these files.

ravila4 commented 1 year ago

Tasks required:

  1. Fix the get_remote_version function so that it parses the correct version number from the release notes page: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_Latest_Release_Notes

  2. Update the downloaded files, and handle unzipping and transforming the contents of the XML file:
     Human file: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2022.1.Hs/msigdb_v2022.1.Hs_files_to_download_locally.zip
     Mouse file: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2022.1.Mm/msigdb_v2022.1.Mm_files_to_download_locally.zip

  3. Update the parser and uploader to run on the new mouse and human XML files.
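The first two tasks could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the helper names (parse_version, archive_url, xml_members) are hypothetical, the regular expression assumes the release notes mention the version as "v" followed by a string like 2022.1, and the URL pattern is inferred from the two links above.

```python
import io
import re
import zipfile

# Base path inferred from the human and mouse links in this issue.
BASE_URL = "https://data.broadinstitute.org/gsea-msigdb/msigdb/release"


def parse_version(release_notes_text):
    """Return the first version like '2022.1' that follows a 'v', or None.

    Assumption: the release notes write the version as e.g. 'v2022.1.Hs'.
    """
    match = re.search(r"v(\d{4}\.\d+)", release_notes_text)
    return match.group(1) if match else None


def archive_url(version, species):
    """Build the archive URL for a species code ('Hs' or 'Mm')."""
    release = f"{version}.{species}"
    return f"{BASE_URL}/{release}/msigdb_v{release}_files_to_download_locally.zip"


def xml_members(zip_bytes):
    """List the names of the XML files inside a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]
```

Keeping the version parsing separate from the URL construction means the same version string can be reused for both the Hs and Mm archives.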

A couple of notes on the data in the new genesets:

The 15918 gene sets in the Mouse Molecular Signatures Database (MSigDB) are divided into 6 major collections, and several sub-collections.

Regarding these collections, although most of the genes are indeed mouse genes, some of them are ortholog-converted from human to mouse.

An orthology converted form to aid in initial exploratory analysis of mouse datasets utilizing orthology mappings to MGI IDs provided by the Mouse Genome Informatics (MGI) institute at The Jackson Laboratory.

The XML data provides the original gene ID, as well as the orthology-converted one. The parser currently handles genesets from MSigDB by using the organism taxid of the original (upstream, pre-orthology-conversion) identifiers both for gene lookup and to assign the geneset taxid. This means that even though the downloaded dataset is a "mouse" collection, it will be ingested into our database as a combination of mouse and human genesets. Similarly, the "human" geneset collection from MSigDB currently contains orthology-converted genesets from multiple species, including mouse, rat, rhesus monkey, and zebrafish.
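To illustrate why a single download ends up as a mix of species, here is a small sketch of assigning a taxid from the source organism of each geneset. The XML element and attribute names (GENESET, STANDARD_NAME, ORGANISM) and the sample data are assumptions for illustration and may not match the real file or the actual parser; the taxids are the standard NCBI taxonomy IDs.

```python
import xml.etree.ElementTree as ET

# NCBI taxonomy IDs for the species mentioned in this issue.
TAXIDS = {
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
    "Rattus norvegicus": 10116,
    "Macaca mulatta": 9544,
    "Danio rerio": 7955,
}

# Made-up sample: a "mouse" file that also holds a human-derived geneset.
SAMPLE_XML = """
<MSIGDB>
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Mus musculus"/>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Homo sapiens"/>
</MSIGDB>
"""


def geneset_taxids(xml_text):
    """Map each geneset name to the taxid of its original organism."""
    root = ET.fromstring(xml_text)
    return {
        geneset.get("STANDARD_NAME"): TAXIDS.get(geneset.get("ORGANISM"))
        for geneset in root.iter("GENESET")
    }


print(geneset_taxids(SAMPLE_XML))
```

With this approach the taxid reflects where the identifiers originally came from, not which MSigDB database the file was downloaded from.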

To prevent confusion, I plan to add a new msigdb-specific metadata field to label whether the dataset came from the human or mouse database.
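One possible shape for such a label, as a sketch only: the field name source_database, the nesting under an msigdb key, and both helper functions are hypothetical. The species code is read from the archive file name, following the naming seen in the download links above.

```python
import re

# Species codes used in the archive names, mapped to a readable label.
SPECIES_LABELS = {"Hs": "human", "Mm": "mouse"}


def source_database(archive_name):
    """Return 'human' or 'mouse' based on the archive file name, else None."""
    match = re.search(r"msigdb_v\d{4}\.\d+\.(Hs|Mm)_", archive_name)
    return SPECIES_LABELS[match.group(1)] if match else None


def tag_geneset(doc, archive_name):
    """Return a copy of a geneset document with the source database recorded.

    The 'source_database' field name is a placeholder, not an existing field.
    """
    tagged = dict(doc)
    msigdb = dict(tagged.get("msigdb", {}))
    msigdb["source_database"] = source_database(archive_name)
    tagged["msigdb"] = msigdb
    return tagged
```

For example, a document with taxid 9606 parsed from msigdb_v2022.1.Mm_files_to_download_locally.zip would keep its human taxid while being labeled as coming from the mouse database.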

ravila4 commented 1 year ago

After closer inspection, I have decided not to use the mouse data after all. You can read more about the reasoning in https://github.com/biothings/mygeneset.info/pull/74