biothings / mygeneset.info

Apache License 2.0

Enhancement: Reduce MSigDB upload time #71

ravila4 closed 1 year ago

ravila4 commented 1 year ago

MSigDB currently takes about 6-8 hours to upload. I think it could be faster.

I made a mistake when designing the parser: it parses out the genesets and queries the gene list for each geneset individually, relying on a hashed dictionary of previously seen genes to avoid redundant queries.

In my experience, the fastest way to query gene lists is to read the file twice. The first read builds a set of all unique genes across all genesets and queries them against mygene.info in one go. The second read looks up the hashed results to generate the individual gene lists. Other data plugins use this method, and they run much faster.
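For illustration, the two-read approach could look roughly like the sketch below. This is not the actual plugin code: `read_genesets` and `batch_query` are hypothetical callables standing in for the file parser and for a single batched lookup against mygene.info.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# (geneset name, list of gene identifiers) -- assumed shape, for illustration only
Geneset = Tuple[str, List[str]]


def build_gene_lists(
    read_genesets: Callable[[], Iterable[Geneset]],
    batch_query: Callable[[List[str]], Dict[str, dict]],
) -> Iterator[Tuple[str, List[dict]]]:
    """Two-pass lookup: gather unique genes, query once, then assemble lists.

    `read_genesets` (hypothetical) re-reads the source file on every call.
    `batch_query` (hypothetical) maps gene IDs to resolved gene documents.
    """
    # First read: collect every unique gene across all genesets.
    unique_genes = set()
    for _name, genes in read_genesets():
        unique_genes.update(genes)

    # One batched query instead of one query per geneset.
    resolved = batch_query(sorted(unique_genes))

    # Second read: build each geneset's gene list from the cached results,
    # skipping identifiers that the lookup could not resolve.
    for name, genes in read_genesets():
        yield name, [resolved[g] for g in genes if g in resolved]
```

The trade-off is that the file is parsed twice, but the number of lookups no longer grows with the number of genesets.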

ravila4 commented 1 year ago

I looked into this further, and I don't think it's feasible without significant effort. What distinguishes MSigDB from other datasets is that the downloaded XML data contains a mix of genes from different species and genes with different identifier types. This makes it hard to batch process all the genes at once.

The parser has already been optimized by sorting on the original organism, but determining which type of ID to use for lookup has to be handled on a case-by-case basis.
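To make the difficulty concrete, here is a rough sketch of what batching would require: every gene has to be bucketed by both species and identifier type before it can be queried. The `guess_id_type` heuristic is purely hypothetical and far cruder than what the real data needs, which is the case-by-case part. The returned strings are meant as lookup scopes (assumed here to be `entrezgene`, `ensembl.gene`, and `symbol`).

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def guess_id_type(gene_id: str) -> str:
    """Hypothetical heuristic for picking a lookup scope from an ID's shape."""
    if gene_id.isdigit():
        return "entrezgene"  # Entrez gene IDs are plain integers
    if re.fullmatch(r"ENS[A-Z]*G\d+", gene_id):
        return "ensembl.gene"  # e.g. ENSG..., ENSMUSG...
    return "symbol"  # fallback; real data is often ambiguous here


def bucket_genes(
    records: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Group (species, gene_id) pairs into one bucket per (species, ID type)."""
    buckets: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for species, gene_id in records:
        buckets[(species, guess_id_type(gene_id))].add(gene_id)
    return buckets
```

Each bucket could then be sent as its own batched query, but the result is only as good as the ID-type guess, and a wrong guess silently drops or mismatches genes.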