eweitz / ideogram

Chromosome visualization for the web
https://eweitz.github.io/ideogram

70% smaller cache, with all paralogs; 20% faster related genes start time #302

Closed eweitz closed 2 years ago

eweitz commented 2 years ago

This improves cache compression ratios and parse times, yielding faster initialization for the related genes kit.

It also now stores all paralogs, instead of only up to 10 for each gene. This lays a foundation for richer paralog functionality.

Web cache size decreased from 22 MB to 8.4 MB: 70% smaller. Start time decreased from 951 ms to 762 ms: 20% faster.

Paralog cache optimizations

The paralog cache size was reduced mainly by introducing a "pointer" construct. It turns out that genes often have the same set of paralogs as other genes, which makes sense, as paralogy is a symmetric relationship. We can thus record paralogy far more efficiently than before by referring to each shared list of paralogs with a simple pointer. The "pointer" is simply the gene symbol of the first gene in the paralog set, so each full set is enumerated only once, rather than N times.
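The pointer idea can be sketched as follows. This is a minimal illustration, not the cache format Ideogram actually uses: the function names `compress` and `lookup`, the dictionary layout, and the placeholder gene symbols are all assumptions made for the example.

```python
from typing import Dict, List, Union

# Hypothetical cache shape: a gene maps either to a full list of paralogs,
# or to a "pointer" (the symbol of the first gene that stored the shared set).
Cache = Dict[str, Union[List[str], str]]


def compress(paralogs_by_gene: Dict[str, List[str]]) -> Cache:
    """Store each distinct paralog set once; later members point to it."""
    owner_by_family: Dict[frozenset, str] = {}
    cache: Cache = {}
    for gene, paralogs in paralogs_by_gene.items():
        # A gene plus its paralogs forms the shared "family" set
        family = frozenset([gene, *paralogs])
        owner = owner_by_family.get(family)
        if owner is None:
            owner_by_family[family] = gene
            cache[gene] = sorted(paralogs)  # enumerate the set once
        else:
            cache[gene] = owner  # pointer: symbol of the first gene in the set
    return cache


def lookup(cache: Cache, gene: str) -> List[str]:
    """Resolve a gene's paralogs, following a pointer if needed."""
    entry = cache[gene]
    if isinstance(entry, str):
        family = [entry, *cache[entry]]
        return sorted(g for g in family if g != gene)
    return entry


# Placeholder symbols, not real paralog data
cache = compress({
    "A1": ["A2", "A3"],
    "A2": ["A1", "A3"],
    "A3": ["A1", "A2"],
    "B1": [],
})
```

In this sketch a family of N genes costs one full list plus N - 1 short pointers, instead of N full lists, which is where the savings would come from for large families.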

This is so efficient that we can list all paralogs for all genes, rather than only the first 10 paralogs as before. Many genes have hundreds of paralogs, and some have over 1000. These are sometimes clustered by genomic position; we could use overlap annotations to denote paralog neighborhoods.

Interaction cache optimizations

These optimizations were simpler. Previously, interaction data for all organisms was combined in one file, which used a verbose JSON structure. Splitting the big cache file into multiple smaller cache files by species, replacing objects with arrays, and condensing duplicative WikiPathways names shrank the cache quite a bit.
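A rough sketch of these three steps is below. The record fields (`organism`, `gene`, `pathwayId`, `pathwayName`), the function names, and the output layout are hypothetical, chosen only to illustrate the technique. In particular, "condensing" names is interpreted here as storing each repeated name once in a lookup table, which may differ from what the PR actually does.

```python
import json
from typing import Dict, List


def condense(records: List[dict]) -> dict:
    """Replace per-record objects with positional rows; store each name once."""
    names: List[str] = []
    name_index: Dict[str, int] = {}
    rows: List[list] = []
    for rec in records:
        name = rec["pathwayName"]
        if name not in name_index:
            name_index[name] = len(names)
            names.append(name)
        # Positional row: [gene, pathway ID, index into names]
        rows.append([rec["gene"], rec["pathwayId"], name_index[name]])
    return {"names": names, "rows": rows}


def split_by_species(records: List[dict]) -> Dict[str, dict]:
    """Group records by organism, so each species gets its own small cache."""
    by_species: Dict[str, List[dict]] = {}
    for rec in records:
        by_species.setdefault(rec["organism"], []).append(rec)
    return {org: condense(recs) for org, recs in by_species.items()}


# Placeholder records, not real interaction data
records = [
    {"organism": "homo-sapiens", "gene": "G1",
     "pathwayId": "WP1", "pathwayName": "Pathway one"},
    {"organism": "homo-sapiens", "gene": "G2",
     "pathwayId": "WP1", "pathwayName": "Pathway one"},
    {"organism": "mus-musculus", "gene": "G3",
     "pathwayId": "WP2", "pathwayName": "Pathway two"},
]
caches = split_by_species(records)
human_json = json.dumps(caches["homo-sapiens"])
```

Each entry in `caches` could then be written to its own file, so a page showing one organism would only download and parse that organism's data.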

Timing data

Times were measured by recording the total times that pre-existing instrumentation logs to the console, using http://localhost:8080/examples/vanilla/related-genes?q=RAD51&org=homo-sapiens&debug=true. Each timing below was measured in its own run, so the tables reflect 60 samples (3 phases, 10 runs each, before and after), not 20.

Original (v1.36.0)

| Phase | Sample times (ms) | Mean time (ms) |
| --- | --- | --- |
| Parse paralogs | 534 + 813 + 877 + 517 + 347 + 501 + 692 + 742 + 560 + 743 | 632 |
| Parse interactions | 953 + 882 + 798 + 823 + 836 + 859 + 947 + 991 + 904 + 840 | 883 |
| Start time | 946 + 901 + 915 + 1043 + 954 + 909 + 860 + 862 + 1164 + 956 | 951 |

After optimization

| Phase | Sample times (ms) | Mean time (ms) |
| --- | --- | --- |
| Parse paralogs | 730 + 623 + 380 + 352 + 346 + 643 + 686 + 759 + 601 + 669 | 579 |
| Parse interactions | 627 + 629 + 642 + 636 + 667 + 629 + 775 + 708 + 672 + 686 | 667 |
| Start time | 713 + 744 + 798 + 800 + 757 + 759 + 673 + 777 + 763 + 831 | 762 |
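
The headline start-time figure can be rechecked from the samples above. This is only an arithmetic sketch: the sample values are copied from the "Start time" rows, and the variable names are chosen for illustration.

```python
from statistics import mean

# "Start time" samples in ms, copied from the timing data above
before_ms = [946, 901, 915, 1043, 954, 909, 860, 862, 1164, 956]
after_ms = [713, 744, 798, 800, 757, 759, 673, 777, 763, 831]

mean_before = mean(before_ms)
mean_after = mean(after_ms)

# Relative reduction in start time, as a fraction of the original mean
reduction = (mean_before - mean_after) / mean_before

print(f"Mean before: {mean_before:.1f} ms")
print(f"Mean after: {mean_after:.1f} ms")
print(f"Reduction: {reduction:.1%}")
```

The means come out to 951 ms and 761.5 ms, a reduction of just under 20%, consistent with the rounded figures reported in the summary.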

coveralls commented 2 years ago

Coverage increased (+0.1%) to 87.681% when pulling 96129a001839aa9ec183d3be0f76150f77b7d362 on smaller-caches into becb2c2b625aa7e15884dafff1990eda6c929687 on master.