70% smaller cache, with all paralogs; 20% faster related genes start time

This improves cache compression ratios and parse times, yielding faster initialization for the related genes kit.

It also now stores all paralogs, instead of at most 10 per gene. This lays a foundation for richer paralog functionality.

Web cache size decreased from 22 MB to 8.4 MB: 70% smaller. Start time decreased from 951 ms to 762 ms: 20% faster.

Paralog cache optimizations

The paralog cache was shrunk mainly by introducing a "pointer" construct. It turns out that genes often have the same set of paralogs as other genes, which makes sense, as paralogy is a symmetric relationship. We can thus record paralogy far more efficiently than before by referring to each shared list of paralogs with a simple pointer. The "pointer" is essentially the gene symbol of the first gene in the paralog set. So each full set is enumerated only once, rather than N times.
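
As a rough illustration of the pointer idea, here is a minimal Python sketch. It is not the actual cache format: the `_` pointer prefix, the function names, and the gene symbols (`A1`, `B1`, ...) are all hypothetical, and it assumes a "set" means a gene together with its paralogs, so that symmetric relationships produce identical sets.

```python
POINTER_PREFIX = "_"  # hypothetical marker; assumes no gene symbol starts with it


def encode_paralogs(paralogs_by_gene):
    """Store each shared paralog set once; later genes point to the first gene."""
    first_gene_by_family = {}
    encoded = {}
    for gene, paralogs in paralogs_by_gene.items():
        # Assumption: the shared "set" is the gene plus its paralogs.
        family = frozenset([gene, *paralogs])
        if family in first_gene_by_family:
            encoded[gene] = POINTER_PREFIX + first_gene_by_family[family]
        else:
            first_gene_by_family[family] = gene
            encoded[gene] = sorted(family - {gene})
    return encoded


def decode_paralogs(encoded, gene):
    """Resolve a gene's paralogs, following a pointer if one is stored."""
    entry = encoded[gene]
    if isinstance(entry, str):
        owner = entry[len(POINTER_PREFIX):]
        family = set(encoded[owner]) | {owner}
        return sorted(family - {gene})
    return entry


# Toy input with made-up gene symbols
paralogs_by_gene = {
    "A1": ["A2", "A3"],
    "A2": ["A1", "A3"],
    "A3": ["A1", "A2"],
    "B1": ["B2"],
    "B2": ["B1"],
}
encoded = encode_paralogs(paralogs_by_gene)
print(encoded)
```

In this toy input, only `A1` and `B1` carry full lists; every other gene stores a short pointer string, and `decode_paralogs` rebuilds its list on demand.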

This is so efficient that we can list all paralogs for all genes, rather than only the first 10 paralogs as before. Many genes have hundreds of paralogs, sometimes over 1000! These are sometimes clustered by genomic position; we could use overlap annotations to denote paralog neighborhoods.

Interaction cache optimizations

These optimizations were simpler. Previously, interaction data for all organisms was combined in one file, which used a verbose JSON structure. Splitting the big cache file into multiple smaller cache files by species, replacing objects with arrays, and condensing duplicative WikiPathways names shrank the cache quite a bit.
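
The three changes above can be sketched in Python as follows. This is an illustration under assumptions, not the real cache schema: the field names (`organism`, `gene`, `pathway_id`, `pathway_name`), the sample records, and the helper functions are all hypothetical.

```python
import json


def to_compact(interactions):
    """Replace keyed objects with arrays and store each pathway name once."""
    names, name_index, rows = [], {}, []
    for ixn in interactions:
        name = ixn["pathway_name"]
        if name not in name_index:
            name_index[name] = len(names)
            names.append(name)
        # Row layout (hypothetical): [gene, pathway ID, index into names]
        rows.append([ixn["gene"], ixn["pathway_id"], name_index[name]])
    return {"names": names, "rows": rows}


def split_by_species(interactions):
    """Group records by organism, then compact each group separately."""
    by_species = {}
    for ixn in interactions:
        by_species.setdefault(ixn["organism"], []).append(ixn)
    return {org: to_compact(ixns) for org, ixns in by_species.items()}


def json_size(data):
    """Length in characters of the minified JSON serialization."""
    return len(json.dumps(data, separators=(",", ":")))


# Made-up records in a verbose, one-object-per-interaction layout
interactions = [
    {"organism": "homo-sapiens", "gene": "A1",
     "pathway_id": "WP1", "pathway_name": "Example pathway one"},
    {"organism": "homo-sapiens", "gene": "A2",
     "pathway_id": "WP1", "pathway_name": "Example pathway one"},
    {"organism": "mus-musculus", "gene": "B1",
     "pathway_id": "WP2", "pathway_name": "Example pathway two"},
]

caches = split_by_species(interactions)
compact_total = sum(json_size(cache) for cache in caches.values())
print(json_size(interactions), compact_total)
```

Each per-species cache can then be written to its own file, so a client only fetches and parses data for the organism it needs.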

Timing data

Times were measured by recording console output for total times with pre-existing instrumentation, using http://localhost:8080/examples/vanilla/related-genes?q=RAD51&org=homo-sapiens&debug=true. Each timing below was measured separately; so 60 samples, not 20.

Original (v1.36.0)
Phase	Sample times (ms)	Mean (ms)
Parse paralogs	534 + 813 + 877 + 517 + 347 + 501 + 692 + 742 + 560 + 743	632
Parse interactions	953 + 882 + 798 + 823 + 836 + 859 + 947 + 991 + 904 + 840	883
Start time	946 + 901 + 915 + 1043 + 954 + 909 + 860 + 862 + 1164 + 956	951

After optimization
Phase	Sample times (ms)	Mean (ms)
Parse paralogs	730 + 623 + 380 + 352 + 346 + 643 + 686 + 759 + 601 + 669	579
Parse interactions	627 + 629 + 642 + 636 + 667 + 629 + 775 + 708 + 672 + 686	667
Start time	713 + 744 + 798 + 800 + 757 + 759 + 673 + 777 + 763 + 831	762
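
For reference, the last column and the start-time speedup can be recomputed from the samples with a short Python sketch. The sample values are copied from the "Start time" rows above; the `percent_faster` helper is a hypothetical name introduced here.

```python
from statistics import mean

# "Start time" samples (ms), copied from the two tables
start_before = [946, 901, 915, 1043, 954, 909, 860, 862, 1164, 956]
start_after = [713, 744, 798, 800, 757, 759, 673, 777, 763, 831]


def percent_faster(before, after):
    """Percent reduction in mean time from `before` to `after`."""
    return 100 * (mean(before) - mean(after)) / mean(before)


print(mean(start_before))  # 951
print(mean(start_after))   # 761.5
print(round(percent_faster(start_before, start_after)))  # 20
```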
