ekg / hhga

haplotypes genotypes and alleles example decision synthesizer

Scaling up to 1000 genomes #15

Open nikete opened 8 years ago

nikete commented 8 years ago

We have two options here: cluster-style allreduce or Hogwild; we need back-of-the-envelope calculations for which of the two is best. Assuming the current machine is on a spinning disk, figure out how much faster Hogwild on an SSD would be (about 250 times the training size of 1 robot set).
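A rough sketch of the kind of back-of-the-envelope estimate meant here, assuming training is I/O-bound. The per-set size, disk throughputs, and number of passes are placeholder assumptions, not measurements:

```python
# Back-of-the-envelope estimate of I/O-bound training time.
# All numbers below are placeholder assumptions, not measurements.

SET_SIZE_GB = 4                    # assumed size of one training set
TRAIN_SIZE_GB = 250 * SET_SIZE_GB  # ~250x the single-set training size
HDD_MBPS = 120                     # assumed sequential read, spinning disk
SSD_MBPS = 500                     # assumed sequential read, SATA SSD
PASSES = 3                         # assumed number of passes over the data

def hours(size_gb, mbps, passes):
    """Time to stream the data `passes` times at `mbps` MB/s, in hours."""
    return size_gb * 1024 * passes / mbps / 3600

print(f"HDD: {hours(TRAIN_SIZE_GB, HDD_MBPS, PASSES):.1f} h")
print(f"SSD: {hours(TRAIN_SIZE_GB, SSD_MBPS, PASSES):.1f} h")
```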

nikete commented 8 years ago

On a very applied level, Hogwild has only been a pain to use. Given that the data only grows by ~100X at the lower depth, it seems we can learn a simple enough model in spanning-tree cluster mode, and Hogwild is not needed.

On a learning-theoretic note, it remains an important open issue how to incorporate the data from 1000 Genomes. The easiest thing is to do initial passes of learning on them and then adjust the weights with a few fast passes on the 50X data. This seems unlikely to lead to much improvement, to the degree that the varying levels of the features across the two representations will wash out any learning that can be transferred. Even with clipping the number of candidate alignments, we do not at the moment have a good normalization strategy to go from depth 7 to 50.
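A minimal sketch of the two-stage idea (initial passes on the low-depth 1000G examples, then a few fast passes on the 50X data), written with scikit-learn rather than the actual spanning-tree setup; the loader, file names, depth normalization, and learning rates are all assumptions for illustration:

```python
# Warm start on low-depth (1000G) examples, then fine-tune on 50X examples.
import numpy as np
from sklearn.linear_model import SGDClassifier

def load_examples(path, depth):
    """Hypothetical loader: returns (features, labels), with features crudely
    scaled by depth so 7X and 50X examples live on a comparable scale."""
    X, y = np.load(path + ".X.npy"), np.load(path + ".y.npy")
    return X / depth, y

clf = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.1)

# Stage 1: several passes over the low-depth 1000G-derived examples.
X_lo, y_lo = load_examples("examples_1000g", depth=7)
for _ in range(5):
    clf.partial_fit(X_lo, y_lo, classes=[0, 1])

# Stage 2: a few fast passes over the 50X examples with a smaller step,
# so the high-depth data adjusts the weights rather than overwriting them.
clf.eta0 = 0.01
X_hi, y_hi = load_examples("examples_50x", depth=50)
for _ in range(2):
    clf.partial_fit(X_hi, y_hi)
```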

nikete commented 8 years ago

A more principled approach is to use the data from the individual that we have in both the 1000G sample and the 50X-depth truth-set sample to calibrate. The simplest thing that could work is the single-feature approach for the non-structured case described in http://web.stanford.edu/~kuleshov/papers/nips2015.pdf
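A minimal sketch of that single-feature recalibration, assuming the model's scores on the shared individual's low-depth representation can be aligned per site with the truth-set labels from the 50X sample; the file names are hypothetical and isotonic regression stands in for whichever recalibration procedure is actually chosen:

```python
# Fit a score -> calibrated-probability map on the one shared individual,
# then apply it to scores from other low-depth (1000G) individuals.
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores_1000g = np.load("shared_individual.scores_1000g.npy")  # raw model scores
labels_50x = np.load("shared_individual.labels_50x.npy")      # truth-set labels

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(scores_1000g, labels_50x)

# Calibrated probabilities for another 1000G individual's scores.
calibrated = calibrator.predict(np.load("other_individual.scores_1000g.npy"))
```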