julesjacobsen closed 7 years ago
Fixed in #19.
One question remains: how precisely do we want to down-sample the empirical distributions for semantic similarity scores? The non-resampled data takes ~1.4GB; resampling to 200 points brings this down to 220MB. The other question is how to downsample. The current code in H2ScoreDistributionWriter.java samples to every (target points)/(old count) points. @pnrobinson @drseb
Previously, I rounded to 4 decimal places.
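To make the two options concrete, here is a minimal sketch of what they could look like: picking a fixed number of evenly spaced points from a sorted score distribution, and rounding scores to a fixed number of decimal places. This is not the code in H2ScoreDistributionWriter.java. The class name `DistributionDownsampler` and the methods `downsample` and `roundTo` are made up for illustration, and the sketch assumes the scores are already sorted in ascending order.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Phenix: illustrates two ways to shrink a score distribution.
final class DistributionDownsampler {

    private DistributionDownsampler() {
    }

    /**
     * Picks targetPoints values at evenly spaced indices from an ascending-sorted array.
     * The first and last values are always kept, so the range of the distribution is preserved.
     */
    static double[] downsample(double[] sortedScores, int targetPoints) {
        if (targetPoints < 2) {
            throw new IllegalArgumentException("targetPoints must be at least 2");
        }
        int n = sortedScores.length;
        if (n <= targetPoints) {
            // Nothing to reduce: return a copy of the input unchanged.
            return Arrays.copyOf(sortedScores, n);
        }
        double[] result = new double[targetPoints];
        for (int i = 0; i < targetPoints; i++) {
            // Map output position i onto the input index range [0, n - 1].
            int index = (int) Math.round((double) i * (n - 1) / (targetPoints - 1));
            result[i] = sortedScores[index];
        }
        return result;
    }

    /** Rounds a score to the given number of decimal places (e.g. 4). */
    static double roundTo(double value, int decimals) {
        double factor = Math.pow(10, decimals);
        return Math.round(value * factor) / factor;
    }
}
```

For example, `DistributionDownsampler.downsample(scores, 200)` would keep 200 points regardless of the original count, while `roundTo(score, 4)` only reduces precision and leaves the number of points unchanged. The two approaches trade off size against resolution differently, so which one (or which combination) fits best depends on how the p-values are looked up afterwards.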
Thanks for doing this - I've added a comment on the PR too.
Phenix currently uses some ancient score distributions generated against an HPO from back in the old days. We really need these to be updatable by the user, similar to how Jannovar allows updating of the ser files.
Ideally this mechanism should be both command-line driven and programmatically accessible as a library that other code can call, such as in an Exomiser build.