Update i/o for huge files

aimalz commented 3 years ago

@joezuntz pointed out that RAIL will soon encounter a serious scaling issue if we load entire data files of photometry for any nontrivial sample of galaxies. This issue can be closed when

[ ] rail.estimation.utils.load_data() provides an iterator to rail.estimation.Estimator.
[ ] rail.estimation.Estimator.pz_pdf is an iterator.
[ ] rail.estimation.Estimator.write_out() is moved to rail.estimation.utils.save_pz_pdf().

(This will also help with TXPipe compatibility when the time comes to extend rail.estimation.Estimator to a rail.summarization.Summarizer structure, for which a placeholder is being constructed for TXPipe development now.)

sschmidt23 commented 3 years ago

I was playing with a hacky version of an iterator in the issue/13/parallel_io branch. It's messy and probably not final, but works. I'll need to clean it up before we actually think about merging.

sschmidt23 commented 3 years ago

@aimalz Is there a reason to move the write out function from within the class to utils? It seems easier to have it in the superclass where the data lives, but I can move it outside of the class if that is needed.

sschmidt23 commented 3 years ago

Also, keeping the write out within the class would allow for easier custom write outs that override the generic superclass write_out, e.g. say we switch FlexZBoost to output basis coefficients instead of a redshift grid, or K nearest neighbor indices and their Mahalanobis distances, etc..., that can be done within a specific subclass write_out.

aimalz commented 3 years ago

@sschmidt23 Thanks for working on this!

Re: write_out function, a couple weeks ago, @joezuntz pointed out that it would be better to separate it from the estimators themselves in terms of how TXPipe could call RAIL, so the file handling is outside the estimator class and can be modified independently. It doesn't make a big difference for p(z) estimation, but it seems safer in the long run for when the RAIL p(z) estimator class may evolve into an n(z) summarizer wrapping infrastructure we have to write anyway for PZSummarize benchmarking.

I probably should have included links to the 3x2pt-related issues LSSTDESC/TXPipe/#99 and LSSTDESC/txpipe-reanalysis/#24 that motivated this issue. tl;dr People are starting to write n(z) estimator back-ends and need an interim superclass structure in the context of TXPipe validation, and, if I recall correctly, the thought was that moving the i/o out of the estimator class would make it easier to eventually swap in an evolved version of the RAIL estimator when the time comes. @joezuntz please correct me if I'm misremembering -- I want to jog my memory but can't seem to find where on GitHub the pseudocode you wrote on that call ended up.

In any case, in that context, it would actually be a feature rather than a bug to impose a uniform file write-out procedure because TXPipe will need to read in the files consistently regardless of the back-end. But, that won't necessarily prevent the subclasses (for p(z) or n(z)) from writing out to their optimal parameterization because the i/o function at the other end (so an n(z) summarizer or TXPipe stage requiring n(z), respectively) can just run them through qp to make sure they're in its expected parameterization, which would be distinct from the format of the files that the current i/o functions focus on.

(Please speak up if this appears to make no sense! It's been a sufficiently long day/week that all words feel like gibberish in my mouth.)

sschmidt23 commented 3 years ago

Ok, that makes sense, just wondering if we wanted a separate post-processing step or to force the consistency here. I moved the write functions back to utils.py. Right now several things are hard-coded, i.e. the dictionary created by the run_photoz function in each subclass must have zmode and pz_pdf keys. But, if we're looking for consistency, that is kinda what we're aiming for.

aimalz commented 3 years ago

I think #16 closed this.

LSSTDESC / rail_attic

Update i/o for huge files #13