adrn / thejoker

A custom Monte Carlo sampler for the (gravitational) two-body problem
MIT License

write_table_hdf5 won't accept PriorSamples instance #139

Open ella918 opened 2 weeks ago

ella918 commented 2 weeks ago

There is a discrepancy between the documentation and the code for reading in prior files that have already been created. Specifically, samples_helper.write_table_hdf5 requires the path to the prior file, but the docstring for TheJoker.rejection_sample says we can also pass in a PriorSamples instance. In other words, a prior file that has already been read in can't be used.
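A minimal, self-contained sketch of this kind of docstring/code mismatch (all names here are hypothetical stand-ins, not thejoker's actual internals): the docstring promises that either a path or an in-memory instance is accepted, but the body only handles a path.

```python
import os
import tempfile


class PriorSamplesStub:
    """Hypothetical stand-in for an in-memory PriorSamples-like object."""

    def __init__(self, values):
        self.values = values


def write_table_hdf5_stub(prior):
    """Docstring claims: `prior` may be a filesystem path *or* a
    PriorSamplesStub instance. The body, however, only handles paths,
    mirroring the discrepancy described in this issue."""
    if not isinstance(prior, (str, os.PathLike)):
        # An in-memory instance falls through to a TypeError here.
        raise TypeError(
            f"expected a path to a prior file, got {type(prior).__name__}"
        )
    with open(prior) as f:
        return f.read()


# A path works; an instance (which the docstring permits) does not.
with tempfile.NamedTemporaryFile("w", suffix=".hdf5", delete=False) as f:
    f.write("fake-prior-data")
    path = f.name

print(write_table_hdf5_stub(path))  # path input is accepted

try:
    write_table_hdf5_stub(PriorSamplesStub([1.0, 2.0]))
except TypeError as e:
    print("instance input fails:", e)
```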

When multiprocessing, the PriorSamples instance is re-created over and over instead of being read in once. In my code this is causing an error on the HPC cluster I am using, because the new prior samples are written to a tmp folder with limited storage space.

@stephtdouglas thinks we can fix this one ourselves and submit a PR, but please let us know if we should not do this.

adrn commented 1 week ago

Thanks for reporting this! Can you clarify: is the error with reading or writing the prior files? Do you have some example code you can point me to that isn't doing what you expect? If there is an issue, yes it would be great to get a PR to fix this!

stephtdouglas commented 1 week ago

It's both, sort of. I haven't had her try to make an MWE yet, because we only get an error when the /tmp/ directory on the cluster fills up as every job tries to recreate the priors (I think). I'll outline what we know here, and we'll work on an MWE as part of putting together a PR.

Ella is creating a single prior file ahead of time, then she successfully reads that back into a PriorSamples object. That part's working.

Then she passes that PriorSamples object to TheJoker.rejection_sample. We're not working in memory, so that PriorSamples object is passed to rejection_sample_helper.

While multiproc_helpers.rejection_sample_helper asks for a prior_samples_file, the code itself seems to expect only a path. I'm a little fuzzy on exactly why this happens, but the error is raised inside the tempfile_decorator wrapper function, where it tries to write the samples to a tempfile. On our HPC cluster the /tmp/ directory is small and basically fills up with a single prior file, so every object beyond the first segfaults because /tmp/ is full.
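One possible shape for a fix (a sketch only, using hypothetical names rather than thejoker's actual tempfile_decorator or rejection_sample_helper): have the wrapper pass an existing path straight through, and only materialize a temporary file when it is handed an in-memory object. That way every multiprocessing worker reuses the one prior file on disk instead of each writing its own copy into /tmp/.

```python
import functools
import os
import tempfile


def tempfile_guard(func):
    """Hypothetical decorator in the spirit of a tempfile_decorator,
    but it only writes samples to a new temporary file when the input
    is NOT already a path. Existing paths pass straight through."""

    @functools.wraps(func)
    def wrapper(prior, *args, **kwargs):
        if isinstance(prior, (str, os.PathLike)):
            # Already on disk: no tempfile needed, no /tmp/ usage.
            return func(prior, *args, **kwargs)
        # In-memory object: serialize it to a temporary file once.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".hdf5", delete=False
        ) as f:
            f.write(repr(prior))  # stand-in for real serialization
            tmp_path = f.name
        try:
            return func(tmp_path, *args, **kwargs)
        finally:
            os.unlink(tmp_path)

    return wrapper


@tempfile_guard
def rejection_sample_stub(prior_samples_file):
    # Stand-in for the part of the helper that reads the prior file.
    return os.path.exists(prior_samples_file)


print(rejection_sample_stub(os.getcwd()))  # existing path: passed through
```

The key design point is the isinstance check up front: it makes the path argument and the in-memory argument converge on the same downstream code, which is what the docstring already promises.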

By passing a path string into TheJoker.rejection_sample instead, we avoid the segfault.