Closed eboileau closed 1 year ago
@eboileau
The `/tmp` partition is relatively small, but there is often another scratch partition on the node offering temporary storage that you can use:

```shell
export TMPDIR=/path/to/local/scratch
```

before running. (`df -ih` shows disk and inode usage.)
We also need to fix the cmdstanpy verbose output...
Does this work?
```python
import logging

cmdstanpy_logger = logging.getLogger("cmdstanpy")
cmdstanpy_logger.disabled = True
```
https://mc-stan.org/cmdstanpy/users-guide/outputs.html#logging
Or is the verbose output coming directly from the stan executable rather than from CmdStanPy?
@lkeegan thanks for your feedback.
Yes, for the first issue, I'm currently looking into this. CmdStanPy uses Python's standard-library `tempfile` module, and there is also an `output_dir` argument; I'm not sure whether that could be used to point to a `/scratch` partition. I need to check the documentation. I'm running again, monitoring disk usage, etc.
For the verbose output, I will try this, thanks.
I'm not sure whether setting the `output_dir` argument would affect where any intermediate temporary files go, but you can directly set which tmp dir Python's `tempfile` module uses via environment variables, e.g.

```shell
export TMPDIR=/path/to/local/scratch
```

see https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp (this also applies to `NamedTemporaryFile`, `TemporaryFile`, etc.)
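As a quick sanity check (a minimal, generic sketch, nothing project-specific), you can verify which directory `tempfile` will actually use. Note that `tempfile` caches its choice the first time it is needed, so the cache must be cleared if `TMPDIR` changes mid-process:

```python
import os
import tempfile

# Use /tmp here as a stand-in for /path/to/local/scratch.
os.environ["TMPDIR"] = "/tmp"

# tempfile caches its default directory; reset the cache so the
# TMPDIR environment variable is re-read by gettempdir().
tempfile.tempdir = None
print(tempfile.gettempdir())
```

In a batch job you would normally export `TMPDIR` before Python starts, in which case no cache reset is needed.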
Ok, still running, but the conclusions are pretty clear... and they are "independent" of our own parallelisation implementation. Each time a model is sampled, a directory is created, e.g. `/tmp/tmprq4hm10f/periodic-gaussian-mixturezzj4r4wr`. Here we sample two models (`periodic-gaussian-mixture` and `gaussian-naive-bayes`), but for each ORF, so for ~500,000 ORFs this makes ~1M sub-directories... and each one contains 2 files per chain (we have 4 chains). The files are small, ~50K each, but for 1M directories we quickly reach over 50GB... The tmp directory is only cleaned on exit, via `atexit.register(_cleanup_tmpdir)`.
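Taking the figures above at face value (~500,000 ORFs, 2 models per ORF, a handful of ~50K files per sampler directory; all numbers come from the observation above, the file count per directory is an assumption), a back-of-the-envelope estimate shows why the partition fills up:

```python
# Rough footprint of cmdstanpy's per-sample temp directories, using the
# numbers reported above (assumed, not measured here).
n_orfs = 500_000
models_per_orf = 2   # periodic-gaussian-mixture + gaussian-naive-bayes
files_per_dir = 8    # assumption: 2 files per chain x 4 chains
file_kb = 50         # ~50K each

n_dirs = n_orfs * models_per_orf
total_gb = n_dirs * files_per_dir * file_kb / 1024**2
print(f"{n_dirs:,} directories, ~{total_gb:.0f} GB")
```

Even with conservative assumptions this blows well past 50GB, and ~1M directories of small files can also exhaust the inode table (hence `df -ih`).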
...
One thing we could try is to specify `output_dir` and implement some cleanup routine... but this is not ideal, and might not even work, i.e. I don't know when/if the files are used by cmdstanpy...

I don't really know what to do... We could probably try setting `TMPDIR`, but this needs to be easy to handle for users and platform-independent. Users with limited knowledge might not even know which directory to use... and in the worst case, they might not have the required space...
So `get_bayes_factor` is called in parallel, once for each ORF.
This function does a bunch of cmdstanpy stuff (which generates some files that get used by cmdstanpy) and returns a result.
You only care about the return value, not the generated files.
Is that correct?
If so, it seems like it should be fine to set `output_dir` within `get_bayes_factor` to some known tmp dir that gets cleaned up when the function returns, e.g.

```python
def get_bayes_factor(profile, translated_models, untranslated_models, args):
    with tempfile.TemporaryDirectory() as tmpdirname:
        ...
        *.sample(..., output_dir=tmpdirname)
        ...
```
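For reference, a self-contained illustration of the cleanup semantics this relies on (plain `tempfile` behaviour, no cmdstanpy involved): everything written under the context manager's directory is removed when the `with` block exits, so per-ORF sampler output never accumulates:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:
    # stand-in for the CSV files the sampler would write
    path = os.path.join(tmpdirname, "chain-1.csv")
    with open(path, "w") as f:
        f.write("dummy sampler output")
    assert os.path.exists(path)

# the directory and all of its contents are gone once the block exits
print(os.path.exists(tmpdirname))
```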
Does that make sense, or am I missing something?
This could work, I will try!
Thanks, this seems to work for the small example. I will try with the larger dataset before marking this issue as resolved.
Maybe I should also check `estimate-metagene-profile-bayes-factors`. Although I haven't had errors there, it could be that we were just at the limit... but the way we call cmdstanpy is different there, so I need to see how many files are actually being written...
For the logging issue, `cmdstanpy_logger.disabled = True` works fine, but I would like to keep the option to log for debugging. Unless the logger is disabled, the verbose output pollutes all log files...
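One way to expose this as an option (a sketch only; `configure_cmdstanpy_logging` is a hypothetical helper name, not part of the codebase):

```python
import logging

def configure_cmdstanpy_logging(debug: bool) -> None:
    # hypothetical helper: silence cmdstanpy's logger unless
    # debugging output was explicitly requested
    logging.getLogger("cmdstanpy").disabled = not debug

configure_cmdstanpy_logging(debug=False)
print(logging.getLogger("cmdstanpy").disabled)
```

The flag can then simply be wired to a command-line argument.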
For `estimate_metagene_profile_bayes_factors.py`, profiles are grouped by length, so e.g. if we have 35 lengths, 4 models, and profiles of length 21, this makes ~3,000 /tmp files, and sampling is quicker, which is why we do not see any significant disk and/or load footprint. I would leave it as is.
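The count above works out as follows (numbers taken from the sentence above; the multiplication is an assumption about how the files scale):

```python
n_lengths = 35       # read lengths
n_models = 4
profile_length = 21  # positions per profile

approx_tmp_files = n_lengths * n_models * profile_length
print(approx_tmp_files)
```

At ~3,000 files this is three orders of magnitude below the ~1M directories seen in the per-ORF case.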
For the logging issue, I just left it off by default and added a flag to turn it on for debugging. I'm not sure why, but in `estimate_metagene_profile_bayes_factors.py` this has to be done inside `estimate_profile_bayes_factors`; if I leave it in the main function, it doesn't seem to work...
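A plausible explanation (an assumption on my part, not verified against the codebase): if the parallel workers are started with the "spawn" start method rather than "fork", logger state set in the parent process is not inherited by the children, so the logger has to be disabled inside the worker function itself. A minimal sketch:

```python
import logging
import multiprocessing as mp

def worker(_):
    # assumption: the disable must happen here, inside the worker process,
    # because spawned children do not inherit the parent's logger state
    logging.getLogger("cmdstanpy").disabled = True
    return logging.getLogger("cmdstanpy").disabled

if __name__ == "__main__":
    # disabling in the parent alone would not reach spawned workers
    logging.getLogger("cmdstanpy").disabled = True
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(all(pool.map(worker, range(2))))
```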
Description

So far, we have run the tests on the small c-elegans dataset. I ran the pipeline on a larger dataset, and it ran until

For another sample, the error is a little different:
I don't know whether this is related to some interaction between cmdstanpy and parallel? I doubt there is no space left on the cluster... unless maybe this is related to the number of open files...?
We also need to fix the cmdstanpy verbose output...
To Reproduce

Run the pipeline:

```shell
run-all-rpbp-instances ...
```
Environment