aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Bug report [BUG] Ray temp is not writable when running create_cistopic_object_from_fragments with n_cpu>1 #16

Closed cflerin closed 2 years ago

cflerin commented 3 years ago

Describe the bug When running create_cistopic_object_from_fragments with n_cpu>1, the default temp location for Ray can't be written to. Figured this out with @dweemx today.

To Reproduce

    tmp_cto = create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                                    path_to_regions=path_to_regions,
                                                    path_to_blacklist=path_to_blacklist,
                                                    metrics=metadata_bc_dict[key],
                                                    valid_bc=bc_passing_filters[key],
                                                    n_cpu=4,
                                                    project=key)

Error output

---------------------------------------------------------------------------
FileExistsError                           Traceback (most recent call last)
/tmp/ipykernel_36829/2911727400.py in <module>
      1 #Create objects
----> 2 cistopic_obj_list=[
      3     create_cistopic_object_from_fragments(
      4         path_to_fragments=DATA[key]['fragments'],
      5         path_to_regions=path_to_regions,
/tmp/ipykernel_36829/2911727400.py in <listcomp>(.0)
      1 #Create objects
      2 cistopic_obj_list=[
----> 3     create_cistopic_object_from_fragments(
      4         path_to_fragments=DATA[key]['fragments'],
      5         path_to_regions=path_to_regions,
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/pycisTopic/cistopic_class.py in create_cistopic_object_from_fragments(path_to_fragments, path_to_regions, path_to_blacklist, metrics, valid_bc, n_cpu, min_frag, min_cell, is_acc, check_for_duplicates, project, partition, fragments_df)
    770     # Count fragments in regions
    771     log.info('Counting fragments in regions')
--> 772     fragments_in_regions = regions.join(fragments, nb_cpu=n_cpu)
    773     # Convert to pandas
    774     counts_df = pd.concat([fragments_in_regions.regionID.astype("category"),
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/pyranges/pyranges.py in join(self, other, strandedness, how, report_overlap, slack, suffix, nb_cpu)
   2151             kwargs["example_header_self"] = self.head(1).df
   2152 
-> 2153         dfs = pyrange_apply(_write_both, self, other, **kwargs)
   2154         gr = PyRanges(dfs)
   2155 
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/pyranges/multithreaded.py in pyrange_apply(function, self, other, **kwargs)
    190         import ray
    191         with suppress_stdout_stderr():
--> 192             ray.init(num_cpus=nb_cpu, ignore_reinit_error=True)
    193 
    194     function, get, _merge_dfs = get_multithreaded_funcs(
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
     60         if client_mode_should_convert():
     61             return getattr(ray, func.__name__)(*args, **kwargs)
---> 62         return func(*args, **kwargs)
     63 
     64     return wrapper
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/ray/worker.py in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _lru_evict, _metrics_export_port, _system_config, _tracing_startup_hook)
    800         # handler. We still spawn a reaper process in case the atexit handler
    801         # isn't called.
--> 802         _global_node = ray.node.Node(
    803             head=True,
    804             shutdown_at_exit=False,
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/ray/node.py in __init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    164             self.session_name = ray._private.utils.decode(session_name)
    165 
--> 166         self._init_temp(redis_client)
    167 
    168         # If it is a head node, try validating if
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/ray/node.py in _init_temp(self, redis_client)
    276             self._temp_dir = ray._private.utils.decode(temp_dir)
    277 
--> 278         try_to_create_directory(self._temp_dir)
    279 
    280         if self.head:
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/site-packages/ray/_private/utils.py in try_to_create_directory(directory_path)
    795     """
    796     directory_path = os.path.expanduser(directory_path)
--> 797     os.makedirs(directory_path, exist_ok=True)
    798     # Change the log directory permissions so others can use it. This is
    799     # important when multiple people are using the same machine.
/lustre1/project/stg_00002/lcb/dwmax/software/genius/miniconda3/4.7.10/envs/microglia/lib/python3.8/os.py in makedirs(name, mode, exist_ok)
    221             return
    222     try:
--> 223         mkdir(name, mode)
    224     except OSError:
    225         # Cannot rely on checking for EEXIST, since the operating system
FileExistsError: [Errno 17] File exists: '/tmp/ray'

Expected behavior Ray's temp directory should be redirectable to another location so that the function can complete.

Screenshots N/A

Version (please complete the following information):

Additional context Apparently this is due to pyranges using Ray for parallelization, but there is no option to redirect the temp location (that I know of). Possibly we could add a _temp_dir parameter and pass it on to pyranges. We did try os.environ["TMPDIR"] = "/path/to/new/tmp", but it does not seem to have any effect.

cbravo93 commented 3 years ago

Hi @cflerin @dweemx !

I have already discussed this with several people on Slack. The error comes from the symbolic links Kris made to avoid writing directly to /tmp (you may see it too if you don't set the temp_dir in other functions). In principle it shouldn't raise the error and should just write to the linked folder; I will check it out with Kris. Which core(s) have you tried this on? I also leave a fast test case below:

# Reproducible pyranges error
import pyranges as pr
f1 = pr.from_dict({'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5],
                   'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2']})
f2 = pr.from_dict({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
                   'End': [2, 7], 'Name': ['a', 'b']})
f1.join(f2, nb_cpu=2) 
# Error

The best solution for now is to use only 1 core. The running time is very similar to running with Ray anyway (when using Ray, some time is spent initializing the dashboard), so I could also force n_cpu=1 internally instead of exposing it as a parameter. The operations we do with pyranges are quite fast, so we could use this to avoid Ray altogether in these steps :).
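To illustrate why these joins don't need Ray, here is a stdlib-only sketch of a naive single-threaded overlap join over the same toy intervals used in the test case (a hypothetical helper for illustration, not the pyranges implementation):

```python
def overlap_join(left, right):
    # Naive O(n*m) interval overlap join; left/right are lists of
    # (chromosome, start, end, name) tuples with half-open coordinates.
    hits = []
    for chrom_l, start_l, end_l, name_l in left:
        for chrom_r, start_r, end_r, name_r in right:
            # Two intervals overlap iff each starts before the other ends.
            if chrom_l == chrom_r and start_l < end_r and start_r < end_l:
                hits.append((chrom_l, name_l, name_r))
    return hits

# Same intervals as the pyranges example above
f1 = [("chr1", 3, 6, "interval1"), ("chr1", 8, 9, "interval3"), ("chr1", 5, 7, "interval2")]
f2 = [("chr1", 1, 2, "a"), ("chr1", 6, 7, "b")]
print(overlap_join(f1, f2))  # [('chr1', 'interval2', 'b')]
```

For inputs of this size, the per-join work is trivial compared to the cost of starting a Ray cluster.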

For configuring _temp_dir without calling ray.init there is no stable way, unfortunately. We could also open an issue in pyranges (it would be a minor change in their multithreaded.py file), or directly in Ray. The issue is that if _temp_dir=None in ray.init (the default, and what pyranges uses), this is the function that is called to determine _temp_dir:

def get_user_temp_dir():
    if sys.platform.startswith("darwin") or sys.platform.startswith("linux"):
        # Ideally we wouldn't need this fallback, but keep it for now
        # for compatibility
        tempdir = os.path.join(os.sep, "tmp")
    else:
        tempdir = tempfile.gettempdir()
    return tempdir

This basically forces the temp dir to be /tmp/ray, since our system is Linux (I'm not happy with their solution though).
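The effect is easy to verify with a standalone copy of that helper: setting TMPDIR changes nothing on Linux or macOS, because the environment is never consulted on those platforms:

```python
import os
import sys
import tempfile

def get_user_temp_dir():
    # Standalone copy of the Ray helper quoted above: on Linux/macOS the
    # temp dir is hard-coded to /tmp, so TMPDIR is never consulted.
    if sys.platform.startswith("darwin") or sys.platform.startswith("linux"):
        tempdir = os.path.join(os.sep, "tmp")
    else:
        tempdir = tempfile.gettempdir()
    return tempdir

os.environ["TMPDIR"] = "/path/to/new/tmp"
print(get_user_temp_dir())  # still "/tmp" on Linux/macOS
```

This is why the os.environ["TMPDIR"] workaround from the original report had no effect.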

So for now, use n_cpu=1 and I'll check with Kris whether we can make it work with the symbolic links (this is relevant for other functions too). Otherwise, we could 1) avoid Ray for the pyranges operations (no huge difference in performance anyway), 2) open an issue in pyranges to pass _temp_dir, or 3) open an issue in Ray to use the Python tmpdir instead of forcing it to be /tmp/ray (the two latter depend on their developers).

Cheers, keep you posted!

C

cflerin commented 3 years ago

I guess the easiest thing is to just set n_cpu=1 and leave it like that, since there is little performance hit.

When running from the Singularity image, we can force tmp elsewhere with a volume mapping (singularity run -B /new/path/to/tmp:/tmp ...) and this works well also.

ghuls commented 3 years ago

@cflerin Probably worth adding to your pycisTopic jupyter kernel config file.

cbravo93 commented 3 years ago

This seems to be solved in the development version at least: https://github.com/ray-project/ray/blob/fabba96fadf833dfeb9d10d9704debc9454f4815/python/ray/_private/utils.py (get_user_temp_dir), so updating Ray to that version and setting os.environ['RAY_TMPDIR'] should work well too.
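A simplified, stdlib-only sketch of how the updated helper behaves, assuming (per the linked source) that an explicit RAY_TMPDIR is checked before the hard-coded fallback:

```python
import os
import sys
import tempfile

def get_user_temp_dir():
    # Sketch of the fixed Ray helper: an explicit RAY_TMPDIR environment
    # variable now takes precedence over the hard-coded /tmp fallback.
    if "RAY_TMPDIR" in os.environ:
        return os.environ["RAY_TMPDIR"]
    if sys.platform.startswith("darwin") or sys.platform.startswith("linux"):
        return os.path.join(os.sep, "tmp")
    return tempfile.gettempdir()

os.environ["RAY_TMPDIR"] = "/scratch/ray_tmp"
print(get_user_temp_dir())  # "/scratch/ray_tmp"
```

With this change, redirecting Ray's temp dir no longer requires touching ray.init or pyranges at all.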

cbravo93 commented 3 years ago

Indeed, installing the latest daily release (pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl; see https://docs.ray.io/en/master/installation.html for Python versions other than 3.7) and running the code below works now (also with 'TMPDIR' in this release). If you find this a good solution, you can close the issue :).

# Reproducible pyranges error
import pyranges as pr
import os
os.putenv('TMPDIR','/scratch/leuven/313/vsc31305/ray_spill')
f1 = pr.from_dict({'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5],
                   'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2']})
f2 = pr.from_dict({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
                   'End': [2, 7], 'Name': ['a', 'b']})
f1.join(f2, nb_cpu=2) 
# Works!