pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
export_pseudobulk [BUG] #55

Open alexlenail opened 1 year ago

alexlenail commented 1 year ago

Thanks for making this package available. I submitted a slurm job with this function:

    bw_paths, bed_paths = export_pseudobulk(input_data=metadata,

I've been waiting for this step for about 96h. The log file reads:

2023-01-20 19:40:23,007 cisTopic     INFO     Reading fragments from ...sample1/outs/atac_fragments.tsv.gz
2023-01-20 19:46:34,790 cisTopic     INFO     Reading fragments from ...sample2/outs/atac_fragments.tsv.gz
2023-01-20 19:52:13,172 cisTopic     INFO     Reading fragments from ...sample3/outs/atac_fragments.tsv.gz
2023-01-20 19:58:06,604 cisTopic     INFO     Reading fragments from ...sample4/outs/atac_fragments.tsv.gz
2023-01-20 20:05:07,534 cisTopic     INFO     Reading fragments from ...sample5/outs/atac_fragments.tsv.gz
2023-01-20 20:10:28,990 cisTopic     INFO     Reading fragments from ...sample6/outs/atac_fragments.tsv.gz
2023-01-20 20:16:22,397 cisTopic     INFO     Reading fragments from ...sample7/outs/atac_fragments.tsv.gz
2023-01-20 20:23:02,828 cisTopic     INFO     Reading fragments from ...sample8/outs/atac_fragments.tsv.gz

And nothing else -- hasn't changed in 96h. Inspecting the job, it's using 2% of only 1 of the 48 cores, and 60GB RAM (of 260GB provided).

The tutorial suggests it should launch Ray jobs and print something like:

(export_pseudobulk_ray pid=12108) 2022-08-08 18:30:07,800 cisTopic INFO Creating pseudobulk for AST

but I'm not seeing that in my output. Does anything stick out to you as a mistake here?

cbravo93 commented 1 year ago

Hi @alexlenail !

Do you get any ray message at all? This message for instance:

2022-08-05 16:50:45,105 INFO -- View the Ray dashboard at

If not could you try:

Can you also show what metadata looks like?



alexlenail commented 1 year ago

Thanks for the help debugging, @cbravo93 !


Metadata looks like:


When I use n_cpu=1 though, I get this error message:


Which I can probably fix by changing the chromsizes inputs. But that leads me to wonder: maybe these errors aren't properly bubbling up when Ray is used (n_cpu > 1)?

alexlenail commented 1 year ago


alexlenail commented 1 year ago

The filesystem I'm using doesn't support file locking on the home directory. Am I reading correctly that Ray is trying to lock a file to write metrics to? Is there a way to disable that? Maybe one of the **kwargs passed to ray.init() ?

cbravo93 commented 1 year ago

Hi Alex!

Can you put your _temp_dir path somewhere that is not home? That should fix the problem :)
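
One way to check ahead of time whether a candidate directory supports locking is to try taking a lock on a throwaway file there. This is only a sketch: `supports_file_locking` is a hypothetical helper, it relies on `fcntl.flock` from the standard library (Unix only), and whether Ray uses the same locking call internally is an assumption.

```python
import fcntl
import tempfile


def supports_file_locking(directory: str) -> bool:
    """Return True if an exclusive flock can be taken on a file in `directory`."""
    try:
        # Create a throwaway file inside the candidate directory.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            # Non-blocking exclusive lock; an OSError here means the
            # filesystem refused the lock (or the directory is unusable).
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
        return True
    except OSError:
        return False
```

If this returns False for the home directory but True for a scratch path, the scratch path is probably the better candidate for _temp_dir.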


dburkhardt commented 1 year ago

@cbravo93 just a note, I'm also having difficulty using Ray. The issue looks very similar to what @alexlenail is posting, and it took me a while to troubleshoot.

However, I do get a different error that suggests nodes aren't starting up.

cbravo93 commented 1 year ago

Hi @dburkhardt !

Can you paste your full command? Can you try:

* ray.init()

* ray.init(n_cpu=2)

* ray.init(_temp_dir=XXX)

We normally do not work in /tmp in our system because you need some space for files to be written.


cbravo93 commented 1 year ago

@dburkhardt can you check ?

dburkhardt commented 1 year ago

> Hi @dburkhardt !
>
> Can you paste your full command? Can you try:
>
> * ray.init()
> * ray.init(n_cpu=2)
> * ray.init(_temp_dir=XXX)
>
> We normally do not work in /tmp in our system because you need some space for files to be written.



These work

cbravo93 commented 1 year ago

@dburkhardt can you check with n_cpu=1 and see if any error pops up? Can you check chromsizes match with the fragments file?
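
A quick way to do the chromsizes check is to collect the chromosome names from the fragments file and compare them with the names in chromsizes. The sketch below assumes the usual tab-separated fragments layout with the chromosome in the first column and optional `#` header lines; the function names are hypothetical and not part of pycisTopic.

```python
import gzip


def chromosomes_in_fragments(fragments_path, max_lines=None):
    """Collect chromosome names from the first column of a gzipped fragments file."""
    seen = set()
    with gzip.open(fragments_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Optional cap; fragments files are usually sorted by chromosome,
            # so a small cap will only see the first few chromosomes.
            if max_lines is not None and line_number >= max_lines:
                break
            # Skip header/comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            seen.add(line.split("\t", 1)[0])
    return seen


def missing_from_chromsizes(fragments_path, chromsizes_names, max_lines=None):
    """Return chromosome names present in the fragments file but absent from chromsizes."""
    found = chromosomes_in_fragments(fragments_path, max_lines)
    return sorted(found - set(chromsizes_names))
```

A non-empty result (for example `1` in the fragments versus `chr1` in chromsizes) would point to a naming mismatch. Scanning a whole fragments file takes a while, so this is meant as a one-off diagnostic. If chromsizes is a DataFrame with a Chromosome column (an assumption about your setup), pass `chromsizes["Chromosome"]` as `chromsizes_names`.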

dburkhardt commented 1 year ago

This does work with n_cpu=1 with correct output, and I can proceed with the rest of the analysis. The issue only comes up when trying to run with n_cpu>1.

cbravo93 commented 1 year ago

Does os.path.join(tmp_dir, 'ray_spill') exist? I just managed to reproduce the error with

dburkhardt commented 1 year ago

Interesting, if I just omit _temp_dir from the export_pseudobulk command, then it works fine. It seems like there are multiple issues with manually specifying this parameter, rather than using the default.

Looking at the ray documentation:

(There is not currently a stable way to change the root temporary directory when calling ray.init(), but if you need to, you can provide the _temp_dir argument to ray.init().)

So maybe best just to omit this?

cbravo93 commented 1 year ago

If you don't define _temp_dir it will use /tmp. If you have limited space in /tmp I would recommend specifying _temp_dir, especially downstream. What is your value for os.path.join(tmp_dir, 'ray_spill')?

dburkhardt commented 1 year ago

No, the temporary directory didn't exist before calling export_pseudobulk. I wouldn't expect it to need to exist before, though, because it's temporary. You don't check for that before passing the _temp_dir to ray?

cbravo93 commented 1 year ago

It is checked internally by ray. What is your value for os.path.join(tmp_dir, 'ray_spill')?

alexlenail commented 1 year ago

@cbravo93 did you run into this issue at all?

when I set _temp_dir = '/state/partition1/user/lenail' I get:

OSError: AF_UNIX path length cannot exceed 107 bytes: '/state/partition1/slurm_tmp/21322766.4294967291.0/ray/session_2023-02-02_13-15-02_941811_78876/sockets/plasma_store'

What do you set as _temp_dir ?

jflucier commented 1 year ago


I noticed essentially the same behavior that was reported in this issue. To sum it up:

  1. When I remove _temp_dir and specify n_cpu>1, it successfully creates the bed and bw files.
  2. _temp_dir must exist prior to calling export_pseudobulk with n_cpu>1, or else it fails to create the bed and bw files.
  3. If _temp_dir is a "long path", an error is thrown: OSError: AF_UNIX path length cannot exceed 107 bytes. This is a known ray bug, but it doesn't seem to be resolved in version 2.4.

Thanks for all the comments in this issue. I agree with @dburkhardt: I think this directory should be created automatically when calling export_pseudobulk.
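
Until that happens, points 2 and 3 can be handled before the call with a small helper along these lines. This is a sketch, not pycisTopic code: `prepare_temp_dir` is a hypothetical name, and the socket suffix is copied from the traceback earlier in this thread, so its length is only an estimate (the timestamp and PID parts vary).

```python
import os

# Suffix Ray appended to the temp dir in the traceback quoted in this thread;
# used here only to estimate how long the final socket path will be.
EXAMPLE_SUFFIX = "/session_2023-02-02_13-15-02_941811_78876/sockets/plasma_store"
AF_UNIX_LIMIT = 107  # bytes, taken from the OSError message


def prepare_temp_dir(path, suffix=EXAMPLE_SUFFIX, limit=AF_UNIX_LIMIT):
    """Create `path` if needed, after checking a Ray-style socket path would fit."""
    path = os.path.abspath(path)
    projected = len(os.fsencode(path + suffix))
    if projected > limit:
        raise ValueError(
            f"{path!r} is too long: the socket path would be about "
            f"{projected} bytes (limit {limit})"
        )
    # Ray does not appear to create the directory itself (point 2 above).
    os.makedirs(path, exist_ok=True)
    return path
```

The returned path can then be passed as _temp_dir to export_pseudobulk. With the suffix above, the directory path itself has to stay at roughly 45 bytes or fewer.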

massonix commented 1 year ago

Hi @cbravo93, thanks for developing this amazing tool!

I had a similar problem with ray and the export_pseudobulk function. In my case, I was not even able to initialize ray following the code you provided. Running:

import ray

Raised the following error:

2023-08-22 11:22:18,012 ERROR -- Failed to start the dashboard , return code 1
2023-08-22 11:22:18,013 ERROR -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See '' to find where the log file is.
2023-08-22 11:22:18,014 ERROR -- 
The last 20 lines of /scratch_tmp/10407352/ray/session_2023-08-22_11-22-13_701675_119199/logs/dashboard.log (it contains the error message from the dashboard): 
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/importlib/", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/dashboard/modules/log/", line 8, in <module>
    from ray.util.state.common import (
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/util/state/", line 1, in <module>
    from ray.util.state.api import (
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/util/state/", line 17, in <module>
    from ray.util.state.common import (
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/util/state/", line 120, in <module>
  File "/home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/pydantic/", line 139, in dataclass
    assert init is False, 'pydantic.dataclasses.dataclass only supports init=False'
AssertionError: pydantic.dataclasses.dataclass only supports init=False
2023-08-22 11:22:18,142 INFO -- Started a local Ray instance.

I checked this issue in the ray GitHub, and the problem seems to be with version 2 of pydantic. I downgraded pydantic as follows:

conda activate scenicplus
which python
~/anaconda3/envs/scenicplus/bin/pip install "pydantic<2"

And then everything worked. This should be fixed with version 2.6 of pydantic.
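
To see whether an environment is affected before starting ray, the installed pydantic version can be read with the standard library. A minimal sketch; `installed_major_version` is a hypothetical helper, and treating any pydantic 2.x as a risk is an assumption based only on the traceback above.

```python
import re
from importlib import metadata


def installed_major_version(package):
    """Return the installed major version of `package`, or None if unknown."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    # Take the leading digits of the version string as the major version.
    match = re.match(r"\d+", version)
    return int(match.group()) if match else None


major = installed_major_version("pydantic")
if major is not None and major >= 2:
    print("pydantic >= 2 detected; the Ray dashboard may fail to start")
```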