aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
190 stars 31 forks source link

Peakcalling CalledprocessError #222

Closed kjiiii closed 10 months ago

kjiiii commented 1 year ago

Hi

First thank you for developing this amazing tool. I am using scenicplus in a ubuntu server conda environment, which is not a root account.

i was in a peak calling stage but an error which i haven't seen before came out here is my code

which macs2 -->/opt/anaconda3/envs/scenicplus/bin/macs2

I downloaded macs2 module with a pip and checked the path

import pickle
bed_paths = pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'rb'))
bw_paths =  pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'rb'))
from pycisTopic.pseudobulk_peak_calling import peak_calling
macs_path='/opt/anaconda3/envs/scenicplus/bin/macs2'
# Run peak calling
narrow_peaks_dict = peak_calling(macs_path,
                                 bed_paths,
                                 os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/'),
                                 genome_size='hs',
                                 n_cpu=1,
                                 input_format='BEDPE',
                                 shift=73,
                                 ext_size=146,
                                 keep_dup = 'all',
                                 q_value = 0.05)

2023-09-07 16:03:20,976 cisTopic     INFO     Calling peaks for B_cells with /opt/anaconda3/envs/scenicplus/bin/macs2 callpeak --treatment nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz --name B_cells  --outdir nsclc/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:697, in MACSCallPeak.call_peak(self)
    696 try:
--> 697     subprocess.check_output(args=cmd, shell=True, stderr=subprocess.STDOUT)
    698 except subprocess.CalledProcessError as e:

File /opt/anaconda3/envs/scenicplus/lib/python3.8/subprocess.py:415, in check_output(timeout, *popenargs, **kwargs)
    413     kwargs['input'] = empty
--> 415 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    416            **kwargs).stdout

File /opt/anaconda3/envs/scenicplus/lib/python3.8/subprocess.py:516, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    515     if check and retcode:
--> 516         raise CalledProcessError(retcode, process.args,
    517                                  output=stdout, stderr=stderr)
    518 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '/opt/anaconda3/envs/scenicplus/bin/macs2 callpeak --treatment nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz --name B_cells  --outdir nsclc/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[55], line 2
      1 # Run peak calling
----> 2 narrow_peaks_dict = peak_calling(macs_path,
      3                                  bed_paths,
      4                                  os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/'),
      5                                  genome_size='hs',
      6                                  n_cpu=1,
      7                                  input_format='BEDPE',
      8                                  shift=73,
      9                                  ext_size=146,
     10                                  keep_dup = 'all',
     11                                  q_value = 0.05)

File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:435, in peak_calling(macs_path, bed_paths, outdir, genome_size, n_cpu, input_format, shift, ext_size, keep_dup, q_value, nolambda, **kwargs)
    433     ray.shutdown()
    434 else:
--> 435     narrow_peaks = [macs_call_peak(
    436                 macs_path,
    437                 bed_paths[name],
    438                 name,
    439                 outdir,
    440                 genome_size,
    441                 input_format,
    442                 shift,
    443                 ext_size,
    444                 keep_dup,
    445                 q_value,
    446                 nolambda,
    447             )
    448             for name in list(bed_paths.keys())
    449         ]
    450 narrow_peaks_dict = {
    451     list(bed_paths.keys())[i]: narrow_peaks[i].narrow_peak
    452     for i in range(len(narrow_peaks))
    453 }
    454 return narrow_peaks_dict

File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:435, in <listcomp>(.0)
    433     ray.shutdown()
    434 else:
--> 435     narrow_peaks = [macs_call_peak(
    436                 macs_path,
    437                 bed_paths[name],
    438                 name,
    439                 outdir,
    440                 genome_size,
    441                 input_format,
    442                 shift,
    443                 ext_size,
    444                 keep_dup,
    445                 q_value,
    446                 nolambda,
    447             )
    448             for name in list(bed_paths.keys())
    449         ]
    450 narrow_peaks_dict = {
    451     list(bed_paths.keys())[i]: narrow_peaks[i].narrow_peak
    452     for i in range(len(narrow_peaks))
    453 }
    454 return narrow_peaks_dict

File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:512, in macs_call_peak(macs_path, bed_path, name, outdir, genome_size, input_format, shift, ext_size, keep_dup, q_value, nolambda)
    509 logging.basicConfig(level=level, format=log_format, handlers=handlers)
    510 log = logging.getLogger("cisTopic")
--> 512 MACS_peak_calling = MACSCallPeak(
    513     macs_path,
    514     bed_path,
    515     name,
    516     outdir,
    517     genome_size,
    518     input_format=input_format,
    519     shift=shift,
    520     ext_size=ext_size,
    521     keep_dup=keep_dup,
    522     q_value=q_value,
    523     nolambda=nolambda,
    524 )
    525 log.info(f"{name} done!")
    526 return MACS_peak_calling

File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:658, in MACSCallPeak.__init__(self, macs_path, bed_path, name, outdir, genome_size, input_format, shift, ext_size, keep_dup, q_value, nolambda)
    656 self.qvalue = q_value
    657 self.nolambda = nolambda
--> 658 self.call_peak()

File ~/pycisTopic/pycisTopic/pseudobulk_peak_calling.py:699, in MACSCallPeak.call_peak(self)
    697     subprocess.check_output(args=cmd, shell=True, stderr=subprocess.STDOUT)
    698 except subprocess.CalledProcessError as e:
--> 699     raise RuntimeError(
    700         "command '{}' return with error (code {}): {}".format(
    701             e.cmd, e.returncode, e.output
    702         )
    703     )
    704 self.narrow_peak = self.load_narrow_peak()

RuntimeError: command '/opt/anaconda3/envs/scenicplus/bin/macs2 callpeak --treatment nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz --name B_cells  --outdir nsclc/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda' return with error (code 1): b'INFO  @ Thu, 07 Sep 2023 16:03:21: \n# Command line: callpeak --treatment nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz --name B_cells --outdir nsclc/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda\n# ARGUMENTS LIST:\n# name = B_cells\n# format = BEDPE\n# ChIP-seq file = [\'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz\']\n# control file = None\n# effective genome size = 2.70e+09\n# band width = 300\n# model fold = [5, 50]\n# qvalue cutoff = 5.00e-02\n# The maximum gap between significant sites is assigned as the read length/tag size.\n# The minimum length of peaks is assigned as the predicted fragment length "d".\n# Larger dataset will be scaled towards smaller dataset.\n# Range for calculating regional lambda is: 10000 bps\n# Broad region calling is off\n# Paired-End mode is on\n# Searching for subpeak summits is on\n \nINFO  @ Thu, 07 Sep 2023 16:03:21: #1 read fragment files... \nINFO  @ Thu, 07 Sep 2023 16:03:21: #1 read treatment fragments... \nTraceback (most recent call last):\n  File "/opt/anaconda3/envs/scenicplus/bin/macs2", line 653, in <module>\n    main()\n  File "/opt/anaconda3/envs/scenicplus/bin/macs2", line 51, in main\n    run( args )\n  File "/opt/anaconda3/envs/scenicplus/lib/python3.8/site-packages/MACS2/callpeak_cmd.py", line 64, in run\n    if options.PE_MODE: (treat, control) = load_frag_files_options (options)\n  File "/opt/anaconda3/envs/scenicplus/lib/python3.8/site-packages/MACS2/callpeak_cmd.py", line 349, in load_frag_files_options\n    tp = options.parser(options.tfile[0], buffer_size=options.buffer_size)\n  File "MACS2/IO/Parser.pyx", line 330, in MACS2.IO.Parser.GenericParser.__init__\n  File "/opt/anaconda3/envs/scenicplus/lib/python3.8/gzip.py", line 58, in open\n    binary_file = GzipFile(filename, gz_mode, compresslevel)\n  File "/opt/anaconda3/envs/scenicplus/lib/python3.8/gzip.py", line 173, in __init__\n    fileobj = self.myfileobj = builtins.open(filename, mode or \'rb\')\nFileNotFoundError: [Errno 2] No such file or directory: \'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz\'\n'

I appreciate your help Thank you so much.

nikithkurella commented 1 year ago

I'm running into this error as well

SeppeDeWinter commented 1 year ago

Hi both

Can you check wether the folder specified using bed_paths exists?

Best,

Seppe

kjiiii commented 1 year ago

/home/Jongin/scenicplus/nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files ls -> bed_paths.pkl bw_paths.pkl

bed_paths -> {'B_cells': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz', 'CD14_Monocytes': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD14_Monocytes.bed.gz', 'CD4_T_cells_1': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_1.bed.gz', 'CD4_T_cells_2': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_2.bed.gz', 'CD4_T_cells_3': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_3.bed.gz', 'CD4_T_cells_4': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_4.bed.gz', 'CD4_T_cells_5': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_5.bed.gz', 'CD4_T_cells_6': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_6.bed.gz', 'CD4_T_cells_7': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells_7.bed.gz', 'FCGR3A_Monocytes': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/FCGR3A_Monocytes.bed.gz', 'NK_cells': 'nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/NK_cells.bed.gz'}

You mean like this? Yes i think it surely exists.. bw_paths the same

SeppeDeWinter commented 1 year ago

Hi @kjiiii

I see that the paths are relative, are you inside the folder: scenicplus while this code is run? Maybe better to make them absolute paths.

Best

Seppe

nikithkurella commented 1 year ago

Hi @SeppeDeWinter

Unfortunately, the issue seems to persist even after making the paths absolute.

SeppeDeWinter commented 1 year ago

@nikithkurella

Okay. Could you show the head of the file?


zcat nsclc/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells.bed.gz | head
kjiiii commented 1 year ago

Oh i think i found the problem. In the previous step,

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype sample_id_col = 'sample_id', chromsizes = chromsizes, bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored path_to_fragments = fragments_dict, # location of fragment fiels n_cpu = 8, # specify the number of cores to use, we use ray for multi processing normalize_bigwig = True, remove_duplicates = True, split_pattern = '-')

This step, the normal message must be like ->Creating pseudobulk for B cell -> B cell done! just like this.

But in my code,

2023-09-18 23:40:15,132 cisTopic INFO Reading fragments from nsclc/scATAC/GSM6449878_NSCLC8_TIL_eTreg_scATAC_fragments.tsv.gz 2023-09-18 23:41:26,979 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 (export_pseudobulk_ray pid=28284) 2023-09-18 23:41:28,933 cisTopic INFO Creating pseudobulk for B_cells (export_pseudobulk_ray pid=28279) 2023-09-18 23:41:29,082 cisTopic INFO Creating pseudobulk for CD14_Monocytes (export_pseudobulk_ray pid=28280) 2023-09-18 23:41:29,469 cisTopic INFO Creating pseudobulk for NK_cells (export_pseudobulk_ray pid=28282) 2023-09-18 23:41:29,866 cisTopic INFO Creating pseudobulk for CD4_T_cells_6 [repeated 8x across cluster]

The message goes like this and the code ran without any error, and doesn't make the pseudobulk gzip files. Is it the reason of the error? and How can this problem be solved?

Thanks.

Jongin

kjiiii commented 1 year ago

And also, When ran the previous code with n_cpu = 1,

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype sample_id_col = 'sample_id', chromsizes = chromsizes, bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored path_to_fragments = fragments_dict, # location of fragment fiels n_cpu = 1, # specify the number of cores to use, we use ray for multi processing normalize_bigwig = True, remove_duplicates = True, split_pattern = '-')

this error comes up

2023-09-18 23:49:19,080 cisTopic INFO Reading fragments from nsclc/scATAC/GSM6449878_NSCLC8_TIL_eTreg_scATAC_fragments.tsv.gz 2023-09-18 23:50:26,827 cisTopic INFO Creating pseudobulk for B_cells

ModuleNotFoundError Traceback (most recent call last) File ~/.local/lib/python3.8/site-packages/pyranges/methods/to_rle.py:6, in _to_rle(ranges, value_col, strand, rpm, **kwargs) 5 try: ----> 6 from pyrle import PyRles # type: ignore 7 from pyrle.methods import coverage # type: ignore

ModuleNotFoundError: No module named 'pyrle'

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last) Cell In[52], line 2 1 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk ----> 2 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, 3 variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype 4 sample_id_col = 'sample_id', 5 chromsizes = chromsizes, 6 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored 7 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored 8 path_to_fragments = fragments_dict, # location of fragment fiels 9 n_cpu = 1, # specify the number of cores to use, we use ray for multi processing 10 normalize_bigwig = True, 11 remove_duplicates = True, 12 split_pattern = '-')

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:186, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs) 184 ray.shutdown() 185 else: --> 186 [ 187 export_pseudobulk_one_sample( 188 cell_data, 189 group, 190 fragments_df_dict, 191 chromsizes, 192 bigwig_path, 193 bed_path, 194 sample_id_col, 195 normalize_bigwig, 196 remove_duplicates, 197 split_pattern, 198 ) 199 for group in groups 200 ] 202 return bw_paths, bed_paths

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:187, in (.0) 184 ray.shutdown() 185 else: 186 [ --> 187 export_pseudobulk_one_sample( 188 cell_data, 189 group, 190 fragments_df_dict, 191 chromsizes, 192 bigwig_path, 193 bed_path, 194 sample_id_col, 195 normalize_bigwig, 196 remove_duplicates, 197 split_pattern, 198 ) 199 for group in groups 200 ] 202 return bw_paths, bed_paths

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:285, in export_pseudobulk_one_sample(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern) 283 bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw") 284 if remove_duplicates: --> 285 group_pr.to_bigwig( 286 path=bigwig_path_group, 287 chromosome_sizes=chromsizes, 288 rpm=normalize_bigwig, 289 ) 290 else: 291 group_pr.to_bigwig( 292 path=bigwig_path_group, 293 chromosome_sizes=chromsizes, 294 rpm=normalize_bigwig, 295 value_col="Score", 296 )

File ~/.local/lib/python3.8/site-packages/pyranges/pyranges_main.py:5506, in PyRanges.to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain) 5503 if chromosome_sizes is None: 5504 chromosome_sizes = pr.data.chromsizes() -> 5506 result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 5508 if dryrun: 5509 return result

File ~/.local/lib/python3.8/site-packages/pyranges/out.py:199, in _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 196 sys.exit(1) 198 if not divide: --> 199 gr = self.to_rle(rpm=rpm, strand=False, value_col=value_col).to_ranges() 200 else: 201 gr = self.to_rle(rpm=rpm, strand=False, value_col=value_col)

File ~/.local/lib/python3.8/site-packages/pyranges/pyranges_main.py:5881, in PyRanges.to_rle(self, value_col, strand, rpm, nb_cpu) 5877 strand = self.stranded 5879 from pyranges.methods.to_rle import _to_rle -> 5881 return _to_rle(self, value_col, strand=strand, rpm=rpm, nb_cpu=nb_cpu)

File ~/.local/lib/python3.8/site-packages/pyranges/methods/to_rle.py:9, in _to_rle(ranges, value_col, strand, rpm, **kwargs) 7 from pyrle.methods import coverage # type: ignore 8 except ImportError: ----> 9 raise Exception("Using the coverage method requires that pyrle is installed.") 11 _kwargs = { 12 "strand": strand, 13 "value_col": value_col, 14 "sparse": {"self": False}, 15 } # already sparse 16 kwargs.update(_kwargs)

Exception: Using the coverage method requires that pyrle is installed.

Thanks.

Jongin

kjiiii commented 1 year ago

I solve it ! i installed pyrle with pip, and ran again. And then, the pseudobulk file came out and the Error next step solved !

Thank you so much.

Best,

Jongin

nikithkurella commented 1 year ago

@SeppeDeWinter

I realized that the pseudobulk files were never exported to the paths I specified. I was able to resolve the issue by setting n_cpu=1 as I originally had it at 8. This allowed me to rectify the errors that were being thrown by export_pseudobulk and generate the bed files. This resolved the issue with the peak calling step.

tyc-1998 commented 11 months ago

Why have I tried all the above methods, but they still don't work.If n_cpu>1, the result obtained is as follows: image if n_cpu=1,the result obtained is as follows: image

kjiiii commented 11 months ago

@tyc-1998

Hi,

Can I see your full code and full error?

Jongin

tyc-1998 commented 11 months ago

@tyc-1998

你好,

我可以看看你的完整代码和完整错误吗?

钟仁

@kjiiii hi, My complete code and complete error on jupyter is as follows:

import os work_dir = 'pbmc_tutorial' import pycisTopic

set some figure parameters for nice display inside jupyternotebooks.

%matplotlib inline

make a directory for to store the processed scATAC-seq data.

if not os.path.exists(os.path.join(work_dir, 'scATAC')): os.makedirs(os.path.join(work_dir, 'scATAC')) fragments_dict = {'10x_pbmc': os.path.join(work_dir, 'data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz')} import scanpy as sc adata = sc.read_h5ad(os.path.join(work_dir, 'scRNA/adata.h5ad')) cell_data = adata.obs cell_data['sample_id'] = '10x_pbmc' cell_data['celltype'] = cell_data['celltype'].astype(str) # set data type of the celltype column to str, otherwise the export_pseudobulk function will complain. del(adata)

Get chromosome sizes (for hg38 here)

import pyranges as pr import requests import pandas as pd

target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'

chromsizes=pd.read_csv(os.path.join(work_dir,"data/hg38.chrom.sizes.txt"), sep='\t', header=None) chromsizes chromsizes.columns=['Chromosome', 'End'] chromsizes['Start']=[0]*chromsizes.shape[0] chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]

Exceptionally in this case, to agree with CellRangerARC annotations

chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))] chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('')[1] if len(chromsizes['Chromosome'][x].split('')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))] chromsizes=pr.PyRanges(chromsizes) chromsizes tmp_dir='/data/temp' from pycisTopic.pseudobulk_peak_calling import export_pseudobulk bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype sample_id_col = 'sample_id', chromsizes = chromsizes, bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored path_to_fragments = fragments_dict, # location of fragment fiels n_cpu = 8, # specify the number of cores to use, we use ray for multi processing normalize_bigwig = True, remove_duplicates = True, _temp_dir = os.path.join(tmp_dir, 'ray_spill'), split_pattern = '-') image import pickle pickle.dump(bed_paths, open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'wb')) pickle.dump(bw_paths, open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'wb')) import pickle bed_paths = pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'rb')) bw_paths = pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'rb')) from pycisTopic.pseudobulk_peak_calling import peak_calling macs_path='macs2'

Run peak calling

narrow_peaks_dict = peak_calling(macs_path, bed_paths, os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/'), genome_size='hs', n_cpu=1, input_format='BEDPE', shift=73, ext_size=146, keep_dup = 'all', q_value = 0.05, _temp_dir = os.path.join(tmp_dir, 'ray_spill')) 2023-12-18 16:31:43,026 cisTopic INFO Calling peaks for B_cells_1 with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz --name B_cells_1 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda

CalledProcessError Traceback (most recent call last) File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:697, in MACSCallPeak.call_peak(self) 696 try: --> 697 subprocess.check_output(args=cmd, shell=True, stderr=subprocess.STDOUT) 698 except subprocess.CalledProcessError as e:

File /usr/local/lib/python3.9/subprocess.py:424, in check_output(timeout, *popenargs, *kwargs) 422 kwargs['input'] = empty --> 424 return run(popenargs, stdout=PIPE, timeout=timeout, check=True, 425 **kwargs).stdout

File /usr/local/lib/python3.9/subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs) 527 if check and retcode: --> 528 raise CalledProcessError(retcode, process.args, 529 output=stdout, stderr=stderr) 530 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command 'macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz --name B_cells_1 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last) Cell In[66], line 8 6 macs_path='macs2' 7 # Run peak calling ----> 8 narrow_peaks_dict = peak_calling(macs_path, 9 bed_paths, 10 os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/'), 11 genome_size='hs', 12 n_cpu=1, 13 input_format='BEDPE', 14 shift=73, 15 ext_size=146, 16 keep_dup = 'all', 17 q_value = 0.05, 18 _temp_dir = os.path.join(tmp_dir, 'ray_spill'))

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:435, in peak_calling(macs_path, bed_paths, outdir, genome_size, n_cpu, input_format, shift, ext_size, keep_dup, q_value, nolambda, **kwargs) 433 ray.shutdown() 434 else: --> 435 narrow_peaks = [macs_call_peak( 436 macs_path, 437 bed_paths[name], 438 name, 439 outdir, 440 genome_size, 441 input_format, 442 shift, 443 ext_size, 444 keep_dup, 445 q_value, 446 nolambda, 447 ) 448 for name in list(bed_paths.keys()) 449 ] 450 narrow_peaks_dict = { 451 list(bed_paths.keys())[i]: narrow_peaks[i].narrow_peak 452 for i in range(len(narrow_peaks)) 453 } 454 return narrow_peaks_dict

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:435, in (.0) 433 ray.shutdown() 434 else: --> 435 narrow_peaks = [macs_call_peak( 436 macs_path, 437 bed_paths[name], 438 name, 439 outdir, 440 genome_size, 441 input_format, 442 shift, 443 ext_size, 444 keep_dup, 445 q_value, 446 nolambda, 447 ) 448 for name in list(bed_paths.keys()) 449 ] 450 narrow_peaks_dict = { 451 list(bed_paths.keys())[i]: narrow_peaks[i].narrow_peak 452 for i in range(len(narrow_peaks)) 453 } 454 return narrow_peaks_dict

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:512, in macs_call_peak(macs_path, bed_path, name, outdir, genome_size, input_format, shift, ext_size, keep_dup, q_value, nolambda) 509 logging.basicConfig(level=level, format=log_format, handlers=handlers) 510 log = logging.getLogger("cisTopic") --> 512 MACS_peak_calling = MACSCallPeak( 513 macs_path, 514 bed_path, 515 name, 516 outdir, 517 genome_size, 518 input_format=input_format, 519 shift=shift, 520 ext_size=ext_size, 521 keep_dup=keep_dup, 522 q_value=q_value, 523 nolambda=nolambda, 524 ) 525 log.info(f"{name} done!") 526 return MACS_peak_calling

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:658, in MACSCallPeak.init(self, macs_path, bed_path, name, outdir, genome_size, input_format, shift, ext_size, keep_dup, q_value, nolambda) 656 self.qvalue = q_value 657 self.nolambda = nolambda --> 658 self.call_peak()

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:699, in MACSCallPeak.call_peak(self) 697 subprocess.check_output(args=cmd, shell=True, stderr=subprocess.STDOUT) 698 except subprocess.CalledProcessError as e: --> 699 raise RuntimeError( 700 "command '{}' return with error (code {}): {}".format( 701 e.cmd, e.returncode, e.output 702 ) 703 ) 704 self.narrow_peak = self.load_narrow_peak()

RuntimeError: command 'macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz --name B_cells_1 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda' return with error (code 1): b'INFO @ Mon, 18 Dec 2023 16:31:43: \n# Command line: callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz --name B_cells_1 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda\n# ARGUMENTS LIST:\n# name = B_cells_1\n# format = BEDPE\n# ChIP-seq file = [\'pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz\']\n# control file = None\n# effective genome size = 2.70e+09\n# band width = 300\n# model fold = [5, 50]\n# qvalue cutoff = 5.00e-02\n# The maximum gap between significant sites is assigned as the read length/tag size.\n# The minimum length of peaks is assigned as the predicted fragment length "d".\n# Larger dataset will be scaled towards smaller dataset.\n# Range for calculating regional lambda is: 10000 bps\n# Broad region calling is off\n# Paired-End mode is on\n# Searching for subpeak summits is on\n \nINFO @ Mon, 18 Dec 2023 16:31:43: #1 read fragment files... \nINFO @ Mon, 18 Dec 2023 16:31:43: #1 read treatment fragments... \nTraceback (most recent call last):\n File "/usr/local/bin/macs2", line 653, in \n main()\n File "/usr/local/bin/macs2", line 51, in main\n run( args )\n File "/usr/local/lib/python3.9/site-packages/MACS2/callpeak_cmd.py", line 64, in run\n if options.PE_MODE: (treat, control) = load_frag_files_options (options)\n File "/usr/local/lib/python3.9/site-packages/MACS2/callpeak_cmd.py", line 349, in load_frag_files_options\n tp = options.parser(options.tfile[0], buffer_size=options.buffer_size)\n File "MACS2/IO/Parser.pyx", line 330, in MACS2.IO.Parser.GenericParser.init\n File "/usr/local/lib/python3.9/gzip.py", line 58, in open\n binary_file = GzipFile(filename, gz_mode, compresslevel)\n File "/usr/local/lib/python3.9/gzip.py", line 173, in init\n fileobj = self.myfileobj = builtins.open(filename, mode or \'rb\')\nFileNotFoundError: [Errno 2] No such file or directory: \'pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz\'\n'

kjiiii commented 11 months ago

@tyc-1998

The " _temp_dir = os.path.join(tmp_dir, 'ray_spill')" code may have made the error. will you delete this code and run with n_cpu = 1?

tyc-1998 commented 11 months ago

@tyc-1998

The " _temp_dir = os.path.join(tmp_dir, 'ray_spill')" code may have made the error. will you delete this code and run with n_cpu = 1?

@kjiiii If n_cpu=1,I get the following result: My code is like this: from pycisTopic.pseudobulk_peak_calling import export_pseudobulk bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype sample_id_col = 'sample_id', chromsizes = chromsizes, bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored path_to_fragments = fragments_dict, # location of fragment fiels n_cpu = 1, # specify the number of cores to use, we use ray for multi processing normalize_bigwig = True, remove_duplicates = True,

_temp_dir = os.path.join(tmp_dir, 'ray_spill'),

             split_pattern = '-')

error AttributeError Traceback (most recent call last) Cell In[7], line 2 1 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk ----> 2 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, 3 variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype 4 sample_id_col = 'sample_id', 5 chromsizes = chromsizes, 6 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored 7 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored 8 path_to_fragments = fragments_dict, # location of fragment fiels 9 n_cpu = 1, # specify the number of cores to use, we use ray for multi processing 10 normalize_bigwig = True, 11 remove_duplicates = True, 12 #_temp_dir = os.path.join(tmp_dir, 'ray_spill'), 13 split_pattern = '-')

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:186, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs) 184 ray.shutdown() 185 else: --> 186 [ 187 export_pseudobulk_one_sample( 188 cell_data, 189 group, 190 fragments_df_dict, 191 chromsizes, 192 bigwig_path, 193 bed_path, 194 sample_id_col, 195 normalize_bigwig, 196 remove_duplicates, 197 split_pattern, 198 ) 199 for group in groups 200 ] 202 return bw_paths, bed_paths

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:187, in (.0) 184 ray.shutdown() 185 else: 186 [ --> 187 export_pseudobulk_one_sample( 188 cell_data, 189 group, 190 fragments_df_dict, 191 chromsizes, 192 bigwig_path, 193 bed_path, 194 sample_id_col, 195 normalize_bigwig, 196 remove_duplicates, 197 split_pattern, 198 ) 199 for group in groups 200 ] 202 return bw_paths, bed_paths

File /usr/local/lib/python3.9/site-packages/pycisTopic/pseudobulk_peak_calling.py:285, in export_pseudobulk_one_sample(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern) 283 bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw") 284 if remove_duplicates: --> 285 group_pr.to_bigwig( 286 path=bigwig_path_group, 287 chromosome_sizes=chromsizes, 288 rpm=normalize_bigwig, 289 ) 290 else: 291 group_pr.to_bigwig( 292 path=bigwig_path_group, 293 chromosome_sizes=chromsizes, 294 rpm=normalize_bigwig, 295 value_col="Score", 296 )

File /usr/local/lib/python3.9/site-packages/pyranges/pyranges_main.py:5506, in PyRanges.to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain) 5503 if chromosome_sizes is None: 5504 chromosome_sizes = pr.data.chromsizes() -> 5506 result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 5508 if dryrun: 5509 return result

File /usr/local/lib/python3.9/site-packages/pyranges/out.py:212, in _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 208 new_pyrles[k] = v 210 gr = c.defragment().to_ranges() --> 212 unique_chromosomes = gr.chromosomes 214 subset = ["Chromosome", "Start", "End", "Score"] 216 gr = gr[subset].unstrand()

File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5907, in NDFrame.getattr(self, name) 5900 if ( 5901 name not in self._internal_names_set 5902 and name not in self._metadata 5903 and name not in self._accessors 5904 and self._info_axis._can_hold_identifiers_and_holds_name(name) 5905 ): 5906 return self[name] -> 5907 return object.getattribute(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'

SITRR commented 10 months ago

@tyc-1998 I get the same errors: with n_cpu = 8 the function does not output .bed.gz files with n_cpu = 1 the function outputs the error "AttributeError: 'DataFrame' object has no attribute 'chromosomes"

Have you managed to figure it out? @SeppeDeWinter @kjiiii could you assist with these issues?

Thanks!

tyc-1998 commented 10 months ago

Unfortunately, I didn't solve it because of time.

---Original--- From: @.> Date: Fri, Jan 19, 2024 00:14 AM To: @.>; Cc: @.**@.>; Subject: Re: [aertslab/scenicplus] Peakcalling CalledprocessError (Issue #222)

@tyc-1998 I get the same errors: with n_cpu = 8 the function does not output .bed.gz files with n_cpu = 1 the function outputs the error "AttributeError: 'DataFrame' object has no attribute 'chromosomes"

Have you managed to figure it out? @SeppeDeWinter @kjiiii could you assist with these issues?

Thanks!

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

SeppeDeWinter commented 10 months ago

Hi @tyc-1998 and @SITRR

I just pushed some changes that should solve this issue, see: https://github.com/aertslab/pycisTopic/commit/1afbd1d71dd9caf2f8f53d4c752240089b182bc9.

Can you reinstall pycisTopic and test wether these changes indeed solve your issue?

All the best,

Seppe

SITRR commented 10 months ago

@SeppeDeWinter

Thank you for your response!

Unfortunately, it didn't work for me but I suspect it's because the barcodes don't match between my cell_data dataframe and my fragments.tsv.gz files. I'm new to python, but I'll try to troubleshoot using the pycisTopic tutorial and two github pages pycisTopic/issues/41 and pycisTopic/issues/59. If I'm unable to within the next few days, I'll open a separate issue.

Best, Susana

SeppeDeWinter commented 10 months ago

Hi Susana

Ok. If you need more assistance, feel free top open a new discussion.

I'm closing this issue for now.

All the best,

Seppe