aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

export_pseudobulk: 'DataFrame' object has no attribute 'chromosomes' #104

Closed bhhlee closed 10 months ago

bhhlee commented 10 months ago

First of all, thank you for making such a great tool. I am using export_pseudobulk on the scATAC-seq data from my multiome dataset (multiple samples). For most of the annotated cell types, I get the error: 'DataFrame' object has no attribute 'chromosomes'.

# Get chromosome sizes (for hg38 here)
import os
import pandas as pd
import pyranges as pr
target_url = 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes = pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns = ['Chromosome', 'End']
chromsizes['Start'] = 0
chromsizes = chromsizes.loc[:, ['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations:
# turn the version suffix 'v1' into '.1', then keep only the accession of underscore-separated contig names
chromsizes['Chromosome'] = [c.replace('v', '.') for c in chromsizes['Chromosome']]
chromsizes['Chromosome'] = [c.split('_')[1] if '_' in c else c for c in chromsizes['Chromosome']]
chromsizes = pr.PyRanges(chromsizes)
tmp_dir = '/scratch/'
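
To make the renaming step concrete, here is a minimal sketch in plain Python that applies the same two transformations to a few contig names. The helper name to_cellranger_name and the sample names are assumptions chosen for illustration; they are not part of pycisTopic.

```python
def to_cellranger_name(name: str) -> str:
    """Apply the two renaming steps from the snippet above (hypothetical helper)."""
    name = name.replace('v', '.')  # version suffix: 'v1' -> '.1'
    parts = name.split('_')
    # For underscore-separated contigs, keep only the accession (second field)
    return parts[1] if len(parts) > 1 else name


for contig in ['chr1', 'chrUn_KI270302v1', 'chr1_KI270706v1_random']:
    print(contig, '->', to_cellranger_name(contig))
# chr1 -> chr1
# chrUn_KI270302v1 -> KI270302.1
# chr1_KI270706v1_random -> KI270706.1
```

Primary chromosomes are left untouched, while unplaced and random contigs are reduced to their versioned accession.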

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'celltype',                                                                     # variable by which to generate pseudobulk profiles; here we want one pseudobulk per cell type
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # where the pseudobulk BED files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# where the pseudobulk bigWig files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of the fragment files
                 n_cpu = 16,                                                                                 # number of cores to use; Ray is used for multiprocessing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 split_pattern = '___')

Error:

2024-01-16 14:51:32,054 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/ 
(export_pseudobulk_ray pid=1554857) 2024-01-16 14:52:24,234 cisTopic     INFO     Creating pseudobulk for Bcell
(export_pseudobulk_ray pid=1554857) /data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
(export_pseudobulk_ray pid=1554857)   group_fragments = group_fragments_list[0].append(group_fragments_list[1:])
(raylet) Spilled 85719 MiB, 2 objects, write throughput 545 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(export_pseudobulk_ray pid=1554861) 2024-01-16 14:56:01,637 cisTopic     INFO     Creating pseudobulk for CD4SPTcell
2024-01-16 14:57:06,245 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::export_pseudobulk_ray() (pid=1554857, ip=147.47.206.72)
  File "/data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 346, in export_pseudobulk_ray
    export_pseudobulk_one_sample(
  File "/data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 285, in export_pseudobulk_one_sample
    group_pr.to_bigwig(
  File "/data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py", line 5506, in to_bigwig
    result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
  File "/data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py", line 212, in _to_bigwig
    unique_chromosomes = gr.chromosomes
  File "/data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py", line 5907, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'chromosomes'
(export_pseudobulk_ray pid=1554861) /data/bhlee/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
(export_pseudobulk_ray pid=1554861)   group_fragments = group_fragments_list[0].append(group_fragments_list[1:])
(raylet) Spilled 171439 MiB, 4 objects, write throughput 497 MiB/s.

For some additional context, I tested export_pseudobulk on a single sample from the dataset. The first run gave the same AttributeError, but when I ran it again the error went away. On the whole multi-sample dataset, however, the error persists.

I am running: pycisTopic 1.0.3.dev21+ge9b0e1a, Python 3.8.18

Any help would be greatly appreciated!

GGboy-Zzz commented 10 months ago

Hello, I hit the same error when I set n_cpu=1 in export_pseudobulk. With n_cpu=5 the error is not raised, but the bed/bw.gz files are still not exported correctly. I noticed the solution mentioned in a scenicplus issue, which modifies pseudobulk_peak_calling.py (https://github.com/aertslab/scenicplus/issues/277), but I don't quite understand it. Could you explain the specific steps? @SeppeDeWinter

SeppeDeWinter commented 10 months ago

Hi @bhhlee and @GGboy-Zzz

I just merged some changes that should fix this issue, see https://github.com/aertslab/pycisTopic/commit/1afbd1d71dd9caf2f8f53d4c752240089b182bc9.

Can you try to rerun the code to see whether the issue is indeed fixed for you?

All the best,

Seppe

cbiagii commented 10 months ago

Hi @SeppeDeWinter,

I was having a similar problem, but after installing pycisTopic at commit 1afbd1d, the following error appears when importing the export_pseudobulk function:

Command:

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cao385/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 16, in <module>
    from scatac_fragment_tools.library.bigwig.fragments_to_bigwig import (
  File "/home/cao385/envs/scenicplus/lib/python3.8/site-packages/scatac_fragment_tools/library/bigwig/fragments_to_bigwig.py", line 31, in <module>
    def normalise_filepath(path: str | Path, check_not_directory: bool = True) -> str:
TypeError: unsupported operand type(s) for |: 'type' and 'type'

Thanks for your help! Carlos

SeppeDeWinter commented 10 months ago

Hi @cbiagii

This commit https://github.com/aertslab/scatac_fragment_tools/commit/5a3f5383d0b681ee1a407cc0053c4b109f9881ba should fix the issue.

The issue is related to the type annotations we used (the str | Path union syntax), which are only supported by newer versions of Python (3.10 and later). You should now also be able to import the code with Python 3.8.
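
As a rough illustration of the kind of change that makes such an annotation importable on Python 3.8 (not necessarily what the linked commit does), the union can be spelled with typing.Union instead of the | operator. The function name and signature below are taken from the traceback above; the function body is an assumption made up for this example and is not the scatac_fragment_tools implementation.

```python
import os
from pathlib import Path
from typing import Union


# On Python < 3.10, writing `path: str | Path` here raises
# "TypeError: unsupported operand type(s) for |: 'type' and 'type'"
# as soon as the function is defined. Union[str, Path] works on 3.8.
def normalise_filepath(path: Union[str, Path], check_not_directory: bool = True) -> str:
    """Return `path` as a user-expanded string (illustrative body only)."""
    path = os.path.expanduser(os.fspath(path))
    if check_not_directory and os.path.isdir(path):
        raise IsADirectoryError(f"Expected a file path, got a directory: {path!r}")
    return path


print(normalise_filepath(Path("fragments.tsv.gz")))
```

An alternative that keeps the | spelling is to put from __future__ import annotations at the top of the module, which defers evaluation of annotations so they are never executed at definition time.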

All the best,

Seppe