aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
163 stars 27 forks source link

AttributeError in export_pseudobulk with Python 3.8 #299

Closed elhaam closed 4 months ago

elhaam commented 4 months ago

Hello, and thank you for this helpful framework!

I am running this script using pycisTopic v1.0.0 from pbmc_multiome_tutorial.ipynb: (other newer versions of pycisTopic had errors with export_pseudobulk so I had to use this older version.)

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 8,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
                 split_pattern = '-')

I get the following output which has less number of clusters. Also, a bigger concern is that it does not generate any bed.gz files (the subdirectory has no bed.gz), as mentioned in the tutorial:

2024-02-13 22:30:04,416 cisTopic     INFO     Reading fragments from pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
2024-02-13 22:30:48,198 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(export_pseudobulk_ray pid=920949) 2024-02-13 22:30:50,599 cisTopic     INFO     Creating pseudobulk for B_cells
(export_pseudobulk_ray pid=920950) 2024-02-13 22:30:50,955 cisTopic     INFO     Creating pseudobulk for B_cells_1
(export_pseudobulk_ray pid=920951) 2024-02-13 22:30:52,752 cisTopic     INFO     Creating pseudobulk for Dendritic_cells
(export_pseudobulk_ray pid=920946) 2024-02-13 22:30:53,495 cisTopic     INFO     Creating pseudobulk for FCGR3A_Monocytes [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)

I followed https://github.com/aertslab/scenicplus/issues/222 and changed n_cpu=8 to n_cpu=1 to fix this issue, but now I get this error:

2024-02-13 22:31:21,666 cisTopic     INFO     Reading fragments from pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
2024-02-13 22:32:01,861 cisTopic     INFO     Creating pseudobulk for B_cells
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 2
      1 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
----> 2 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
      3                  variable = 'celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
      4                  sample_id_col = 'sample_id',
      5                  chromsizes = chromsizes,
      6                  bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
      7                  bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
      8                  path_to_fragments = fragments_dict,                                                        # location of fragment fiels
      9                  n_cpu = 1,                                                                                 # specify the number of cores to use, we use ray for multi processing
     10                  normalize_bigwig = True,
     11                  remove_duplicates = True,
     12                  _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
     13                  split_pattern = '-')

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:162, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, **kwargs)
    160     ray.shutdown()
    161 else:
--> 162     [export_pseudobulk_one_sample(
    163                 cell_data,
    164                 group,
    165                 fragments_df_dict,
    166                 chromsizes,
    167                 bigwig_path,
    168                 bed_path,
    169                 sample_id_col,
    170                 normalize_bigwig,
    171                 remove_duplicates,
    172                 split_pattern) for group in groups]
    173 bw_paths = {
    174     group: os.path.join(
    175         bigwig_path,
    176         str(group) +
    177         '.bw') for group in groups}
    178 bed_paths = {
    179     group: os.path.join(
    180         bed_path,
    181         str(group) +
    182         '.bed.gz') for group in groups}

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:162, in <listcomp>(.0)
    160     ray.shutdown()
    161 else:
--> 162     [export_pseudobulk_one_sample(
    163                 cell_data,
    164                 group,
    165                 fragments_df_dict,
    166                 chromsizes,
    167                 bigwig_path,
    168                 bed_path,
    169                 sample_id_col,
    170                 normalize_bigwig,
    171                 remove_duplicates,
    172                 split_pattern) for group in groups]
    173 bw_paths = {
    174     group: os.path.join(
    175         bigwig_path,
    176         str(group) +
    177         '.bw') for group in groups}
    178 bed_paths = {
    179     group: os.path.join(
    180         bed_path,
    181         str(group) +
    182         '.bed.gz') for group in groups}

File ~/.local/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:263, in export_pseudobulk_one_sample(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern)
    261 if isinstance(bigwig_path, str):
    262     if remove_duplicates:
--> 263         group_pr.to_bigwig(
    264             path=bigwig_path_group,
    265             chromosome_sizes=chromsizes,
    266             rpm=normalize_bigwig)
    267     else:
    268         group_pr.to_bigwig(
    269             path=bigwig_path_group,
    270             chromosome_sizes=chromsizes,
    271             rpm=normalize_bigwig,
    272             value_col='Score')

File ~/.local/lib/python3.8/site-packages/pyranges/pyranges_main.py:5506, in PyRanges.to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain)
   5503 if chromosome_sizes is None:
   5504     chromosome_sizes = pr.data.chromsizes()
-> 5506 result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
   5508 if dryrun:
   5509     return result

File ~/.local/lib/python3.8/site-packages/pyranges/out.py:212, in _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
    208         new_pyrles[k] = v
    210     gr = c.defragment().to_ranges()
--> 212 unique_chromosomes = gr.chromosomes
    214 subset = ["Chromosome", "Start", "End", "Score"]
    216 gr = gr[subset].unstrand()

File ~/.local/lib/python3.8/site-packages/pandas/core/generic.py:5907, in NDFrame.__getattr__(self, name)
   5900 if (
   5901     name not in self._internal_names_set
   5902     and name not in self._metadata
   5903     and name not in self._accessors
   5904     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5905 ):
   5906     return self[name]
-> 5907 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'

I tried aertslab/pycisTopic@1afbd1d as discussed in another issue, but this actually makes more errors for peak calling so below code gets the following error:

Code:

# using pycisTopic 1afbd1d
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 8,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
                 split_pattern = '-')

error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 3
      1 # using pycisTopic commit 1afbd1d
      2 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
----> 3 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
      4                  variable = 'celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
      5                  sample_id_col = 'sample_id',
      6                  chromsizes = chromsizes,
      7                  bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
      8                  bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
      9                  path_to_fragments = fragments_dict,                                                        # location of fragment fiels
     10                  n_cpu = 8,                                                                                 # specify the number of cores to use, we use ray for multi processing
     11                  normalize_bigwig = True,
     12                  remove_duplicates = True,
     13                  _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
     14                  split_pattern = '-')

TypeError: export_pseudobulk() got an unexpected keyword argument 'remove_duplicates'

My scatac_fragment_tools version is 0.1.0 (also tried tag 5a3f538) and Python version is 3.8.18

Would you please help me with this? Thank you so much in advance!

elhaam commented 4 months ago

I solved the above issues following suggestions here: https://github.com/aertslab/scenicplus/issues/297

Here is a summary of the suggestions if someone faces the same issue I did:

  1. Comment out remove_duplicates = True since it is not needed anymore (old functionality, but missing in the current function definition)
  2. _temp_dir must be changed to temp_dir
  3. Run this command to generate tbi files for fragments: tabix -p bed pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
SeppeDeWinter commented 4 months ago

@elhaam

Thank you for the summary.

the zcat command does not generate the index though (tbi file). This is done using the tabix command,


tabix -p bed pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz

All the best,

Seppe