aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
165 stars 27 forks source link

export_pseudobulk no bed files #277

Open LILI-0000-0002-8173-7367 opened 6 months ago

LILI-0000-0002-8173-7367 commented 6 months ago

Hello, Seppe and Scenicplus team.

I encountered issues when I performed export_pseudobulk step. I have several multiome libraries. It was working back in July and I just re-run everything with some modification with the defined clusters.

Now, I noticed export_pseudobulk can't generate bed files and I also noticed that #262 and #275 also reported similar issues.

Here is my trials:

  1. Initially, it indicated I need to install pyrle. I installed pyrle, then ran export_pseudobulk with n_cpu = 72 (it works in July). However, there is no bed files generated.
    from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
    bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'Celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 72,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmpDir, 'ray_spill'),
                 split_pattern = '___',
                 use_polars = True)

    I printed bw_paths and bed_paths, there are corrected.

    print(bed_paths)
    {'Agrp': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Agrp.bed.gz', 'Dlx6': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Dlx6.bed.gz', 'Fezf1': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Fezf1.bed.gz', 'Lhx1': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Lhx1.bed.gz',
    ....}

    Error output

(export_pseudobulk_ray pid=2386) 2023-12-27 15:12:22,349 cisTopic INFO Creating pseudobulk for Slc24a4

(export_pseudobulk_ray pid=2386) /home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. (export_pseudobulk_ray pid=2386) group_fragments = group_fragments_list[0].append(group_fragments_list[1:]) 2023-12-27 15:14:58,951 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::export_pseudobulk_ray() (pid=2386, ip=172.18.224.56) File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 346, in export_pseudobulk_ray export_pseudobulk_one_sample( File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 285, in export_pseudobulk_one_sample group_pr.to_bigwig( File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py", line 5506, in to_bigwig result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py", line 212, in _to_bigwig unique_chromosomes = gr.chromosomes File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py", line 5907, in getattr return object.getattribute(self, name) AttributeError: 'DataFrame' object has no attribute 'chromosomes'

2. I tried to lower n_cpu = 1.  The error is "AttributeError: 'DataFrame' object has no attribute 'chromosomes'"

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk bw_paths, bed_paths = export_pseudobulk(input_data = cell_data, variable = 'Celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype sample_id_col = 'sample_id', chromsizes = chromsizes, bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored path_to_fragments = fragments_dict, # location of fragment fiels n_cpu = 1, # specify the number of cores to use, we use ray for multi processing normalize_bigwig = True, remove_duplicates = True, _temp_dir = os.path.join(tmpDir, 'ray_spill'), split_pattern = '___', use_polars = True)

**Error output**

2023-12-27 12:48:24,070 cisTopic INFO Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_2_Multiome_GEXplusATAC_20220406/outs/atac_fragments.tsv.gz 2023-12-27 12:49:08,646 cisTopic INFO Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_3_Multiome_GEXplusATAC_20220902/outs/atac_fragments.tsv.gz 2023-12-27 12:54:47,229 cisTopic INFO Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_4_Multiome_GEXplusATAC_20220902/outs/atac_fragments.tsv.gz 2023-12-27 12:57:49,152 cisTopic INFO Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_5_Multiome_GEXplusATAC_20221220/outs/atac_fragments.tsv.gz 2023-12-27 13:01:18,559 cisTopic INFO Creating pseudobulk for Agrp


AttributeError Traceback (most recent call last) /tmp/ipykernel_18655/3796073447.py in ?() ----> 4 import pyrle 5 6 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk 7 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs) 182 num_returns=len(groups), 183 ) 184 ray.shutdown() 185 else: --> 186 [ 187 export_pseudobulk_one_sample( 188 cell_data, 189 group,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(.0) 186 def export_pseudobulk( --> 187 input_data: Union["CistopicObject", pd.DataFrame, Dict[str, pd.DataFrame]], 188 variable: str, 189 chromsizes: Union[pd.DataFrame, pr.PyRanges], 190 bed_path: str,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern) 281 group_pr = pr.PyRanges(group_fragments) 282 if isinstance(bigwig_path, str): 283 bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw") 284 if remove_duplicates: --> 285 group_pr.to_bigwig( 286 path=bigwig_path_group, 287 chromosome_sizes=chromsizes, 288 rpm=normalize_bigwig,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain) 5502 5503 if chromosome_sizes is None: 5504 chromosome_sizes = pr.data.chromsizes() 5505 -> 5506 result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 5507 5508 if dryrun: 5509 return result

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 208 new_pyrles[k] = v 209 210 gr = c.defragment().to_ranges() 211 --> 212 unique_chromosomes = gr.chromosomes 213 214 subset = ["Chromosome", "Start", "End", "Score"] 215

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name) 5903 and name not in self._accessors 5904 and self._info_axis._can_hold_identifiers_and_holds_name(name) 5905 ): 5906 return self[name] -> 5907 return object.getattribute(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'


3. I also tried **export_pseudobulk_one_sample** function as you mentioned in #262 .  The error is "AttributeError: 'DataFrame' object has no attribute 'chromosomes'".

Here is the code. 

print(set(cell_data["sample_id"])) print(set(cell_data["Celltype"]))

print(fragment_dict.keys())

print(fragments_dict) from pycisTopic.utils import ( read_fragments_from_file, prepare_tag_cells ) import pyranges as pr import re import numpy as np

variable = 'Celltype' sample_id_col = 'sample_id' chromsizes = chromsizes split_pattern = '___'

fragments_df_dict = {} for sample_id in fragments_dict.keys(): fragments_df = read_fragments_from_file( fragments_dict[sample_id], use_polars=True ).df fragments_df.Start = np.int32(fragments_df.Start) fragments_df.End = np.int32(fragments_df.End) if "Score" in fragments_df: fragments_df.Score = np.int32(fragments_df.Score) if "barcode" in cell_data: fragments_df = fragments_df.loc[ fragments_df["Name"].isin(cell_data["barcode"].tolist()) ] else: fragments_df = fragments_df.loc[ fragments_df["Name"].isin( prepare_tag_cells(cell_data.index.tolist(), split_pattern) ) ] fragments_df_dict[sample_id] = fragments_df if "barcode" in cell_data: cell_data = cell_data.loc[:, [variable, sample_id_col, "barcode"]] else: cell_data = cell_data.loc[:, [variable, sample_id_col]] cell_data[variable] = cell_data[variable].replace( " ", "", regex=True) cell_data[variable] = celldata[variable].replace( "[^A-Za-z0-9]+", "", regex=True) groups = sorted(list(set(cell_data[variable]))) if isinstance(chromsizes, pd.DataFrame): chromsizes = chromsizes.loc[:, ["Chromosome", "Start", "End"]] chromsizes = pr.PyRanges(chromsizes)

from pycisTopic.pseudobulk_peak_calling import * export_pseudobulk_one_sample( cell_data = cell_data, group = "Agrp", fragments_df_dict = fragments_df_dict, chromsizes = chromsizes, bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'), bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), sample_id_col = sample_id_col, normalize_bigwig = True, remove_duplicates = True, split_pattern = split_pattern )

**Error output**

2023-12-27 12:09:59,794 cisTopic INFO Creating pseudobulk for Agrp


AttributeError Traceback (most recent call last) /tmp/ipykernel_18655/1944494090.py in ?() ----> 2 from pycisTopic.pseudobulk_peak_calling import * 3 export_pseudobulk_one_sample( 4 cell_data = cell_data, 5 group = "Agrp",

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern) 281 group_pr = pr.PyRanges(group_fragments) 282 if isinstance(bigwig_path, str): 283 bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw") 284 if remove_duplicates: --> 285 group_pr.to_bigwig( 286 path=bigwig_path_group, 287 chromosome_sizes=chromsizes, 288 rpm=normalize_bigwig,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain) 5502 5503 if chromosome_sizes is None: 5504 chromosome_sizes = pr.data.chromsizes() 5505 -> 5506 result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 5507 5508 if dryrun: 5509 return result

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun) 208 new_pyrles[k] = v 209 210 gr = c.defragment().to_ranges() 211 --> 212 unique_chromosomes = gr.chromosomes 213 214 subset = ["Chromosome", "Start", "End", "Score"] 215

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name) 5903 and name not in self._accessors 5904 and self._info_axis._can_hold_identifiers_and_holds_name(name) 5905 ): 5906 return self[name] -> 5907 return object.getattribute(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'


**fragments_df_dict**

{'POMC.2.Multiome': Chromosome Start End Name Score 0 GL456210.1 674 721 TGGTAAACACAAAGAC-1 2 1 GL456210.1 762 840 CATGCAAGTGCTGTAA-1 2 3 GL456210.1 1237 1314 GCTATAGGTGACATGC-1 7 7 GL456210.1 3973 4269 TCCTCAATCGTTACTT-1 1 10 GL456210.1 5799 6046 CTAGCTTGTGCATTAG-1 1 ... ... ... ... ... ... 46402681 chrY 90813548 90813613 TCTAGCGAGGCTACTG-1 3 46402682 chrY 90813552 90813627 CCTAAATCAGGCCAAA-1 5 46402687 chrY 90813565 90813613 CCTAAATCAGGCCAAA-1 1 46402689 chrY 90813598 90813759 CTAATTGAGTAGCTTA-1 4 46402699 chrY 90813819 90813894 TCCATGCTCCTAAGAC-1 3

[10729616 rows x 5 columns], 'POMC.3.Multiome': Chromosome Start End Name Score 0 GL456210.1 777 826 CTTGTTCCATGGTTAT-1 1 15 GL456210.1 4239 4389 GTCTAACAGTAAGAAC-1 1 17 GL456210.1 4319 4696 ATTGCGCCAGCACGTT-1 1 19 GL456210.1 4483 4675 AATAGAGGTTAGCGTA-1 4 35 GL456210.1 5295 5340 TTGTTGCGTTTGCTGT-1 1 ... ... ... ... ... ... 389415055 chrY 90829719 90829877 ACTCGCTTCTTGCAGG-1 4 389415058 chrY 90829766 90829877 GCTGACCAGGATCACT-1 3 389415059 chrY 90829808 90829871 GGTGCTTCAGAGAGCC-1 2 389415060 chrY 90829808 90829882 GAACCTGTCGATTCTT-1 6 389415064 chrY 90832009 90832046 AGTAACGAGGGTGAAC-1 1

[148888034 rows x 5 columns], 'POMC.4.Multiome': Chromosome Start End Name Score 5 GL456210.1 788 840 GAACGAATCAATCTCT-1 2 6 GL456210.1 796 840 AGGTCAAAGCTGGCTA-1 1 8 GL456210.1 1229 1530 CATTGTGCAGCAATAA-1 1 11 GL456210.1 3546 3894 TTTCTTGCATCCCTCA-1 1 17 GL456210.1 4573 4702 GTGCAAGCATTGTGCA-1 2 ... ... ... ... ... ... 207573434 chrY 90829569 90829755 TAAGCCAGTGGATGTC-1 2 207573435 chrY 90829572 90829741 ATATAGGCAAGCTTTG-1 1 207573441 chrY 90829611 90829762 GATTACTCAACAACAA-1 1 207573442 chrY 90829613 90829762 GATTACTCAACAACAA-1 1 207573451 chrY 90836714 90837044 CATGCGCAGTAGGATG-1 1

[73288379 rows x 5 columns], 'POMC.5.Multiome': Chromosome Start End Name Score 5 GL456210.1 634 1461 GTCCTCAGTGGATGTC-1 2 12 GL456210.1 671 756 AATTTCCTCCAGGAAA-1 2 17 GL456210.1 713 929 AACAAAGGTTGTAACG-1 3 21 GL456210.1 766 926 GACCTGCAGTAGCGGG-1 3 29 GL456210.1 788 830 GCTGTACCAACACCTA-1 3 ... ... ... ... ... ... 224486712 chrY 90836633 90836956 CATCACACAGTTTCTC-1 1 224486714 chrY 90836640 90836776 GGTATTTCAACTCGCG-1 3 224486715 chrY 90836649 90836764 GAGGTACAGGGATGAC-1 3 224486716 chrY 90836651 90836764 GAGGTACAGGGATGAC-1 1 224486719 chrY 90836676 90836828 TTTGAGTCACCAACCG-1 1

[87360539 rows x 5 columns]}



It looks like pd.DataFrame does not match to Dict[str, pd.DataFrame] or pr.PyRanges. But how?
Can you help me to sovle the issue? Many thanks!

-Li 

**Version (please complete the following information):**
 - Python: [Python 3.10]
 - SCENIC+: [1.0.1.dev6+ge5ba6fc]
 - pycisTopic : [1.0.3.dev21+ge9b0e1a]
ethanfenton commented 6 months ago

I had this same error. I was able to get around this by changing the pyranges/out.py file that you have here: /home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py.

In the _to_bigwig() function I changed:

if not divide:
    gr = self.to_rle(rpm=rpm, strand=False, value_col=value_col).to_ranges()

to

if not divide:
    gr = self
SeppeDeWinter commented 6 months ago

Hi @LILI-0000-0002-8173-7367 and @ethanfenton

Thank you for the bug report.

We are currently in the process of rewriting this function, see this branch https://github.com/aertslab/pycisTopic/tree/improve_create_pseudobulk.

The code in this branch should already be functional (and fully compatible with downstream functions), although not optimal yet.

After optimising and testing the code some more we will merge it in the main branch, but feel free to already use it.

All the best,

Seppe

LMBei3353 commented 5 months ago

Hi Seppe,Thanks for all your work. After replacing the files in the original .lib/pycisTopic folder with those from https://github.com/aertslab/pycisTopic/tree/improve_create_pseudobulk, when I run the export_pseudobulk function, it prompts me that the fragments.tsv.gz.tbi file is no found. However, this file is not present in my data. As a beginner, I am not sure how to obtain the .tbi file. Thanks!

Abigail575 commented 5 months ago

Hi LMBei3353,

It should be in the same folder as your atac_fragment.tsv file. The paths may look something like this: ...Count/Sample1/outs/atac_fragments.tsv.gz ...Count/Sample1/outs/atac_fragments.tsv.gz.tbi

Hope you find it! Abigail

LILI-0000-0002-8173-7367 commented 5 months ago

Hello, @SeppeDeWinter.

I can run export_pseudobulk with the improved code. Thanks again for your help.

Li

LILI-0000-0002-8173-7367 commented 5 months ago

@ethanfenton

Thanks for your input.

LMBei3353 commented 5 months ago

@Abigail575 Thanks for your help.I successfully compressed the 'fragments.tsv.gz' file into a '.tbi' file using the tabix tool, and it ran without any issues.