expected string or bytes-like object when running export_pseudobulk

anan81 commented 2 years ago

Hello, I got the following error when running export_pseudobulk() to generate pseudobulk ATAC-seq profiles. My data are separate scRNA-seq and scATAC-seq datasets from different cells of the same sample. All cell types have been annotated. The "sample_id" and "cluster" columns both have string data type. Do you have any idea how to solve this issue? Many thanks in advance.

TypeError                                 Traceback (most recent call last)
Input In [39], in <cell line: 5>()
      3 ray.shutdown()
      4 sys.stderr = open(os.devnull, "w")  # silence stderr
----> 5 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data_ATAC,
      6                  variable = 'cluster',  # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
      7                  sample_id_col = 'sample_id',
      8                  chromsizes = chromsizes,
      9                  bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
     10                  bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),  # specify where pseudobulk_bw_files should be stored
     11                  path_to_fragments = fragments_dict,  # location of fragment files
     12                  n_cpu = 8,  # specify the number of cores to use, we use ray for multi processing
     13                  normalize_bigwig = True,
     14                  remove_duplicates = True,
     15                  _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
     16                  split_pattern = '-')
     17 sys.stderr = sys.__stderr__

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:128, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs)
    122         fragments_df = fragments_df.loc[
    123             fragments_df["Name"].isin(cell_data["barcode"].tolist())
    124         ]
    125     else:
    126         fragments_df = fragments_df.loc[
    127             fragments_df["Name"].isin(
--> 128                 prepare_tag_cells(cell_data.index.tolist(), split_pattern)
    129             )
    130         ]
    131     fragments_df_dict[sample_id] = fragments_df
    133 # Set groups

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/utils.py:183, in prepare_tag_cells(cell_names, split_pattern)
    181 def prepare_tag_cells(cell_names, split_pattern="___"):
    182     if split_pattern == "-":
--> 183         new_cell_names = [
    184             re.findall(r"^[ACGT]-[0-9]+-", x)[0].rstrip("-")
    185             if len(re.findall(r"^[ACGT]-[0-9]+-", x)) != 0
    186             else x
    187             for x in cell_names
    188         ]
    189         new_cell_names = [
    190             re.findall(r"^\w-[0-9]", new_cell_names[i])[0].rstrip("-")
    191             if (len(re.findall(r"^\w-[0-9]", new_cell_names[i])) != 0)
   (...)
    194             for i in range(len(new_cell_names))
    195         ]
    196     else:

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/utils.py:185, in <listcomp>(.0)
    181 def prepare_tag_cells(cell_names, split_pattern="___"):
    182     if split_pattern == "-":
    183         new_cell_names = [
    184             re.findall(r"^[ACGT]-[0-9]+-", x)[0].rstrip("-")
--> 185             if len(re.findall(r"^[ACGT]-[0-9]+-", x)) != 0
    186             else x
    187             for x in cell_names
    188         ]
    189         new_cell_names = [
    190             re.findall(r"^\w-[0-9]", new_cell_names[i])[0].rstrip("-")
    191             if (len(re.findall(r"^\w-[0-9]", new_cell_names[i])) != 0)
   (...)
    194             for i in range(len(new_cell_names))
    195         ]
    196     else:

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/re.py:241, in findall(pattern, string, flags)
    233 def findall(pattern, string, flags=0):
    234     """Return a list of all non-overlapping matches in the string.
    235
    236     If one or more capturing groups are present in the pattern, return
   (...)
    239
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object

cbravo93 commented 2 years ago

Hi @anan81 !

Can you provide the exact command you are running? What does cell_data look like (can you post the head)?

Cheers!

Carmen

anan81 commented 2 years ago

Hi Carmen, this is the command I used:

import os
import sys

import ray
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# cell_data_ATAC, chromsizes, fragments_dict, work_dir and tmp_dir are assumed to be defined earlier in the session
ray.shutdown()
sys.stderr = open(os.devnull, "w")  # silence stderr
bw_paths, bed_paths = export_pseudobulk(
    input_data=cell_data_ATAC,
    variable='cluster',  # variable by which to generate pseudobulk profiles; here, one pseudobulk per cell type
    sample_id_col='sample_id',
    chromsizes=chromsizes,
    bed_path=os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # where the pseudobulk BED files should be stored
    bigwig_path=os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),  # where the pseudobulk bigWig files should be stored
    path_to_fragments=fragments_dict,  # location of fragment files
    n_cpu=8,  # number of cores to use; ray handles the multiprocessing
    normalize_bigwig=True,
    remove_duplicates=True,
    _temp_dir=os.path.join(tmp_dir, 'ray_spill'),
    split_pattern='-',
)
sys.stderr = sys.__stderr__  # unsilence stderr

And my cell data looks like this: [Screenshot 2022-09-08 at 16.23.43]

cbravo93 commented 2 years ago

Hi @anan81 !

I think I see the problem. Can you rename the cell_id column to barcode? By default, export_pseudobulk looks for a column called barcode; if that column is not present, it falls back to the index of the dataframe. In your case the index is not set to the cell barcodes, so non-string values reach the regular expression and the function crashes.
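For illustration, the failure mode described above can be reproduced with the standard library alone. This is a minimal sketch, not pycisTopic code: the list `index_values` is a hypothetical stand-in for `cell_data.index.tolist()` on a dataframe that still has the default integer index, and the pattern and the helper `findall_or_error` are illustrative names that are not part of the library.

```python
import re

# Assumption: stand-in for cell_data.index.tolist() when the dataframe
# still has the default integer index instead of cell barcodes.
index_values = [0, 1, 2]

# Illustrative barcode pattern (not the exact one used by pycisTopic).
pattern = r"^[ACGT]+-[0-9]+"


def findall_or_error(value):
    """Run re.findall and return either the matches or the error text."""
    try:
        return re.findall(pattern, value)
    except TypeError as exc:
        return f"TypeError: {exc}"


# An integer index entry cannot be searched with a regular expression.
int_result = findall_or_error(index_values[0])

# A barcode string, as found in a 'barcode' column, works as expected.
str_result = findall_or_error("AAACGAAAGTAGGTCC-1")

print(int_result)
print(str_result)
```

The integer case fails before any matching is attempted, which is why the traceback ends inside `re.findall` rather than in the pseudobulk logic itself.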

You can find further explanations on how it works here: https://pycistopic.readthedocs.io/en/latest/Single_sample_workflow-RTD.html

C

anan81 commented 2 years ago

Thank you, Carmen. It worked after I renamed cell_id column to barcode.
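For readers hitting the same error, the rename can be sketched with pandas as follows. The dataframe below is a hypothetical stand-in for cell_data_ATAC (the barcodes, sample names and cluster labels are made up), and the sanity check at the end is an extra suggestion that was not part of the original exchange.

```python
import pandas as pd

# Hypothetical stand-in for cell_data_ATAC: barcodes stored in a
# 'cell_id' column, default integer index.
cell_data = pd.DataFrame(
    {
        "cell_id": ["AAACGAAAGTAGGTCC-1", "AAACGAACATTGCGTA-1"],
        "sample_id": ["sample1", "sample1"],
        "cluster": ["B_cell", "T_cell"],
    }
)

# Rename the column so that a 'barcode' column is present.
cell_data_fixed = cell_data.rename(columns={"cell_id": "barcode"})

# Optional sanity check before calling export_pseudobulk:
# every barcode should be a string.
all_strings = all(isinstance(b, str) for b in cell_data_fixed["barcode"])

print(list(cell_data_fixed.columns))
print(all_strings)
```

`rename` returns a new dataframe and leaves the original untouched, so the renamed object is the one to pass as input_data.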

aertslab / scenicplus

expected string or bytes-like object when running export_pseudobulk #32