aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

:fire: add chromosome scaffold filtering #88

Open mbuttner opened 1 year ago

This PR fixes the issue reported in https://github.com/aertslab/scenicplus/issues/61. How it's done: fragments on scaffold chromosomes are filtered out in the export_pseudobulk function, right after the fragments are loaded as a DataFrame, by matching the chromosome names against a regular expression with pandas' pd.Series.str.contains(). To support this, I added a new parameter to export_pseudobulk: chrom_filter = None.
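For reviewers, here is a minimal sketch of the filtering idea on a toy DataFrame. It is not the actual diff: the helper name `filter_scaffold_fragments`, the `Chromosome` column name, the toy values, and the "None means no filtering" behaviour are assumptions made for illustration.

```python
import pandas as pd


def filter_scaffold_fragments(fragments, chrom_filter=None):
    """Drop fragments whose chromosome name matches the regex `chrom_filter`.

    Hypothetical helper for illustration only. Assumes the chromosome names
    live in a column called "Chromosome" and that chrom_filter=None means
    "keep everything".
    """
    if chrom_filter is None:
        return fragments
    # str.contains interprets the pattern as a regular expression (regex=True)
    # and matches it anywhere in the chromosome name.
    is_scaffold = fragments["Chromosome"].astype(str).str.contains(chrom_filter, regex=True)
    print(f"Filtering out {int(is_scaffold.sum())} fragments.")
    return fragments.loc[~is_scaffold].reset_index(drop=True)


# Toy fragments table with made-up values.
fragments = pd.DataFrame({
    "Chromosome": ["chr1", "GL000194.1", "chr2", "KI270711.1", "chrX"],
    "Start": [100, 200, 300, 400, 500],
    "End": [150, 260, 340, 470, 590],
    "Name": ["AAAC-1", "AAAG-1", "AACT-1", "AAGC-1", "ACGT-1"],
    "Score": [1, 1, 2, 1, 1],
})

kept = filter_scaffold_fragments(fragments, chrom_filter="GL|KI")
```

Because str.contains matches anywhere in the name, "GL|KI" removes any chromosome whose name contains either substring. An anchored pattern such as "^(GL|KI)" is stricter if that matters for a given genome.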

Example following the SCENIC+ tutorial for 10X multiome data:

import os
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# cell_data, chromsizes, fragments_dict, work_dir and tmp_dir are set up as in the tutorial
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'celltype',               # variable by which to generate pseudobulk profiles, in this case we want pseudobulks per celltype
    sample_id_col = 'sample_id',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),    # specify where pseudobulk_bed_files should be stored
    bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),  # specify where pseudobulk_bw_files should be stored
    path_to_fragments = fragments_dict,  # location of fragment files
    chrom_filter = "GL|KI",              # new parameter: regular expression, fragments on matching chromosomes are filtered out
    n_cpu = 8,                           # specify the number of cores to use, we use ray for multi processing
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = tmp_dir,
    split_pattern = '-')

Output:

2023-08-22 11:16:41,366 cisTopic     INFO     Reading fragments from ../atac_fragments.tsv.gz
2023-08-22 11:19:37,550 cisTopic     INFO     Filtering out 33056 fragments.
2023-08-22 11:20:42,732 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/ 
(export_pseudobulk_ray pid=3011257) 2023-08-22 11:20:46,836 cisTopic     INFO     Creating pseudobulk for CT1
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:46,829 cisTopic     INFO     Creating pseudobulk for CT2
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:47,958 cisTopic     INFO     Creating pseudobulk for CT3
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:50,278 cisTopic     INFO     CT3 done!

Thank you for considering.