This PR fixes the issue reported in https://github.com/aertslab/scenicplus/issues/61.
How it's done: I introduced a new parameter for the export_pseudobulk function, chrom_filter (default: None). When it is set, it is used as a regular expression: right after the fragments are loaded as a DataFrame, fragments whose chromosome name matches the pattern (for example scaffold chromosomes) are filtered out with pandas' pd.Series.str.contains().
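For reviewers who want to see the idea in isolation, here is a minimal sketch of the filtering step. It is not the code from the diff: the helper name filter_chromosomes, the toy DataFrame, and the "Chromosome" column name are assumptions made for illustration only.

```python
from typing import Optional

import pandas as pd


def filter_chromosomes(fragments: pd.DataFrame,
                       chrom_filter: Optional[str] = None) -> pd.DataFrame:
    """Drop fragments whose chromosome name matches the regex `chrom_filter`.

    Hypothetical helper; assumes the chromosome names live in a column
    called "Chromosome".
    """
    if chrom_filter is None:
        # No pattern given: leave the fragments untouched.
        return fragments
    # True for every row whose chromosome name contains a regex match.
    mask = fragments["Chromosome"].astype(str).str.contains(chrom_filter, regex=True)
    # Keep only the rows that do NOT match.
    return fragments.loc[~mask].reset_index(drop=True)


# Toy fragments table (made-up values, for illustration only).
fragments = pd.DataFrame({
    "Chromosome": ["chr1", "GL000194.1", "chr2", "KI270711.1"],
    "Start": [100, 200, 300, 400],
    "End": [150, 260, 340, 480],
})

filtered = filter_chromosomes(fragments, chrom_filter="GL|KI")
n_removed = len(fragments) - len(filtered)
```

Note that str.contains() matches the pattern anywhere in the name, so "GL|KI" removes every chromosome whose name contains "GL" or "KI". A stricter pattern such as "^GL|^KI" would only match names that start with those prefixes.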
Example following the SCENIC+ tutorial for 10X multiome data:
import os
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# cell_data, chromsizes, work_dir, fragments_dict and tmp_dir are defined earlier in the tutorial
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'celltype',  # variable by which to generate pseudobulk profiles, in this case we want pseudobulks per celltype
    sample_id_col = 'sample_id',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
    bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),  # specify where pseudobulk_bw_files should be stored
    path_to_fragments = fragments_dict,  # location of fragment files
    chrom_filter = "GL|KI",  # new parameter: regular expression, fragments on matching chromosomes are filtered out
    n_cpu = 8,  # specify the number of cores to use, we use ray for multi processing
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = tmp_dir,
    split_pattern = '-')
Output:
2023-08-22 11:16:41,366 cisTopic INFO Reading fragments from ../atac_fragments.tsv.gz
2023-08-22 11:19:37,550 cisTopic INFO Filtering out 33056 fragments.
2023-08-22 11:20:42,732 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/
(export_pseudobulk_ray pid=3011257) 2023-08-22 11:20:46,836 cisTopic INFO Creating pseudobulk for CT1
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:46,829 cisTopic INFO Creating pseudobulk for CT2
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:47,958 cisTopic INFO Creating pseudobulk for CT3
(export_pseudobulk_ray pid=3011259) 2023-08-22 11:20:50,278 cisTopic INFO CT3 done!
Thank you for considering.