aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
Other
58 stars 12 forks source link

compute_qc_stats #126

Closed TingTingShao closed 8 months ago

TingTingShao commented 8 months ago

Hi,

With earlier version of pycistopic, I was able to run:

metadata_bc, profile_data_dict = compute_qc_stats(
                fragments_dict = fragments_dict,
                tss_annotation = annot,
                stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
                label_list = None,
                path_to_regions = path_to_regions,
                n_cpu = 1,
                valid_bc = None,
                n_frag = 100,
                n_bc = None,
                tss_flank_window = 1000,
                tss_window = 50,
                tss_minimum_signal_window = 100,
                tss_rolling_window = 10,
                remove_duplicates = True,
                _temp_dir = os.path.join(tmp_dir + 'ray_spill'))

But with version 2.0a0, I was not able to run this commad, I guess now it follows

def compute_qc_stats(
    fragments_df_pl: pl.DataFrame,
    regions_df_pl: pl.DataFrame,
    tss_annotation: pl.DataFrame,
    tss_flank_window: int = 2000,
    tss_smoothing_rolling_window: int = 10,
    tss_minimum_signal_window: int = 100,
    tss_window: int = 50,
    tss_min_norm: float = 0.2,
    use_genomic_ranges: bool = True,
    min_fragments_per_cb: int = 10,
    collapse_duplicates: bool = True,
    no_threads: int = 8,
)

But this command returns different data formats than before, I am following the tutorial https://scenicplus.readthedocs.io/en/latest/mix_melanoma_cell_lines.html

Could I please ask

  1. how to continue with the tutorial with the updated version of pycistopic
  2. If this is an easier/efficient way, how to generate cistopic with anndata

Before you suggested two solutions in my case https://github.com/aertslab/pycisTopic/discussions/116

But with the solution

cistopic_obj = create_cistopic_object(
    fragment_matrix = adat.X,
    cell_names = adat.obs_names,
    region_names = adat.var_names)

Error persists

ValueError in file /vsc-hard-mounts/leuven-data/351/vsc35107/master_thesis/pycistopic_pipeline/Snakefile, line 53:
8458 columns passed, passed data had 6062095 columns

I tried

cistopic_obj = create_cistopic_object(
    fragment_matrix = adat.X.T,
    cell_names = adat.obs_names,
    region_names = adat.var_names)

error

2024-03-29 22:03:06,703 snakemake.logging ERROR    RuleException:
ValueError in file /vsc-hard-mounts/leuven-data/351/vsc35107/master_thesis/pycistopic_pipeline/Snakefile, line 53:
setting an array element with a sequence.

I followed the thread here https://github.com/aertslab/pycisTopic/issues/40 but it consumes a lot of memroty

Unable to allocate 382. GiB for an array with shape (6062095, 8458) and data type int64

Looking forward to your reply Many thanks tingting

TingTingShao commented 8 months ago

Update:

The anndata (from snapATAC2) is generated after QC, though the first run with the earlier version of pycistopic has filtered more cells in the QC step, I think it might be okay to create cistopic with cells filtered in snapATAC2.

So I tried:

rule create_cisobject:
    input:
        # adata="microglia_1.h5ad",
        frag_paths_pkl="030results/frag_paths.pkl",
        tsv="030results/cell_data.tsv",
        consensus_regions="033results_consensus_peak_calling/consensus_regions.bed"
    output:
        obj_pkl="cistopic_obj1.pkl"
    run:
        cell_data = pd.read_csv(input.tsv, sep = '\t')
        cell_data['barcode'] = cell_data['sample']+':'+ cell_data['barcode'] 
        fragments_dict=pickle.load(open(input.frag_paths_pkl, 'rb'))
        unique_samples = set(cell_data['sample'])
        path_to_regions= {sample: input.consensus_regions for sample in unique_samples} 
        cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                                    path_to_regions=path_to_regions[key],
                                                    path_to_blacklist=config['path_to_blacklist'],
                                                    # metrics=metadata_bc[key],
                                                    valid_bc=cell_data['barcode'],
                                                    n_cpu=1,
                                                    project=key) for key in fragments_dict.keys()]     
        cistopic_obj = merge(cistopic_obj_list)
        cistopic_obj.add_cell_data(cell_data[['sample']])
        pickle.dump(cistopic_obj,
                    open(output.obj_pkl, 'wb'))   

However error came out as:

Length mismatch: Expected axis has 9373 elements, new values have 9265 elements
  File "/vsc-hard-mounts/leuven-data/351/vsc35107/master_thesis/pycistopic_pipeline/Snakefile", line 244, in __rule_create_cisobject
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 140, in add_cell_data
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/generic.py", line 5920, in __setattr__
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/generic.py", line 822, in _set_axis
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 228, in set_axis
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/internals/base.py", line 70, in _validate_set_axis

Any idea on this?

Supplementary info:

AnnData object with n_obs × n_vars = 8458 × 6062095
    obs: 'sample', 'region', 'subject', 'ad', 'tsse', 'leiden_mnc_1'
    var: 'count', 'selected'
    uns: 'AnnDataSet', 'reference_sequences', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'X_spectral_mnc_sample', 'X_umap', 'X_umap_mnc_sample', 'fragment_paired'
    obsp: 'distances'

Thanks tingting

TingTingShao commented 8 months ago

Solved with

    run:
        cell_data = pd.read_csv(input.tsv, sep = '\t')
        cell_data['barcode'] = cell_data['sample']+':'+ cell_data['barcode'] 
        fragments_dict=pickle.load(open(input.frag_paths_pkl, 'rb'))
        unique_samples = set(cell_data['sample'])
        path_to_regions= {sample: input.consensus_regions for sample in unique_samples} 
        cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                                    path_to_regions=path_to_regions[key],
                                                    path_to_blacklist=config['path_to_blacklist'],
                                                    # metrics=metadata_bc[key],
                                                    valid_bc=cell_data['barcode'],
                                                    n_cpu=1,
                                                    split_pattern= ":",
                                                    project=key) for key in fragments_dict.keys()]     
        cistopic_obj = merge(cistopic_obj_list)
        cistopic_obj.add_cell_data(cell_data[['sample']])
        pickle.dump(cistopic_obj,
                    open(output.obj_pkl, 'wb'))  
yojetsharma commented 2 months ago

Update:

The anndata (from snapATAC2) is generated after QC, though the first run with the earlier version of pycistopic has filtered more cells in the QC step, I think it might be okay to create cistopic with cells filtered in snapATAC2.

So I tried:

rule create_cisobject:
    input:
        # adata="microglia_1.h5ad",
        frag_paths_pkl="030results/frag_paths.pkl",
        tsv="030results/cell_data.tsv",
        consensus_regions="033results_consensus_peak_calling/consensus_regions.bed"
    output:
        obj_pkl="cistopic_obj1.pkl"
    run:
        cell_data = pd.read_csv(input.tsv, sep = '\t')
        cell_data['barcode'] = cell_data['sample']+':'+ cell_data['barcode'] 
        fragments_dict=pickle.load(open(input.frag_paths_pkl, 'rb'))
        unique_samples = set(cell_data['sample'])
        path_to_regions= {sample: input.consensus_regions for sample in unique_samples} 
        cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                                    path_to_regions=path_to_regions[key],
                                                    path_to_blacklist=config['path_to_blacklist'],
                                                    # metrics=metadata_bc[key],
                                                    valid_bc=cell_data['barcode'],
                                                    n_cpu=1,
                                                    project=key) for key in fragments_dict.keys()]     
        cistopic_obj = merge(cistopic_obj_list)
        cistopic_obj.add_cell_data(cell_data[['sample']])
        pickle.dump(cistopic_obj,
                    open(output.obj_pkl, 'wb'))   

However error came out as:

Length mismatch: Expected axis has 9373 elements, new values have 9265 elements
  File "/vsc-hard-mounts/leuven-data/351/vsc35107/master_thesis/pycistopic_pipeline/Snakefile", line 244, in __rule_create_cisobject
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 140, in add_cell_data
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/generic.py", line 5920, in __setattr__
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/generic.py", line 822, in _set_axis
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 228, in set_axis
  File "/data/leuven/351/vsc35107/miniconda3/envs/scenicplus/lib/python3.11/site-packages/pandas/core/internals/base.py", line 70, in _validate_set_axis

Any idea on this?

Supplementary info:

AnnData object with n_obs × n_vars = 8458 × 6062095
    obs: 'sample', 'region', 'subject', 'ad', 'tsse', 'leiden_mnc_1'
    var: 'count', 'selected'
    uns: 'AnnDataSet', 'reference_sequences', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'X_spectral_mnc_sample', 'X_umap', 'X_umap_mnc_sample', 'fragment_paired'
    obsp: 'distances'

Thanks tingting

I have also run snapatac2 for my dataset (1-healthy and 2-patients concantated). Can you please help in how can I use the anndata from snapatac2 for pycisTopic?