aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
Other
56 stars 12 forks source link

cistopic_obj has unexpected more nuclei number #131

Closed TingTingShao closed 6 months ago

TingTingShao commented 6 months ago

Hi,

I retrived the cell data from an anndata, and then use valid_bc=cell_data['barcode'] to select the cells when generating cistopic_obj, but at the end there are more nuclei (4011) in cistopic_obj than in the anndata, could I please ask what is the reason? Thanks!

Code: cell_data from anndata

rule generate_cell_data:
    input:
        adata="microglia_1.h5ad"
    output:
        tsv="030results_cistopic/cell_data.tsv"
    run:
        adat=anndata.read_h5ad(input.adata)
        adat2=adat.copy()
        adat2.obs['barcode']=[x.rsplit(":",1)[1]for x in adat2.obs.index]
        adat2.obs[['barcode', 'sample']]
        # atac_bc=adat2.obs['barcode']
        cell_data=adat2.obs
        cell_data['celltype'] = cell_data[config['leiden_mnc_resolution']].astype(str) 
        cell_data.to_csv(output.tsv, sep = '\t', header = True, index = False)
cell_data = pd.read_csv(input.tsv, sep = '\t')
cell_data['barcode'] = cell_data['sample']+':'+ cell_data['barcode'] 
fragments_dict=pickle.load(open(input.frag_paths_pkl, 'rb'))
unique_samples = set(cell_data['sample'])
path_to_regions= {sample: input.consensus_regions for sample in unique_samples} 
cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                            path_to_regions=path_to_regions[key],
                                            path_to_blacklist=config['path_to_blacklist'],
                                            # metrics=metadata_bc[key],
                                            valid_bc=cell_data['barcode'], 
                                            n_cpu=1,
                                            split_pattern= ":",
                                            project=key) for key in fragments_dict.keys()]     
cistopic_obj = merge(cistopic_obj_list)

TingTing

SeppeDeWinter commented 6 months ago

Hi TingTing

Can you check which barcodes are in the cistopic object and not the anndata and vice versa


cell_in_cistopic_not_anndata = set(cistopic_obj.cell_names) - set(adat2["barcode"])
cell_in_anndata_not_cistopic = set(adat2["barcode"]) - set(cistopic_obj.cell_names)

Best,

Seppe

TingTingShao commented 6 months ago

Hi,

Thanks for your reply.

The output is as follows:

cell_in_cistopic_not_anndata = set(cistopic_obj.cell_names) - set(cell_data['barcode'])
cell_in_cistopic_not_anndata
{'CCTGGGAAGATACCAA-1:D19-13185___D19-13185',
 'GAGCGCTTCAACTTGG-1:D19-12637___D19-12637',
 'CCCACATAGTATGGGC-1:D19-122593___D19-122593',
 'AACATCGTCATCGCCT-1:D19-12692___D19-12692',
 'ACAGCGCTCGGGACAA-1:D19-12651___D19-12651',
 'AAACGAACATGGCCTG-1:D19-8619___D19-8619',
 'TTTGGCCTCGTGAACT-1:D19-8256___D19-8256',
 'TTTGAGGTCGTTACAG-1:D19-12640___D19-12640',
 'AGCGTATTCGGTTCCT-1:D19-12679___D19-12679',
 'CAACCAACAGCAAACG-1:D19-8631___D19-8631',
 'ATTTGTCAGTAATGTG-1:D19-8577___D19-8577',
 'CGGTGCAAGGGAGATA-1:D19-122593___D19-122593',
 'TGCTTCGCACTGCTCT-1:D19-12651___D19-12651',
 'GGCGTTGGTAAACGTA-1:D19-8250___D19-8250',
 'TACGCCTGTATGTTCG-1:D19-13165___D19-13165',
 'CGATGATTCTCGTGAA-1:D19-12539___D19-12539',
 'GCACCTTAGGCAGTGT-1:D19-13165___D19-13165',
 'TCGATTTAGCTTCAAC-1:D19-8603___D19-8603',
 'CGGTGCAGTCGCGCTA-1:D19-13165___D19-13165',
 'ACTTCCGGTTAGGAGC-1:D19-12534___D19-12534',
 'TCGCAGGAGACCATAA-1:D19-12691___D19-12691',
 'GAAGTGGTCGGACGAA-1:D19-12637___D19-12637',
 'GAAATGAGTAGCGGTA-1:D19-8577___D19-8577',
 'TGATTTCTCCTAGAGT-1:D19-12544___D19-12544',
 'TCAGTTTGTAACGTAA-1:D19-12684___D19-12684',
...
 'GGGTGTCAGAACTCCT-1:D19-12691___D19-12691',
 'TGGACATAGCATACCT-1:D19-12650___D19-12650',
 'GAGCGCTTCTAGCAGT-1:D19-8247___D19-8247',
 'TATGTGGTCTTGTACT-1:D19-125469___D19-125469',
 ...}
cell_in_anndata_not_cistopic
set()

To be sure there are more in cistopic, I tried to map the extra cells in the anndata. Nothing mapped.

image

Thanks, tingting

ghuls commented 6 months ago

With the current split option, the following barcodes will match:

CCTGGGAAGATACCAA-1:D19-13185___D19-13185
CCTGGGAAGATACCAA-1:D19-45678___D19-45678

As only CCTGGGAAGATACCAA-1 is checked for a match. So this will pull in some wrong barcodes.

Instead of cell_data['barcode'], you will have to have e.g. valid_bcs[key], where you subset your barcodes per fragments file.

TingTingShao commented 6 months ago

I see, Thanks. I understand what you meant, will rerun later.

Can I ask a side question: how come the barcode in cisobject becomes CCTGGGAAGATACCAA-1:D19-45678___D19-45678?

Thanks, tingting

ghuls commented 6 months ago

it adds the sample name as ___${sample} as it was originally made for CellRangerATAC fragment files which always have NNNNNNNNNNNNNNNN-1. In the future there might be an option to disable this behaviour in case your barcodes already are unique over fragment files.

TingTingShao commented 6 months ago

I see, thanks!

Instead I simply removed the NaNs, haven't run into some error with downstream analysis yet, not sure if it is problematic if I do this, could use your help here.

# remove NaNs
print(sum(cistopic_obj.cell_data['celltype'].isna()))
print(set(cistopic_obj.cell_names) - set(cell_data.index))
filtered_cell_data = cistopic_obj.cell_data.dropna(subset=['celltype'])
cistopic_obj.cell_data = filtered_cell_data

filtered_cell_names = [name for name in cistopic_obj.cell_names if name in filtered_cell_data.index]
cistopic_obj.cell_names = filtered_cell_names

Thanks, tingting

TingTingShao commented 6 months ago
sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict()
cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                            path_to_regions=path_to_regions[key],
                                            path_to_blacklist=config['path_to_blacklist'],
                                            # metrics=metadata_bc[key],
                                            valid_bc=sample_to_barcodes[key],
                                            n_cpu=8,
                                            # split_pattern= ":",
                                            project=key) for key in sample_to_barcodes.keys()]   

This solved my problem.

Thanks tingting

yojetsharma commented 1 week ago
sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict()
cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
                                            path_to_regions=path_to_regions[key],
                                            path_to_blacklist=config['path_to_blacklist'],
                                            # metrics=metadata_bc[key],
                                            valid_bc=sample_to_barcodes[key],
                                            n_cpu=8,
                                            # split_pattern= ":",
                                            project=key) for key in sample_to_barcodes.keys()]   

This solved my problem.

Thanks tingting

In my case, I also get the problem of NaNs in my sample_id and celltype column after adding the cell_data to the cistopic obj. So, as I understand did this step sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict()solve your issue?