Closed TingTingShao closed 6 months ago
Hi TingTing
Can you check which barcodes are in the cistopic object and not the anndata and vice versa
cell_in_cistopic_not_anndata = set(cistopic_obj.cell_names) - set(adat2["barcode"])
cell_in_anndata_not_cistopic = set(adat2["barcode"]) - set(cistopic_obj.cell_names)
Best,
Seppe
Hi,
Thanks for your reply.
The output is as follows:
cell_in_cistopic_not_anndata = set(cistopic_obj.cell_names) - set(cell_data['barcode'])
cell_in_cistopic_not_anndata
{'CCTGGGAAGATACCAA-1:D19-13185___D19-13185',
'GAGCGCTTCAACTTGG-1:D19-12637___D19-12637',
'CCCACATAGTATGGGC-1:D19-122593___D19-122593',
'AACATCGTCATCGCCT-1:D19-12692___D19-12692',
'ACAGCGCTCGGGACAA-1:D19-12651___D19-12651',
'AAACGAACATGGCCTG-1:D19-8619___D19-8619',
'TTTGGCCTCGTGAACT-1:D19-8256___D19-8256',
'TTTGAGGTCGTTACAG-1:D19-12640___D19-12640',
'AGCGTATTCGGTTCCT-1:D19-12679___D19-12679',
'CAACCAACAGCAAACG-1:D19-8631___D19-8631',
'ATTTGTCAGTAATGTG-1:D19-8577___D19-8577',
'CGGTGCAAGGGAGATA-1:D19-122593___D19-122593',
'TGCTTCGCACTGCTCT-1:D19-12651___D19-12651',
'GGCGTTGGTAAACGTA-1:D19-8250___D19-8250',
'TACGCCTGTATGTTCG-1:D19-13165___D19-13165',
'CGATGATTCTCGTGAA-1:D19-12539___D19-12539',
'GCACCTTAGGCAGTGT-1:D19-13165___D19-13165',
'TCGATTTAGCTTCAAC-1:D19-8603___D19-8603',
'CGGTGCAGTCGCGCTA-1:D19-13165___D19-13165',
'ACTTCCGGTTAGGAGC-1:D19-12534___D19-12534',
'TCGCAGGAGACCATAA-1:D19-12691___D19-12691',
'GAAGTGGTCGGACGAA-1:D19-12637___D19-12637',
'GAAATGAGTAGCGGTA-1:D19-8577___D19-8577',
'TGATTTCTCCTAGAGT-1:D19-12544___D19-12544',
'TCAGTTTGTAACGTAA-1:D19-12684___D19-12684',
...
'GGGTGTCAGAACTCCT-1:D19-12691___D19-12691',
'TGGACATAGCATACCT-1:D19-12650___D19-12650',
'GAGCGCTTCTAGCAGT-1:D19-8247___D19-8247',
'TATGTGGTCTTGTACT-1:D19-125469___D19-125469',
...}
cell_in_anndata_not_cistopic
set()
To be sure there are more in cistopic, I tried to map the extra cells in the anndata. Nothing mapped.
Thanks, tingting
With the current split option, the following barcodes will match:
CCTGGGAAGATACCAA-1:D19-13185___D19-13185
CCTGGGAAGATACCAA-1:D19-45678___D19-45678
As only CCTGGGAAGATACCAA-1
is checked for a match. So this will pull in some wrong barcodes.
Instead of cell_data['barcode']
, you will have to have e.g. valid_bcs[key]
, where you subset your barcodes per fragments file.
I see, Thanks. I understand what you meant, will rerun later.
Can I ask a side question: how come the barcode in cisobject becomes CCTGGGAAGATACCAA-1:D19-45678___D19-45678
?
Thanks, tingting
it adds the sample name as ___${sample}
as it was originally made for CellRangerATAC fragment files which always have NNNNNNNNNNNNNNNN-1
. In the future there might be an option to disable this behaviour in case your barcodes already are unique over fragment files.
I see, thanks!
Instead I simply removed the NaNs, haven't run into some error with downstream analysis yet, not sure if it is problematic if I do this, could use your help here.
# remove NaNs
print(sum(cistopic_obj.cell_data['celltype'].isna()))
print(set(cistopic_obj.cell_names) - set(cell_data.index))
filtered_cell_data = cistopic_obj.cell_data.dropna(subset=['celltype'])
cistopic_obj.cell_data = filtered_cell_data
filtered_cell_names = [name for name in cistopic_obj.cell_names if name in filtered_cell_data.index]
cistopic_obj.cell_names = filtered_cell_names
Thanks, tingting
sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict()
cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
path_to_regions=path_to_regions[key],
path_to_blacklist=config['path_to_blacklist'],
# metrics=metadata_bc[key],
valid_bc=sample_to_barcodes[key],
n_cpu=8,
# split_pattern= ":",
project=key) for key in sample_to_barcodes.keys()]
This solved my problem.
Thanks tingting
sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict() cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key], path_to_regions=path_to_regions[key], path_to_blacklist=config['path_to_blacklist'], # metrics=metadata_bc[key], valid_bc=sample_to_barcodes[key], n_cpu=8, # split_pattern= ":", project=key) for key in sample_to_barcodes.keys()]
This solved my problem.
Thanks tingting
In my case, I also get the problem of NaNs in my sample_id and celltype column after adding the cell_data to the cistopic obj. So, as I understand did this step sample_to_barcodes = cell_data.groupby('sample')['barcode'].agg(list).to_dict()
solve your issue?
Hi,
I retrived the cell data from an anndata, and then use valid_bc=cell_data['barcode'] to select the cells when generating cistopic_obj, but at the end there are more nuclei (4011) in cistopic_obj than in the anndata, could I please ask what is the reason? Thanks!
Code: cell_data from anndata
TingTing