Closed aaaaaaaaaayong closed 3 months ago
Hi @aaaaaaaaaayong
The function export_pseudobulk
will generate a fragments file for each variable (for you cell_type
), so that's where this file is supposed to come from. This error is generated because this file is not generated in your case (while it should be).
This can be caused when the function is not able to match cell barcodes between your cell_data
dataframe and the fragments file.
The reason for this is probably because you appended sample ids to the barcodes in your fragments file ("500_" for example). It's not necessary to merge your fragment files before running pycistopic/SCENIC+. You can just specify multiple files in your fragments_dict
variable. Is there any particular reason why you are merging the files beforehand?
Also could you show the head of cell_data
and one of your fragment files?
All the best,
Seppe
@SeppeDeWinter
Thank you very much for your reply,but I still haven't solved the problem, I tried without merging and just inputting one fragments file, It still reports the error.
Also I checked the cell barcodes between my cell_data dataframe and the fragments file , and didn't find any difference between them. I added "_n" (for example "_1") at the end of the barcodes to distinguish different sample sources.
This is one of my fragment files:
chr1 10073 10351 CGCAATGTCACCTGTC-1_16 1
chr1 10079 10303 GCTTAACAGCGTGCGT-1_16 1
chr1 10083 10334 TGTAACTCAATCCCTT-1_16 1
chr1 10084 10334 AAAGCTTGTTTGGGCG-1_16 1
chr1 10085 10279 AATTGCTCATTGTTGG-1_16 1
chr1 10085 10303 TACTCAAAGGCCTTAG-1_16 1
chr1 10085 10308 CGTGAGGAGACTTATG-1_16 1
chr1 10085 10333 TCTTGACGTCAACAAT-1_16 1
chr1 10091 10279 TAGCCTCTCCCTGGAA-1_16 1
chr1 10091 10285 ATCGAGGCAGGGAGCT-1_16 1
chr1 10091 10315 TAGTAGGAGTGATTCA-1_16 1
chr1 10097 10277 CGCAATGTCCCAGTAG-1_16 1
chr1 10102 10279 GCTCACAAGTAGCTTA-1_16 1
And head of cell_data:
>>> cell_data
Cancer Case_ID Chemistry Piece_ID ... seurat_clusters total unmapped sample_id
CellID ...
AAACAGCCAAGGACCA-1_1 CESC CE340 Multiome CE340E1-S1N1 ... 4.0 6949.0 105.0 pancan
AAACAGCCATTAAAGG-1_1 CESC CE340 Multiome CE340E1-S1N1 ... 1.0 4585.0 74.0 pancan
AAACATGCACATGCTA-1_1 CESC CE340 Multiome CE340E1-S1N1 ... 1.0 12807.0 152.0 pancan
AAACATGCAGCTTAGC-1_1 CESC CE340 Multiome CE340E1-S1N1 ... 25.0 55895.0 609.0 pancan
AAACATGCATCCATCT-1_1 CESC CE340 Multiome CE340E1-S1N1 ... 1.0 7527.0 95.0 pancan
... ... ... ... ... ... ... ... ... ...
TTTGTGAAGTAGGATG-1_24 OV VF050V1 Multiome VF050V1-S2N1 ... 4.0 8639.0 127.0 pancan
TTTGTGGCAAACCCTA-1_24 OV VF050V1 Multiome VF050V1-S2N1 ... 17.0 11327.0 178.0 pancan
TTTGTGGCATAATCCG-1_24 OV VF050V1 Multiome VF050V1-S2N1 ... 17.0 9034.0 138.0 pancan
TTTGTGTTCTAATCCT-1_24 OV VF050V1 Multiome VF050V1-S2N1 ... 17.0 4181.0 73.0 pancan
TTTGTTGGTTATAGCG-1_24 OV VF050V1 Multiome VF050V1-S2N1 ... 4.0 10381.0 168.0 pancan
[131155 rows x 64 columns]
Thanks again for your help and Wish you all the best.
ayong
Hi Ayong
Can you try without adding the "_n", cistopic will automatically append sample ids to your cell barcodes.
All the best,
Seppe
HI HI Seppe
Your guidance helped me find an idea. This problem has been solved.
And because the data sources are different,there are a lot of duplications in their barcodes .So I added my own tags.
There are cores to solving the problem: 1)Index names must contain the BARCODE (e.g. ATGTCGTC-1), additional tags are possible separating with - (e.g. ATGCTGTGCG-1-Sample_1) 2)I added a column called 'barcode',which will be used instead of the index names 3) The keys of the dictionary need to match with the sample_id tag added to the index names of the input data frame(now that is 'barcode').
Such as:
>>> cell_data['barcode']
CellID
AAACAGCCAAGGACCA-1_1 AAACAGCCAAGGACCA-1-Sample_1
AAACAGCCATTAAAGG-1_1 AAACAGCCATTAAAGG-1-Sample_1
AAACATGCACATGCTA-1_1 AAACATGCACATGCTA-1-Sample_1
AAACATGCAGCTTAGC-1_1 AAACATGCAGCTTAGC-1-Sample_1
AAACATGCATCCATCT-1_1 AAACATGCATCCATCT-1-Sample_1
...
TTTGTGAAGTAGGATG-1_24 TTTGTGAAGTAGGATG-1-Sample_24
TTTGTGGCAAACCCTA-1_24 TTTGTGGCAAACCCTA-1-Sample_24
TTTGTGGCATAATCCG-1_24 TTTGTGGCATAATCCG-1-Sample_24
TTTGTGTTCTAATCCT-1_24 TTTGTGTTCTAATCCT-1-Sample_24
TTTGTTGGTTATAGCG-1_24 TTTGTTGGTTATAGCG-1-Sample_24
>>> cell_data['Sample_id']
CellID
AAACAGCCAAGGACCA-1_1 Sample_1
AAACAGCCATTAAAGG-1_1 Sample_1
AAACATGCACATGCTA-1_1 Sample_1
AAACATGCAGCTTAGC-1_1 Sample_1
AAACATGCATCCATCT-1_1 Sample_1
...
TTTGTGAAGTAGGATG-1_24 Sample_24
TTTGTGGCAAACCCTA-1_24 Sample_24
TTTGTGGCATAATCCG-1_24 Sample_24
TTTGTGTTCTAATCCT-1_24 Sample_24
TTTGTTGGTTATAGCG-1_24 Sample_24
less CE340E1-S1_fragments.tsv.gz
chr1 10072 10278 GTAGGATCACCTAATG-1-Sample_1 1
chr1 10079 10302 CGCCACACAAGCTAAA-1-Sample_1 1
chr1 10079 10327 ATTAGCTCACCATATG-1-Sample_1 1
chr1 10084 10436 GATGCTTAGGGACTAA-1-Sample_1 1
chr1 10090 10333 CCGTTTGGTGGGAACA-1-Sample_1 1
chr1 10131 10279 CCAACCAAGACAAAGT-1-Sample_1 1
chr1 10132 10279 CCAACCAAGACAAAGT-1-Sample_1 1
chr1 10151 10185 GGTACCGGTTGGTTAG-1-Sample_1 2
chr1 10151 10215 CTTGCATGTTTAACCC-1-Sample_1 2
chr1 10151 10216 GAGGGAGCAAGGCCAA-1-Sample_1 1
fragments_dict = {
'Sample_1': os.path.join(work_dir, 'scATAC/test/CE340E1-S1_fragments.tsv.gz'),
'Sample_2': os.path.join(work_dir, 'scATAC/test/CM1563C1-T1_fragments.tsv.gz'),
...
'Sample_23': os.path.join(work_dir, 'scATAC/test/VF034V1-T1_fragments.tsv.gz'),
'Sample_24': os.path.join(work_dir, 'scATAC/test/VF050V1-S2_fragments.tsv.gz')
}
Thanks for your help and Wish you all the best.
Ayong
Hi @aaaaaaaaaayong
Great to hear that your issue is solved.
I wish you all the best as well and good luck with your analyses.
I'm closing this issue now feel free to open new ones.
Seppe
Describe the bug
When I apply my data to export_pseudobulk function of SCENIC+ pipeline,I meet error. I don't know where this file(ADM.fragments.tsv.gz ) came from ,I have never imported this file.
Code
Error output
Additional context
This fragments file is created by merging multiple fragments files. I refer to this process
Version (please complete the following information):**