aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
163 stars 27 forks source link

error for export_pseudobulk #314

Closed aaaaaaaaaayong closed 3 months ago

aaaaaaaaaayong commented 3 months ago

Describe the bug

When I apply my data to export_pseudobulk function of SCENIC+ pipeline,I meet error. I don't know where this file(ADM.fragments.tsv.gz ) came from ,I have never imported this file.

Code

> work_dir = '/public/home/chenxy/data/Synapse/scenicplus/00_data.import'
> fragments_dict = {'pancan': os.path.join(work_dir, 'scATAC/fragments/gz/pancan_fragments.tsv.gz')}
> bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
>                  variable = 'cell_type',                                                                    
>                  sample_id_col = 'sample_id',
>                  chromsizes = chromsizes,
>                  bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  
>                  bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),
>                  path_to_fragments = fragments_dict,                                                     
>                  n_cpu = 8,                                                                                
>                  normalize_bigwig = True,
>                  temp_dir = os.path.join(tmp_dir, 'ray_spill'),
>                  split_pattern = '-')

Error output

 raise ValueError(f"Fragment file {path_to_fragment_file} does not exist.")
 ValueError: Fragment file /public/home/chenxy/data/Synapse/scenicplus/00_data.import/ray_spill/pancan/ADM.fragments.tsv.gz does not exist.

Additional context

This fragments file is created by merging multiple fragments files. I refer to this process

> gzip -dc atac_pbmc_500_nextgem_fragments.tsv.gz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"500_"$4,$5}' - > pbmc500_fragments.tsv
> gzip -dc atac_pbmc_1k_nextgem_fragments.tsv.gz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"1k_"$4,$5}' - > pbmc1k_fragments.tsv
> gzip -dc atac_pbmc_5k_nextgem_fragments.tsv.gz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"5k_"$4,$5}' - > pbmc5k_fragments.tsv
> gzip -dc atac_pbmc_10k_nextgem_fragments.tsv.gz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"10k_"$4,$5}' - > pbmc10k_fragments.tsv
> sort -m -k 1,1V -k2,2n pbmc500_fragments.tsv pbmc1k_fragments.tsv pbmc5k_fragments.tsv pbmc10k_fragments.tsv > fragments.tsv
> bgzip -@ 4 fragments.tsv
> tabix -p bed fragments.tsv.gz

Version (please complete the following information):**

SeppeDeWinter commented 3 months ago

Hi @aaaaaaaaaayong

The function export_pseudobulk will generate a fragments file for each variable (for you cell_type), so that's where this file is supposed to come from. This error is generated because this file is not generated in your case (while it should be).

This can be caused when the function is not able to match cell barcodes between your cell_data dataframe and the fragments file.

The reason for this is probably because you appended sample ids to the barcodes in your fragments file ("500_" for example). It's not necessary to merge your fragment files before running pycistopic/SCENIC+. You can just specify multiple files in your fragments_dict variable. Is there any particular reason why you are merging the files beforehand?

Also could you show the head of cell_data and one of your fragment files?

All the best,

Seppe

aaaaaaaaaayong commented 3 months ago

@SeppeDeWinter

Thank you very much for your reply,but I still haven't solved the problem, I tried without merging and just inputting one fragments file, It still reports the error.

Also I checked the cell barcodes between my cell_data dataframe and the fragments file , and didn't find any difference between them. I added "_n" (for example "_1") at the end of the barcodes to distinguish different sample sources.

This is one of my fragment files:

chr1    10073   10351   CGCAATGTCACCTGTC-1_16   1
chr1    10079   10303   GCTTAACAGCGTGCGT-1_16   1
chr1    10083   10334   TGTAACTCAATCCCTT-1_16   1
chr1    10084   10334   AAAGCTTGTTTGGGCG-1_16   1
chr1    10085   10279   AATTGCTCATTGTTGG-1_16   1
chr1    10085   10303   TACTCAAAGGCCTTAG-1_16   1
chr1    10085   10308   CGTGAGGAGACTTATG-1_16   1
chr1    10085   10333   TCTTGACGTCAACAAT-1_16   1
chr1    10091   10279   TAGCCTCTCCCTGGAA-1_16   1
chr1    10091   10285   ATCGAGGCAGGGAGCT-1_16   1
chr1    10091   10315   TAGTAGGAGTGATTCA-1_16   1
chr1    10097   10277   CGCAATGTCCCAGTAG-1_16   1
chr1    10102   10279   GCTCACAAGTAGCTTA-1_16   1

And head of cell_data:

>>> cell_data
                      Cancer  Case_ID Chemistry      Piece_ID  ...  seurat_clusters    total unmapped sample_id
CellID                                                         ...                                             
AAACAGCCAAGGACCA-1_1    CESC    CE340  Multiome  CE340E1-S1N1  ...              4.0   6949.0    105.0    pancan
AAACAGCCATTAAAGG-1_1    CESC    CE340  Multiome  CE340E1-S1N1  ...              1.0   4585.0     74.0    pancan
AAACATGCACATGCTA-1_1    CESC    CE340  Multiome  CE340E1-S1N1  ...              1.0  12807.0    152.0    pancan
AAACATGCAGCTTAGC-1_1    CESC    CE340  Multiome  CE340E1-S1N1  ...             25.0  55895.0    609.0    pancan
AAACATGCATCCATCT-1_1    CESC    CE340  Multiome  CE340E1-S1N1  ...              1.0   7527.0     95.0    pancan
...                      ...      ...       ...           ...  ...              ...      ...      ...       ...
TTTGTGAAGTAGGATG-1_24     OV  VF050V1  Multiome  VF050V1-S2N1  ...              4.0   8639.0    127.0    pancan
TTTGTGGCAAACCCTA-1_24     OV  VF050V1  Multiome  VF050V1-S2N1  ...             17.0  11327.0    178.0    pancan
TTTGTGGCATAATCCG-1_24     OV  VF050V1  Multiome  VF050V1-S2N1  ...             17.0   9034.0    138.0    pancan
TTTGTGTTCTAATCCT-1_24     OV  VF050V1  Multiome  VF050V1-S2N1  ...             17.0   4181.0     73.0    pancan
TTTGTTGGTTATAGCG-1_24     OV  VF050V1  Multiome  VF050V1-S2N1  ...              4.0  10381.0    168.0    pancan

[131155 rows x 64 columns]

Thanks again for your help and Wish you all the best.

ayong

SeppeDeWinter commented 3 months ago

Hi Ayong

Can you try without adding the "_n", cistopic will automatically append sample ids to your cell barcodes.

All the best,

Seppe

aaaaaaaaaayong commented 3 months ago

HI HI Seppe

Your guidance helped me find an idea. This problem has been solved.

And because the data sources are different,there are a lot of duplications in their barcodes .So I added my own tags.

There are cores to solving the problem: 1)Index names must contain the BARCODE (e.g. ATGTCGTC-1), additional tags are possible separating with - (e.g. ATGCTGTGCG-1-Sample_1) 2)I added a column called 'barcode',which will be used instead of the index names 3) The keys of the dictionary need to match with the sample_id tag added to the index names of the input data frame(now that is 'barcode').

Such as:

>>> cell_data['barcode']
CellID
AAACAGCCAAGGACCA-1_1      AAACAGCCAAGGACCA-1-Sample_1
AAACAGCCATTAAAGG-1_1      AAACAGCCATTAAAGG-1-Sample_1
AAACATGCACATGCTA-1_1      AAACATGCACATGCTA-1-Sample_1
AAACATGCAGCTTAGC-1_1      AAACATGCAGCTTAGC-1-Sample_1
AAACATGCATCCATCT-1_1      AAACATGCATCCATCT-1-Sample_1
                                     ...             
TTTGTGAAGTAGGATG-1_24    TTTGTGAAGTAGGATG-1-Sample_24
TTTGTGGCAAACCCTA-1_24    TTTGTGGCAAACCCTA-1-Sample_24
TTTGTGGCATAATCCG-1_24    TTTGTGGCATAATCCG-1-Sample_24
TTTGTGTTCTAATCCT-1_24    TTTGTGTTCTAATCCT-1-Sample_24
TTTGTTGGTTATAGCG-1_24    TTTGTTGGTTATAGCG-1-Sample_24
>>> cell_data['Sample_id']
CellID
AAACAGCCAAGGACCA-1_1      Sample_1
AAACAGCCATTAAAGG-1_1      Sample_1
AAACATGCACATGCTA-1_1      Sample_1
AAACATGCAGCTTAGC-1_1      Sample_1
AAACATGCATCCATCT-1_1      Sample_1
                           ...    
TTTGTGAAGTAGGATG-1_24    Sample_24
TTTGTGGCAAACCCTA-1_24    Sample_24
TTTGTGGCATAATCCG-1_24    Sample_24
TTTGTGTTCTAATCCT-1_24    Sample_24
TTTGTTGGTTATAGCG-1_24    Sample_24
less CE340E1-S1_fragments.tsv.gz
chr1    10072   10278   GTAGGATCACCTAATG-1-Sample_1     1
chr1    10079   10302   CGCCACACAAGCTAAA-1-Sample_1     1
chr1    10079   10327   ATTAGCTCACCATATG-1-Sample_1     1
chr1    10084   10436   GATGCTTAGGGACTAA-1-Sample_1     1
chr1    10090   10333   CCGTTTGGTGGGAACA-1-Sample_1     1
chr1    10131   10279   CCAACCAAGACAAAGT-1-Sample_1     1
chr1    10132   10279   CCAACCAAGACAAAGT-1-Sample_1     1
chr1    10151   10185   GGTACCGGTTGGTTAG-1-Sample_1     2
chr1    10151   10215   CTTGCATGTTTAACCC-1-Sample_1     2
chr1    10151   10216   GAGGGAGCAAGGCCAA-1-Sample_1     1
fragments_dict = {
                  'Sample_1': os.path.join(work_dir, 'scATAC/test/CE340E1-S1_fragments.tsv.gz'),
                  'Sample_2': os.path.join(work_dir, 'scATAC/test/CM1563C1-T1_fragments.tsv.gz'),
                                                                        ...    
                  'Sample_23': os.path.join(work_dir, 'scATAC/test/VF034V1-T1_fragments.tsv.gz'),
                  'Sample_24': os.path.join(work_dir, 'scATAC/test/VF050V1-S2_fragments.tsv.gz')
                  }

Thanks for your help and Wish you all the best.

Ayong

SeppeDeWinter commented 3 months ago

Hi @aaaaaaaaaayong

Great to hear that your issue is solved.

I wish you all the best as well and good luck with your analyses.

I'm closing this issue now feel free to open new ones.

Seppe