aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

export_pseudobulk never generates pseudobulk files #117

Open SITRR opened 7 months ago

SITRR commented 7 months ago

Hi!

I'm having an issue with the export_pseudobulk function, which has been reported previously (#41). The function runs but doesn't output any files, and the text output does not state "Done!" after "Creating pseudobulk for 75G" (75G being the name of one of my samples).

I've generated a cell_data data frame with three columns: 'barcode', which uses the same barcode format as the fragment files (e.g. 'TTTGTGTTCTTGTCGC-1'; I got these from the CellRanger singlecell.csv output file, let me know if that's not advisable); a second column, 'sample_id', with the name of the sample and fragments_dict key (e.g. 75G, repeated for the length of the 'barcode' column); and a third column, 'donor', identical to the 'sample_id' column. I use the 'donor' column as the input for 'variable' in the export_pseudobulk function (see below).

Should I be extracting the barcodes from the fragments file directly? I'm new to this and don't know how to do so in Python.
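If it helps, here is a hedged sketch of pulling barcodes straight out of a fragments file with pandas. The file name and the tiny demo file are made up; the only assumptions are the standard CellRanger fragments layout (tab-separated, barcode in column 4, '#' comment lines at the top).

```python
import gzip
import pandas as pd

def fragment_barcodes(path):
    # 10x fragment files are tab-separated with '#' comment lines at
    # the top; column 4 holds the cell barcode.
    cols = ["chrom", "start", "end", "barcode", "count"]
    frags = pd.read_table(
        path,
        comment="#",
        header=None,
        names=cols,
        usecols=["barcode"],
    )
    return sorted(frags["barcode"].unique())

# Tiny synthetic fragments file, just to show the call works.
with gzip.open("demo_fragments.tsv.gz", "wt") as f:
    f.write("# id=demo\n")
    f.write("chr1\t10\t50\tTTTGTGTTCTTGTCGC-1\t2\n")
    f.write("chr1\t60\t90\tAAACGAAAGAAACGCC-1\t1\n")

print(fragment_barcodes("demo_fragments.tsv.gz"))
# ['AAACGAAAGAAACGCC-1', 'TTTGTGTTCTTGTCGC-1']
```

Comparing this list against the barcodes in cell_data is a quick way to check that the formats match.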

This is the cell_data dataframe:

[Screenshot: Screen Shot 2024-03-21 at 1 55 52 PM]

This is the fragment file unzipped:

[Screenshot: Screen Shot 2024-03-21 at 3 45 50 PM]

This is the code I'm running:


fragments_dict = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/fragments.tsv.gz'}
path_to_regions = {
    '75G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed',
    '8561G': '/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_8561G_ATAC/outs/filtered_peak_bc_matrix/peaks.bed'}
path_to_blacklist= '/sc/arion/projects/DTR-EFGR/GBMATAC_SIR/Analysis/CopyscAT/hg38-blacklist.v2.bed'

import pandas as pd
df_75G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_75G.csv', header = 0, sep = ',', index_col = 0)
df_8561G = pd.read_csv(work_dir + 'scATAC_consensus_peakset/quality_control/barcodes/barcodes_8561G.csv', header = 0, sep = ',', index_col = 0)
print(df_75G.head())
print(df_8561G.head())

frames = [df_75G, df_8561G]
cell_data = pd.concat(frames) 
print(cell_data.head())
print(cell_data.tail())

import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)

import os
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
import ray
ray.shutdown()
#sys.stderr = open(os.devnull, "w")  # silence stderr
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'donor',                                                                     # variable by which to generate pseudobulk profiles
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/'),   # where pseudobulk bed files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bw_files/'), # where pseudobulk bigwig files should be stored
                 path_to_fragments = fragments_dict,                                                         # location of fragment files
                 n_cpu = 8,                                                                                  # number of cores; ray is used for multiprocessing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(temp_dir, 'ray_spill'))#,
                 #split_pattern = '_')
#sys.stderr = sys.__stderr__  # unsilence stderr

import pickle
pickle.dump(bed_paths,
            open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'wb'))
pickle.dump(bw_paths,
           open(os.path.join(work_dir, 'scATAC_consensus_peakset/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'wb'))
SeppeDeWinter commented 6 months ago

Hi @SITRR

What you are doing looks all right.

Could you run the following piece of code and send me the output?


sample = "75G"
variable = "75G"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)

Best,

Seppe

SITRR commented 6 months ago

Hi @SeppeDeWinter,

Disclaimer: I'm an absolute beginner with python.

The line _sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample] gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_sample_cell_data' is not defined

I tried changing sample_id_col to 'sample_id' (please refer to the disclaimer for my reasoning), but then _cell_type_to_cell_barcodes gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 8392, in groupby
    return DataFrameGroupBy(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 959, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 889, in get_grouper
    raise KeyError(gpr)
KeyError: '75G'
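For context, this KeyError is pandas saying there is no *column* named '75G' to group by; '75G' is a value in the frame, not a column name. A minimal illustration with a toy frame (all values made up):

```python
import pandas as pd

cell_data = pd.DataFrame({
    "barcode": ["TTTGTGTTCTTGTCGC-1"],
    "sample_id": ["75G"],
    "donor": ["75G"],
})

try:
    # "75G" is a value in the frame, not a column name,
    # so groupby cannot find it and raises KeyError
    cell_data.groupby("75G")
except KeyError as err:
    print("KeyError:", err)
```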

Best, Susana

SeppeDeWinter commented 6 months ago

Hi @SITRR

Can you show what your cell_data looks like?


print(cell_data)

All the best,

Seppe

SITRR commented 6 months ago

Hi @SeppeDeWinter,

[Screenshot: Screen Shot 2024-03-26 at 11 40 15 AM]

Thanks (seriously, I will be forever indebted), Susana

SeppeDeWinter commented 6 months ago

Hi Susana

My bad, the code should be like this


sample = "75G"
variable = "donor"
sample_id_col = "sample_id"
_sample_cell_data = cell_data.loc[cell_data[sample_id_col] == sample]
_cell_type_to_cell_barcodes = _sample_cell_data \
    .groupby(variable, group_keys=False)["barcode"] \
    .apply(list) \
    .to_dict()

print(_cell_type_to_cell_barcodes)
SITRR commented 6 months ago

Hi, @SeppeDeWinter!

It printed a huge list of barcodes:

[Screenshot: Screen Shot 2024-04-01 at 10 48 44 AM]

Is that to be expected? Should the output of len(_cell_type_to_cell_barcodes) be 1? I'm presuming that it shouldn't.

Best, Susana

SeppeDeWinter commented 6 months ago

Hi @SITRR

That looks all right, I don't immediately see anything that is off. Does the function not produce any output at all, not even for the other clusters?

Are you running it in a Jupyter notebook or via the command line? Could it be that your job crashed?

All the best,

Seppe

SITRR commented 6 months ago

Hi, @SeppeDeWinter!

I'm running it via the command line, as an LSF job. It runs successfully but only outputs the bed_paths.pkl and bw_paths.pkl files, nothing else. The output also doesn't say "Done!" as shown in the tutorial.

[Screenshot: Screen Shot 2024-04-03 at 4 20 07 PM]

I was really hoping to get this to work since there is a huge batch effect when I don't use a common peak set.

[Image: HarmonizedUMAP]

I wouldn't want to go forward with this cistopic object.

Best, Susana

SITRR commented 6 months ago

I tried following this user's pipeline (#59) to verify each input, but I get an error when trying to read the fragments file (unless I'm misunderstanding their code or their input files have a different format).

fragments = pd.read_table("/sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz", header = None)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/sc/arion/work/ramoss18/.conda/envs/SCENICplus_v3.8/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 52, saw 5
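For reference, CellRanger fragment files begin with '#'-prefixed header comment lines, which is the usual cause of this exact parser error: the parser locks onto one field per line from the comments, then hits a real five-column record. Telling pandas to skip comments sidesteps it. A sketch with a synthetic file:

```python
import gzip
import pandas as pd

# Synthetic stand-in for a CellRanger fragments.tsv.gz: a few
# '#' header lines followed by tab-separated fragment records.
with gzip.open("toy_fragments.tsv.gz", "wt") as f:
    f.write("# id=toy_sample\n")
    f.write("# description=header lines like these trip up read_table\n")
    f.write("chr1\t100\t250\tTTTGTGTTCTTGTCGC-1\t3\n")

fragments = pd.read_table(
    "toy_fragments.tsv.gz",
    header=None,
    comment="#",  # skip the header lines instead of parsing them
    names=["chrom", "start", "end", "barcode", "count"],
)
print(fragments.shape)  # (1, 5)
```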

Does this give you any hints?

Cheers, Susana

SeppeDeWinter commented 6 months ago

Hi @SITRR

Could you try running the command-line version of this code? Please see the documentation on how to do this: https://aertslab.github.io/scatac_fragment_tools/split.html.

This will involve generating the sample-to-fragment and cell-type-to-barcode definition files described in the documentation. After this you can run:


scatac_fragment_tools split \
    -f <PATH_TO_SAMPLE_TO_FRAGMENT_DEFINITION> \
    -b <PATH_TO_CELL_TYPE_TO_CELL_BARCODE_DEFINITION> \
    -c <CHROM_SIZES_FILENAME> \
    -o <PATH_TO_OUTPUT_FOLDER>

And this will do the same splitting as the function is trying to do.

All the best,

Seppe

SITRR commented 6 months ago

Thank you, @SeppeDeWinter!

How do I proceed after running this command? I'd like to extract the list of barcodes from the fragments file it generated to compare with the list I've been using to run the export_pseudobulk command (as follows), but I get a large 5.2 GB file, which doesn't seem right to me.

df = pd.read_table(work_dir + "Testing/75G.fragments.tsv.gz", header=None)
barcodes = df[df.columns[3]]
barcodes.to_csv(work_dir + 'Testing/barcodes.tsv', sep = "\t")

I tried using the new fragments file (output from scatac_fragment_tools split) but I'm getting the same results.
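One hedged guess at the 5.2 GB output: the snippet above writes one row per *fragment* rather than one row per *barcode*, and a fragments file has many fragments per cell. Deduplicating before saving keeps only the unique barcodes. A sketch (the demo file and paths are made up):

```python
import gzip
import pandas as pd

def unique_barcodes(fragments_path):
    # Read only column 4 (the barcode) of a fragments file and
    # deduplicate, instead of writing every fragment row to disk.
    cols = ["chrom", "start", "end", "barcode", "count"]
    return (
        pd.read_table(fragments_path, header=None, comment="#",
                      names=cols, usecols=["barcode"])["barcode"]
        .drop_duplicates()
    )

# Demo on a tiny synthetic file; swap in the real fragments path.
with gzip.open("tiny_fragments.tsv.gz", "wt") as f:
    f.write("chr1\t1\t50\tAAACGAAAGAAACGCC-1\t1\n")
    f.write("chr1\t60\t80\tAAACGAAAGAAACGCC-1\t2\n")
    f.write("chr2\t5\t40\tTTTGTGTTCTTGTCGC-1\t1\n")

bcs = unique_barcodes("tiny_fragments.tsv.gz")
bcs.to_csv("unique_barcodes.tsv", sep="\t", index=False)
print(len(bcs))  # 2
```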

Best, Susana

SeppeDeWinter commented 6 months ago

Hi @SITRR

I'm not sure whether I'm completely understanding your question... If this command ran successfully, you can just continue with your analysis.

To get the bed_paths variable (the one that is used in the tutorial), run:

import os

from scatac_fragment_tools.library.split.split_fragments_by_cell_type import (
    _santize_string_for_filename
)

bed_path = <PATH_TO_OUTPUT_FOLDER>
variable = "donor" #variable used to split the fragments

bed_paths = {}
for cell_type in cell_data[variable].unique():
    _bed_fname = os.path.join(
        bed_path,
        f"{_santize_string_for_filename(cell_type)}.fragments.tsv.gz")
    if os.path.exists(_bed_fname):
        bed_paths[cell_type] = _bed_fname
    else:
        print(f"Missing fragments for {cell_type}!")

Best,

S

SITRR commented 5 months ago

Hi @SeppeDeWinter!

I finally got the export_pseudobulk function to run by switching to Python v3.11 and the dev branch of SCENIC+. Oddly though, it only runs with two samples. The function immediately outputs the following message:

cisTopic INFO Reading fragments from /sc/arion/projects/DTR-EFGR/data_delivery/ATACseq_Human/TD005909_Nadejda_Tsankova/Sample_75G_ATAC/outs/fragments.tsv.gz

However, when I add two more samples to the fragment directory, the function outputs the following instead and gets stuck on this step for several hours:

INFO Splitting fragments by cell type.

Any idea what might be happening?

Best, Susana

SITRR commented 5 months ago

Crisis averted, @SeppeDeWinter!

Patience was all that was necessary. It's running smoothly now, but I'd like to keep the issue open until I get to the end of the pipeline as I might encounter other issues and would appreciate your guidance. If you'd like me to open a separate issue instead, I can do so.

Thank you for everything so far!