aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
186 stars 29 forks source link

QC files are not formed #450

Closed gilgolan73 closed 2 months ago

gilgolan73 commented 3 months ago

Describe the bug I try to run the tutorial for pycistopic (using 10x brain multiome data), on a centos 9 virtual machine. In the QC step, the relevant QC files are not formed, and the downstream commands don't work (error massages that files are missing). Please help me fix this issue. Thank you, Gil

To Reproduce

"from pycisTopic.plotting.qc_plot import plot_sample_stats, plot_barcode_stats
import matplotlib.pyplot as plt
for sample_id in fragments_dict:
    fig = plot_sample_stats(
        sample_id = sample_id,
        pycistopic_qc_output_dir = "outs/qc"
    )"

Error output Paste the entire output of the command, including log information prior to the error.

"Traceback (most recent call last):
  File "/home/gilgolan/.local/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-aeffa7ce86db>", line 2, in <module>
    fig = plot_sample_stats(
          ^^^^^^^^^^^^^^^^^^
  File "/home/gilgolan/.local/lib/python3.11/site-packages/pycisTopic/plotting/qc_plot.py", line 171, in plot_sample_stats
    raise FileNotFoundError(
FileNotFoundError: Could not find 10x_multiome_brain.fragments_insert_size_dist.parquet in outs/qc"
1. 10x_multiome_brain.pycistopic_qc.log
2. hg38.chrom_sizes_and_alias.tsv
3. tss.bed
"cat 10x_multiome_brain.pycistopic_qc.log 
2024-08-20 12:02:33,916 1s - INFO - pycisTopic.cli.subcommand.qc:qc - Loading TSS annotation from "outs/qc/tss.bed".
2024-08-20 12:02:34,091 1s - INFO - pycisTopic.cli.subcommand.qc:qc - Loading regions BED file from "outs/consensus_peak_calling/consensus_regions.bed".
2024-08-20 12:02:34,407 2s - INFO - pycisTopic.cli.subcommand.qc:qc - Loading fragments TSV file from "data/fragments.tsv.gz".
2024-08-20 12:03:48,065 75s - INFO - pycisTopic.cli.subcommand.qc:qc - Computing QC stats.
2024-08-20 12:03:48,077 75s - INFO - pycisTopic.qc:compute_qc_stats - Get basic fragments statistics per cell barcode.
2024-08-20 12:04:41,010 128s - INFO - pycisTopic.qc:compute_qc_stats - Get total fragment counts and unique fragment counts per region.
2024-08-20 12:05:52,705 200s - INFO - pycisTopic.qc:compute_qc_stats - Add fragment counts per region to fragments statistics per cell barcode.
2024-08-20 12:05:52,711 200s - INFO - pycisTopic.qc:compute_qc_stats - Get insert size distribution of fragments.
2024-08-20 12:05:59,830 207s - INFO - pycisTopic.qc:compute_qc_stats - Get TSS profile for fragments."

Expected behavior Sample level statistic graphs should appear.

Screenshots If applicable, add screenshots to help explain your problem or show the format of the input data for the command/s.

Version (please complete the following information):

Additional context the output of the previous commands in the QC section:

"path_to_blacklist="data/hg38-blacklist.v2.bed"
consensus_peaks = get_consensus_peaks(
    narrow_peaks_dict = narrow_peak_dict,
    peak_half_width = peak_half_width,
    chromsizes = chromsizes,
    path_to_blacklist = path_to_blacklist)
2024-08-20 11:31:01,658 cisTopic     INFO     Extending and merging peaks per class
2024-08-20 11:31:38,702 cisTopic     INFO     Normalizing peak scores
2024-08-20 11:31:38,978 cisTopic     INFO     Merging peaks
Warning! Start and End columns now have different dtypes: int64 and int32
2024-08-20 11:32:18,723 cisTopic     INFO     Done!
consensus_peaks.to_bed(
    path = os.path.join(out_dir, "consensus_peak_calling/consensus_regions.bed"),
    keep =True,
    compression = 'infer',
    chain = False)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
/home/gilgolan/.local/lib/python3.11/site-packages/pyranges/out.py:37: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  return pd.concat([outdf, df.get(noncanonical)], axis=1)
!pycistopic tss gene_annotation_list | grep Human
hsapiens_gene_ensembl   Human genes (GRCh38.p14)
mkdir -p outs/qc
!pycistopic tss get_tss \
--output outs/qc/tss.bed \
    --name "hsapiens_gene_ensembl" \
    --to-chrom-source ucsc \
    --ucsc hg38
- Get TSS annotation from Ensembl BioMart with the following settings:
  - biomart_name: "hsapiens_gene_ensembl"
  - biomart_host: "http://www.ensembl.org"
  - transcript_type: ['protein_coding']
  - use_cache: True
- Getting chromosome sizes and alias mapping for "hg38" from UCSC.
- Update chromosome names in TSS annotation to "ucsc" chromosome names.
- Writing TSS annotation BED file to "outs/qc/tss.bed".
!head outs/qc/tss.bed | column -t
#     Chromosome  Start  End      Gene  Score  Strand          Transcript_type
chrM  3306        3307   MT-ND1   .     +      protein_coding  
chrM  4469        4470   MT-ND2   .     +      protein_coding  
chrM  5903        5904   MT-CO1   .     +      protein_coding  
chrM  7585        7586   MT-CO2   .     +      protein_coding  
chrM  8365        8366   MT-ATP8  .     +      protein_coding  
chrM  8526        8527   MT-ATP6  .     +      protein_coding  
chrM  9206        9207   MT-CO3   .     +      protein_coding  
chrM  10058       10059  MT-ND3   .     +      protein_coding  
chrM  10469       10470  MT-ND4L  .     +      protein_coding  
!pycistopic qc \
--fragments data/fragments.tsv.gz \
    --regions outs/consensus_peak_calling/consensus_regions.bed \
    --tss outs/qc/tss.bed \
    --output outs/qc/10x_multiome_brain
"
ghuls commented 3 months ago

Probably you ran out of RAM in the "Get TSS profile for fragment" step. Try to reserve more RAM (several 10-s of GB) for your job and run pycistopic QC again:

pycistopic qc \
    --fragments data/fragments.tsv.gz \
    --regions outs/consensus_peak_calling/consensus_regions.bed \
    --tss outs/qc/tss.bed \
    --output outs/qc/10x_multiome_brain
# After:
2024-08-20 12:05:59,830 207s - INFO - pycisTopic.qc:compute_qc_stats - Get TSS profile for fragments."

# You should see the following output:
2024-07-16 09:55:38,677 41s - INFO - pycisTopic.qc:compute_qc_stats - Add TSS enrichment to fragments statistics per cell barcode.
2024-07-16 09:55:38,681 41s - INFO - pycisTopic.qc:compute_qc_stats - Calculate KDE for log10 unique fragments in peaks vs TSS enrichment.
2024-07-16 09:55:41,186 44s - INFO - pycisTopic.qc:compute_qc_stats - Calculate KDE for log10 unique fragments in peaks vs fractions of fragments in peaks.
2024-07-16 09:55:43,130 45s - INFO - pycisTopic.qc:compute_qc_stats - Calculate KDE for log10 unique fragments in peaks vs duplication ratio.
2024-07-16 09:55:44,046 46s - INFO - pycisTopic.qc:compute_qc_stats - Add probability density function (PDF) values to fragments statistics per cell barcode.
2024-07-16 09:55:44,046 46s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.fragments_stats_per_cb.parquet".
2024-07-16 09:55:44,096 46s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain4.fragments_insert_size_dist.parquet".
2024-07-16 09:55:44,097 46s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.tss_norm_matrix_sample.parquet".
2024-07-16 09:55:44,098 46s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.tss_norm_matrix_per_cb.parquet".
2024-07-16 09:55:51,346 54s - INFO - pycisTopic.cli.subcommand.qc:qc - Calculating Otsu thresholds.
2024-07-16 09:55:51,356 54s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.fragments_stats_per_cb_for_otsu_thresholds.parquet".
2024-07-16 09:55:51,373 54s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.fragments_stats_per_cb_for_otsu_thresholds.tsv".
2024-07-16 09:55:51,375 54s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.cbs_for_otsu_thresholds.tsv".
2024-07-16 09:55:51,376 54s - INFO - pycisTopic.cli.subcommand.qc:qc - Writing "outs/qc/10x_multiome_brain.otsu_thresholds.tsv".
2024-07-16 09:55:51,376 54s - INFO - pycisTopic.cli.subcommand.qc:qc - pycisTopic QC finished.
gilgolan73 commented 2 months ago

Hi, I am currently using a linux virtual machine allocated with ~24gb of RAM (the maximum I can allocate). I understand that I need to buy more RAM for my computer- but how much will be enough for completing the entire scenicplus pipeline?

Thank you, Gil

ghuls commented 2 months ago

The amount of memory needed depends on your dataset. With a bit of luck in the future less RAM will be needed. Polars is currently working on a streaming engine which should be able to handle bigger than memory dataframe operations.

ghuls commented 2 months ago

You can try to increase the min_fragments_per_cb parameter from 10 to a higher value, to reduce the memory usage. (Try to not increase it too much as otherwise the generate QC plots will be less interesting as all values for cell barcdes that appear less than that threshold value will not be calculated/plotted).

pycistopic qc ... --min_fragments_per_cb 20
gilgolan73 commented 2 months ago

Hi, thank you for the help. I bought more RAM (now have 128 GB) and the QC seems to be working well.