Chromsize file issue, snakemake doesn't proceed

yojetsharma commented 1 month ago

I am facing issues with the chromsizes file not getting saved. I have gone through the issues #429 #468 and implemented the solution offered in #357. But to no avail. As suggested in #429 I tried manually downloading these files as follows:

(scenicplus) [yojetsharma@pakeeza outs]$ scenicplus prepare_data download_genome_annotations \
> --species "hsapiens" \
> --genome_annotation_out_fname "/home/praghu/yojetsharma/pycistopic_final/outs/genome_annotation.tsv" \
> --chromsizes_out_fname "/home/praghu/yojetsharma/pycistopic_final/outs/chromsizes.tsv"
2024-10-17 11:29:33,134 Download gene annotation INFO     Using genome: GRCh38.p14
Could not find IdList on https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=GRCh38.p14
Returning gene annotation without subestting for assembled chromosomesand converting to UCSC style. Please make sure that the chromosome namesin the returned object match with the chromosome names in the scplus_obj.Chromosome sizes will not be returned
2024-10-17 11:29:34,268 SCENIC+      INFO     Chrosomome sizes was not found, please provide this information manually.
2024-10-17 11:29:34,269 SCENIC+      INFO     Saving genome annotation to: /home/praghu/yojetsharma/pycistopic_final/outs/genome_annotation.tsv

But still get the error that chromosome sizes not found. Then I downloaded the hg38.chrom.sizes file manually and ran the pipeline again but still get this error:

(scenicplus) [yojetsharma@pakeeza Snakemake]$ snakemake --cores 20
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                            count
---------------------------  -------
AUCell_direct                      1
AUCell_extended                    1
all                                1
download_genome_annotations        1
eGRN_direct                        1
eGRN_extended                      1
get_search_space                   1
motif_enrichment_dem               1
prepare_menr                       1
region_to_gene                     1
scplus_mudata                      1
tf_to_gene                         1
total                             12

Select jobs to execute...
Execute 1 jobs...

[Thu Oct 17 11:45:29 2024]
localrule download_genome_annotations:
    output: /home/praghu/yojetsharma/pycistopic_final/outs/genome_annotation.tsv, /home/praghu/yojetsharma/pycistopic_final/outs/chromsizes
    jobid: 8
    reason: Missing output files: /home/praghu/yojetsharma/pycistopic_final/outs/chromsizes
    resources: tmpdir=/tmp

2024-10-17 11:47:11,805 Download gene annotation INFO     Using genome: GRCh38.p14
Could not find IdList on https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=GRCh38.p14
Returning gene annotation without subestting for assembled chromosomesand converting to UCSC style. Please make sure that the chromosome namesin the returned object match with the chromosome names in the scplus_obj.Chromosome sizes will not be returned
2024-10-17 11:47:11,816 SCENIC+      INFO     Chrosomome sizes was not found, please provide this information manually.
2024-10-17 11:47:11,816 SCENIC+      INFO     Saving genome annotation to: /home/praghu/yojetsharma/pycistopic_final/outs/genome_annotation.tsv
Waiting at most 5 seconds for missing files.
MissingOutputException in rule download_genome_annotations in file /ncbs_gs/nlsas_data/usershares/praghu/yojetsharma/pycistopic_final/scplus_pipeline/Snakemake/workflow/Snakefile, line 221:
Job 8  completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/home/praghu/yojetsharma/pycistopic_final/outs/chromsizes
Removing output files of failed job download_genome_annotations since they might be corrupted:
/home/praghu/yojetsharma/pycistopic_final/outs/genome_annotation.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-17T114523.626937.snakemake.log
WorkflowError:
At least one job did not complete successfully.

Checking the Chromosomes in the metadata_regions of scplus_obj:

motif_enrichment_dict = {}
>>> scplus_obj=create_SCENICPLUS_obj(GEX_anndata=rna, cisTopic_obj=cistopic_obj,menr=motif_enrichment_dict)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'create_SCENICPLUS_obj' is not defined. Did you mean: 'create_SCENICPLUS_object'?
>>> scplus_obj=create_SCENICPLUS_object(GEX_anndata=rna, cisTopic_obj=cistopic_obj,menr=motif_enrichment_dict)
2024-10-17 12:53:33,717 cisTopic     INFO     Impute region accessibility for regions 360000-380000
2024-10-17 12:53:40,837 cisTopic     INFO     Impute region accessibility for regions 380000-400000
2024-10-17 12:53:41,439 cisTopic     INFO     Done!
>>> scplus_obj
SCENIC+ object with n_cells x n_genes = 51178 x 30761 and n_cells x n_regions = 51178 x 381351
    metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
    metadata_genes:'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    metadata_cell:'GEX_sample', 'GEX_n_genes_by_counts', 'GEX_log1p_n_genes_by_counts', 'GEX_total_counts', 'GEX_log1p_total_counts', 'GEX_pct_counts_in_top_50_genes', 'GEX_pct_counts_in_top_100_genes', 'GEX_pct_counts_in_top_200_genes', 'GEX_pct_counts_in_top_500_genes', 'GEX_total_counts_mt', 'GEX_log1p_total_counts_mt', 'GEX_pct_counts_mt', 'GEX_total_counts_ribo', 'GEX_log1p_total_counts_ribo', 'GEX_pct_counts_ribo', 'GEX_total_counts_hb', 'GEX_log1p_total_counts_hb', 'GEX_pct_counts_hb', 'GEX_n_genes', 'GEX_doublet_score', 'GEX_predicted_doublet', 'GEX_leiden', 'GEX_leiden_res_0.02', 'GEX_leiden_res_0.50', 'GEX_leiden_res_2.00', 'GEX_snapseed_annot', 'GEX_scanvi_snapseed_annot', 'GEX_broader_annotation', 'ACC_duplication_ratio', 'ACC_cisTopic_log_nr_acc', 'ACC_pdf_values_for_duplication_ratio', 'ACC_sample_id', 'ACC_unique_fragments_count', 'ACC_cisTopic_nr_acc', 'ACC_total_fragments_in_peaks_count', 'ACC_log10_unique_fragments_count', 'ACC_cisTopic_log_nr_frag', 'ACC_fraction_of_fragments_in_peaks', 'ACC_unique_fragments_in_peaks_count', 'ACC_pdf_values_for_fraction_of_fragments_in_peaks', 'ACC_log10_total_fragments_in_peaks_count', 'ACC_barcode', 'ACC_tss_enrichment', 'ACC_barcode_rank', 'ACC_duplication_count', 'ACC_broader_annotation', 'ACC_pdf_values_for_tss_enrichment', 'ACC_log10_unique_fragments_in_peaks_count', 'ACC_total_fragments_count', 'ACC_log10_total_fragments_count', 'ACC_nucleosome_signal', 'ACC_cisTopic_nr_frag', 'ACC_Doublet_scores_fragments', 'ACC_Predicted_doublets_fragments', 'ACC_pycisTopic_leiden_10_0.6', 'ACC_pycisTopic_leiden_10_1.2', 'ACC_pycisTopic_leiden_10_3'
    dr_cell:'GEX_X_pca', 'GEX_X_umap', 'ACC_UMAP', 'ACC_tSNE'
>>> print(scplus_obj.metadata_regions['Chromosome'].unique())
['chr5' 'chr4' 'chr8' 'chr9' 'chr6' 'chr2' 'chr14' 'chr3' 'chr17' 'chr19'
 'chr15' 'chr11' 'chr18' 'chr21' 'chr7' 'chr1' 'chr16' 'chr13' 'chr10'
 'chr20' 'chr22' 'chr12' 'chrX' 'chrY']

vcleon88 commented 4 weeks ago

sorry i am facing the same problem. i just wondering how to fix it?

yojetsharma commented 4 weeks ago

sorry i am facing the same problem. i just wondering how to fix it?

I solved it by following the pycistopic tutorial where chromsizes file is generated:

chromsizes = pd.read_table(os.path.join(out_dir, "qc", "hg38.chrom_sizes_and_alias.tsv"))
chromsizes
chromsizes.rename({"# ucsc": "Chromosome", "length": "End"}, axis = 1, inplace = True)
chromsizes["Start"] = 0
chromsizes = pr.PyRanges(chromsizes[["Chromosome", "Start", "End"]])

pr_annotation = pd.read_table(
        os.path.join(out_dir, "qc", "tss.bed")
    ).rename(
        {"Name": "Gene", "# Chromosome": "Chromosome"}, axis = 1)
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
genome_annotation = pr.PyRanges(pr_annotation)

You may want to use pandas while generating genome annotation file and save it in the /outs folder.

JinKyu-Cheong commented 4 weeks ago

after generating genome_

sorry i am facing the same problem. i just wondering how to fix it?

I solved it by following the pycistopic tutorial where chromsizes file is generated:
chromsizes = pd.read_table(os.path.join(out_dir, "qc", "hg38.chrom_sizes_and_alias.tsv"))
chromsizes
chromsizes.rename({"# ucsc": "Chromosome", "length": "End"}, axis = 1, inplace = True)
chromsizes["Start"] = 0
chromsizes = pr.PyRanges(chromsizes[["Chromosome", "Start", "End"]])

pr_annotation = pd.read_table(
        os.path.join(out_dir, "qc", "tss.bed")
    ).rename(
        {"Name": "Gene", "# Chromosome": "Chromosome"}, axis = 1)
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
genome_annotation = pr.PyRanges(pr_annotation)
You may want to use pandas while generating genome annotation file and save it in the /outs folder.

After generating chromosize and annotation.tsv manually, how did you modify the config and snakemake file? if i rerun the snakemake, these files are removed and the pipeline tries to download these files again, which leads to an error. how can you skip this download step?

yojetsharma commented 4 weeks ago

I didn't have to modify those line of codes (in the gene_search_space.py file) but you can try this suggestion in https://github.com/aertslab/scenicplus/issues/357#issuecomment-2064501829 and see if it helps.
They might be getting removed likely because of the format of either the chromsizes or genome_annotation file?

PhrenoVermouth commented 3 weeks ago

after generating genome_
sorry i am facing the same problem. i just wondering how to fix it?

I solved it by following the pycistopic tutorial where chromsizes file is generated:
chromsizes = pd.read_table(os.path.join(out_dir, "qc", "hg38.chrom_sizes_and_alias.tsv"))
chromsizes
chromsizes.rename({"# ucsc": "Chromosome", "length": "End"}, axis = 1, inplace = True)
chromsizes["Start"] = 0
chromsizes = pr.PyRanges(chromsizes[["Chromosome", "Start", "End"]])

pr_annotation = pd.read_table(
        os.path.join(out_dir, "qc", "tss.bed")
    ).rename(
        {"Name": "Gene", "# Chromosome": "Chromosome"}, axis = 1)
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
genome_annotation = pr.PyRanges(pr_annotation)
You may want to use pandas while generating genome annotation file and save it in the /outs folder.
After generating chromosize and annotation.tsv manually, how did you modify the config and snakemake file? if i rerun the snakemake, these files are removed and the pipeline tries to download these files again, which leads to an error. how can you skip this download step?

Hey @JinKyu-Cheong @yojetsharma Since tss.bed doesn't provide any genebody length information, I don't think it is a proper solution. FYI I attached my anno file.

genome_annotation.txt

aertslab / scenicplus

Chromsize file issue, snakemake doesn't proceed #482