Closed kennethho04 closed 1 month ago
Running into the same issue!
Hi @kennethho04 and @rohitarorayyc
It looks like the chromsizes file could not be downloaded automatically. You can download it from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes (if you are using human data and the hg38 assembly).
All the best,
Seppe
Hi @SeppeDeWinter
Thanks for the help! I was able to resolve the error and move on.
Here are the more detailed steps I took, so others can reference (or correct) them if needed. I followed the steps described in the "Getting pseudobulk profiles from cell annotations" and "Gene activity" sections of the pycisTopic tutorial to get the chromsizes.tsv and genome_annotation.tsv files.
For chromsizes.tsv:
import pandas as pd
chromsizes = pd.read_table(
"http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
header = None,
names = ["Chromosome", "End"]
)
chromsizes.insert(1, "Start", 0)
chromsizes.head()
chromsizes.to_csv('chromsizes.tsv', sep='\t', index=False)
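The hg38.chrom.sizes file also lists alternate and unplaced contigs, while the tutorial log quoted in this issue reports only the assembled chromosomes. A minimal sketch of filtering to those chromosomes before writing the file, using only the standard library. The function names and the definition of "canonical" (chr1 to chr22, chrX, chrY, chrM) are assumptions for illustration, not part of SCENIC+ or pycisTopic, and whether SCENIC+ needs this filtering is not confirmed in this thread.

```python
import csv
import io
import re

# Assumption: "canonical" means chr1-chr22, chrX, chrY and chrM (UCSC naming).
CANONICAL = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def canonical_chromsizes(chrom_sizes_text):
    """Parse UCSC chrom.sizes text into (Chromosome, Start, End) rows,
    keeping only canonical chromosomes."""
    rows = []
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue
        name, size = line.split("\t")[:2]
        if CANONICAL.match(name):
            rows.append((name, 0, int(size)))
    return rows


def to_tsv(rows):
    """Render rows as tab-separated text with the header used in this thread."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Chromosome", "Start", "End"])
    writer.writerows(rows)
    return buffer.getvalue()


# Three sample lines in chrom.sizes layout (name<TAB>size).
sample = "chr1\t248956422\nchr1_KI270706v1_random\t175055\nchrM\t16569\n"
print(to_tsv(canonical_chromsizes(sample)))
```

To apply it to the real file, read the downloaded hg38.chrom.sizes as text, pass it through `canonical_chromsizes`, and write the result of `to_tsv` to chromsizes.tsv.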
For genome_annotation.tsv:
import os
import pandas as pd
import pyranges as pr

# tss.bed is written by the pycisTopic workflow
pr_annotation = pd.read_table(
    os.path.join("/path/to/pycisTopic/outs", "qc", "tss.bed")
).rename(
    {"Name": "Gene", "# Chromosome": "Chromosome"}, axis = 1)
# The TSS position is taken from the Start column of tss.bed
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
pr_annotation = pr.PyRanges(pr_annotation)
pr_annotation
pr_annotation.to_csv('genome_annotation.tsv', sep='\t')
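Because tss.bed describes transcription start sites, the intervals written this way may be a single base wide rather than spanning the gene body. A minimal sketch of checking that on the written file, assuming it has tab-separated Start and End columns as produced above. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def interval_width_summary(tsv_text):
    """Return (total rows, rows where End - Start == 1) for an annotation TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    single_base = 0
    for row in reader:
        total += 1
        if int(row["End"]) - int(row["Start"]) == 1:
            single_base += 1
    return total, single_base


# Made-up rows: one 1-bp interval and one wider interval.
sample = (
    "Chromosome\tStart\tEnd\tGene\n"
    "chr1\t1000\t1001\tGeneA\n"
    "chr1\t5000\t7000\tGeneB\n"
)
total, single_base = interval_width_summary(sample)
print(f"{single_base} of {total} intervals are 1 bp wide")
```

For the real file, pass the contents of genome_annotation.tsv to `interval_width_summary` instead of the sample text. If every row is 1 bp wide, the file holds TSS positions only.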
Oh dear, I'm afraid there's no gene body length information in this solution: every interval has length 1 (End = Start + 1).
+--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------+
| Chromosome | Start | End | Gene | Score | Strand | Transcript_type | Transcription_Start_Site |
| (category) | (int32) | (int32) | (object) | (object) | (category) | (object) | (int64) |
|--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------|
| CHR_CAST_EI_MMCHR11_CTG4 | 71388827 | 71388828 | LT629147.2 | . | - | protein_coding | 71388827 |
| CHR_CAST_EI_MMCHR11_CTG4 | 71126031 | 71126032 | LT629147.3 | . | - | protein_coding | 71126031 |
| CHR_CAST_EI_MMCHR11_CTG4 | 71192095 | 71192096 | LT629147.4 | . | - | protein_coding | 71192095 |
| CHR_CAST_EI_MMCHR11_CTG4 | 71242912 | 71242913 | Nlrp1b | . | - | protein_coding | 71242912 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | 2170408 | 2170409 | Zfy2 | . | - | protein_coding | 2170408 |
| chrY | 2663657 | 2663658 | Sry | . | - | protein_coding | 2663657 |
| chrY | 2720673 | 2720674 | H2al2b | . | - | protein_coding | 2720673 |
| chrY | 2796204 | 2796205 | Gm4064 | . | - | protein_coding | 2796204 |
+--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------+
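One possible way to get gene body coordinates is to take gene records from a GTF annotation instead of tss.bed. The sketch below is an assumption-laden illustration, not the SCENIC+ or pycisTopic method: it parses GTF text with the standard library, converts the 1-based inclusive GTF coordinates to 0-based half-open intervals, and sets Transcription_Start_Site to the first base of the gene on the + strand and the last base on the - strand. The function name and sample records are made up, the Score and Transcript_type columns from the table above are omitted, and chromosome names still have to match the ones used in the rest of the analysis.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_rows_from_gtf(gtf_text):
    """Build one annotation row per GTF 'gene' record that has a gene_name."""
    rows = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME.search(fields[8])
        if match is None:
            continue
        start = int(fields[3]) - 1  # GTF is 1-based inclusive, so shift to 0-based
        end = int(fields[4])        # half-open end equals the inclusive GTF end
        strand = fields[6]
        # 0-based TSS: first base on the + strand, last base on the - strand
        tss = start if strand == "+" else end - 1
        rows.append({
            "Chromosome": fields[0],
            "Start": start,
            "End": end,
            "Gene": match.group(1),
            "Strand": strand,
            "Transcription_Start_Site": tss,
        })
    return rows


# Made-up GTF records: two genes and one exon (the exon is skipped).
sample_gtf = (
    'chr1\texample\tgene\t1001\t2000\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n'
    'chr1\texample\tgene\t5001\t7000\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";\n'
    'chr1\texample\texon\t5001\t5500\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";\n'
)
for row in gene_rows_from_gtf(sample_gtf):
    print(row)
```

The resulting rows can be loaded into a pandas DataFrame and written with the same to_csv call used earlier in the thread.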
Describe the bug
I am trying to run download_genome_annotations in the snakemake pipeline but am unable to produce the genome annotation and chromsizes tsv files.
To Reproduce
scenicplus prepare_data download_genome_annotations --species "hsapiens" --genome_annotation_out_fname genome_annotation.tsv --chromsizes_out_fname chromsizes.tsv
Error output
2024-10-05 14:19:06,167 Download gene annotation INFO Using genome: GRCh38.p14
Could not find IdList on https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=GRCh38.p14
Returning gene annotation without subestting for assembled chromosomesand converting to UCSC style. Please make sure that the chromosome namesin the returned object match with the chromosome names in the scplus_obj.Chromosome sizes will not be returned
2024-10-05 14:19:06,170 SCENIC+ INFO Chrosomome sizes was not found, please provide this information manually.
2024-10-05 14:19:06,170 SCENIC+ INFO Saving genome annotation to: genome_annotation.tsv
Expected behavior (from tutorial)
2024-03-11 15:20:03,500 Download gene annotation INFO Using genome: GRCh38.p14
2024-03-11 15:20:04,112 Download gene annotation INFO Found corresponding genome Id 51 on NCBI
2024-03-11 15:20:05,268 Download gene annotation INFO Found corresponding assembly Id 11968211 on NCBI
2024-03-11 15:20:06,251 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt
2024-03-11 15:20:37,276 Download gene annotation INFO Found following assembled molecules (chromosomes): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT
2024-03-11 15:20:37,293 Download gene annotation INFO Converting chromosomes names to UCSC style as follows:
Original UCSC
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
X chrX
Y chrY
MT chrM
2024-03-11 15:20:37,311 SCENIC+ INFO Saving chromosome sizes to: /staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/outs/chromsizes.tsv
2024-03-11 15:20:37,326 SCENIC+ INFO Saving genome annotation to: /staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/outs/genome_annotation.tsv
Version:
Additional context
I have also run snakemake --cores 10 on a cluster and got a very similar error output, except that it says "Could not find DocSum on...":