Error in download_genome_annotations: "Could not find IdList" / "Could not find DocSum"

kennethho04 commented 1 month ago

Describe the bug
I am trying to run download_genome_annotations in the Snakemake pipeline but am unable to produce the genome annotation and chromsizes TSV files.

To Reproduce
scenicplus prepare_data download_genome_annotations --species "hsapiens" --genome_annotation_out_fname genome_annotation.tsv --chromsizes_out_fname chromsizes.tsv

Error output
2024-10-05 14:19:06,167 Download gene annotation INFO Using genome: GRCh38.p14
Could not find IdList on https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=GRCh38.p14
Returning gene annotation without subestting for assembled chromosomesand converting to UCSC style. Please make sure that the chromosome namesin the returned object match with the chromosome names in the scplus_obj.Chromosome sizes will not be returned
2024-10-05 14:19:06,170 SCENIC+ INFO Chrosomome sizes was not found, please provide this information manually.
2024-10-05 14:19:06,170 SCENIC+ INFO Saving genome annotation to: genome_annotation.tsv

Expected behavior (from tutorial)
2024-03-11 15:20:03,500 Download gene annotation INFO Using genome: GRCh38.p14
2024-03-11 15:20:04,112 Download gene annotation INFO Found corresponding genome Id 51 on NCBI
2024-03-11 15:20:05,268 Download gene annotation INFO Found corresponding assembly Id 11968211 on NCBI
2024-03-11 15:20:06,251 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt
2024-03-11 15:20:37,276 Download gene annotation INFO Found following assembled molecules (chromosomes): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT
2024-03-11 15:20:37,293 Download gene annotation INFO Converting chromosomes names to UCSC style as follows:
Original  UCSC
1         chr1
2         chr2
3         chr3
4         chr4
5         chr5
6         chr6
7         chr7
8         chr8
9         chr9
10        chr10
11        chr11
12        chr12
13        chr13
14        chr14
15        chr15
16        chr16
17        chr17
18        chr18
19        chr19
20        chr20
21        chr21
22        chr22
X         chrX
Y         chrY
MT        chrM
2024-03-11 15:20:37,311 SCENIC+ INFO Saving chromosome sizes to: /staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/outs/chromsizes.tsv
2024-03-11 15:20:37,326 SCENIC+ INFO Saving genome annotation to: /staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/outs/genome_annotation.tsv
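
The two logs above differ at the NCBI lookup step: the tutorial run finds a genome Id, while the failing run reports that no IdList was found in the esearch response. For anyone who wants to inspect such a response by hand, here is a minimal sketch of checking esearch XML for an IdList. It is not how SCENIC+ does the lookup internally; the helper name extract_id_list and both sample strings are made up for illustration, and the error text is a placeholder rather than a real NCBI message.

```python
import xml.etree.ElementTree as ET


def extract_id_list(esearch_xml: str) -> list[str]:
    """Return the Ids inside <IdList>, or an empty list if the element is absent."""
    root = ET.fromstring(esearch_xml)
    id_list = root.find("IdList")
    if id_list is None:
        # This is the situation the "Could not find IdList" message describes.
        return []
    return [el.text for el in id_list.findall("Id") if el.text]


# Illustrative response containing one Id (hypothetical sample, not a captured reply).
sample_ok = (
    "<eSearchResult><Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>51</Id></IdList></eSearchResult>"
)
# Illustrative response without an IdList (placeholder error text).
sample_error = "<eSearchResult><ERROR>placeholder error text</ERROR></eSearchResult>"

print(extract_id_list(sample_ok))     # ['51']
print(extract_id_list(sample_error))  # []
```

Saving the body returned by the URL in the error message and passing it to a check like this shows whether the service returned any Ids at all.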

Version:

Python: 3.11.9
SCENIC+: 1.0a1

Additional context
I have also run snakemake --cores 10 on a cluster and got a very similar error output, except that it reads "Could not find DocSum on...":

rohitarorayyc commented 1 month ago

Running into the same issue!

SeppeDeWinter commented 1 month ago

Hi @kennethho04 and @rohitarorayyc

It looks like the chromsizes file could not be downloaded automatically. You can download it from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes (in case you are using human data and the hg38 assembly).

All the best,

Seppe

kennethho04 commented 1 month ago

Hi @SeppeDeWinter

Thanks for the help! I was able to resolve the error and move on.

Here are more detailed steps of what I did, so others can reference them (or correct me) if needed: I followed the steps described in the "Getting pseudobulk profiles from cell annotations" and "Gene activity" sections of the pycisTopic tutorial to get the chromsizes.tsv and genome_annotation.tsv files.

For chromsizes.tsv:

import pandas as pd
chromsizes = pd.read_table(
    "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
    header = None,
    names = ["Chromosome", "End"]
)
chromsizes.insert(1, "Start", 0)
chromsizes.head()
chromsizes.to_csv('chromsizes.tsv', sep='\t', index=False)
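
The UCSC hg38.chrom.sizes file also lists unplaced, random and alternate contigs, whereas the tutorial log in the issue reports only the assembled chromosomes (chr1 to chr22, chrX, chrY, chrM). If you want the table to match that list, a filter along these lines might work. This is a sketch, not part of the pycisTopic tutorial: keep_assembled and CANONICAL are hypothetical names, and the sizes in the toy table are only illustrative.

```python
import pandas as pd

# UCSC-style names of the assembled human chromosomes listed in the tutorial log.
CANONICAL = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def keep_assembled(chromsizes: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose Chromosome is one of the assembled chromosomes."""
    mask = chromsizes["Chromosome"].isin(CANONICAL)
    return chromsizes.loc[mask].reset_index(drop=True)


# Toy table with the same columns as chromsizes above (illustrative values).
toy = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1_KI270706v1_random", "chrM"],
        "Start": [0, 0, 0],
        "End": [248956422, 175055, 16569],
    }
)
filtered = keep_assembled(toy)
print(filtered["Chromosome"].tolist())  # ['chr1', 'chrM']
```

Applied to the real table, this would be chromsizes = keep_assembled(chromsizes) before the to_csv call.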

For genome_annotation.tsv:

import os

import pandas as pd
import pyranges as pr

# tss.bed is written by the pycisTopic QC workflow
pr_annotation = pd.read_table(
    os.path.join("/path/to/pycisTopic/outs", "qc", "tss.bed")
).rename(
    {"Name": "Gene", "# Chromosome": "Chromosome"}, axis=1
)
# Start/End in tss.bed describe the TSS position, so Start is reused as the TSS column
pr_annotation["Transcription_Start_Site"] = pr_annotation["Start"]
pr_annotation = pr.PyRanges(pr_annotation)
pr_annotation
pr_annotation.to_csv('genome_annotation.tsv', sep='\t')

PhrenoVermouth commented 3 weeks ago

Oh dear, I'm afraid this solution carries no gene body length information: every interval has length 1 (End = Start + 1).

+--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------+
| Chromosome               | Start     | End       | Gene       | Score      | Strand       | Transcript_type   | Transcription_Start_Site   |
| (category)               | (int32)   | (int32)   | (object)   | (object)   | (category)   | (object)          | (int64)                    |
|--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------|
| CHR_CAST_EI_MMCHR11_CTG4 | 71388827  | 71388828  | LT629147.2 | .          | -            | protein_coding    | 71388827                   |
| CHR_CAST_EI_MMCHR11_CTG4 | 71126031  | 71126032  | LT629147.3 | .          | -            | protein_coding    | 71126031                   |
| CHR_CAST_EI_MMCHR11_CTG4 | 71192095  | 71192096  | LT629147.4 | .          | -            | protein_coding    | 71192095                   |
| CHR_CAST_EI_MMCHR11_CTG4 | 71242912  | 71242913  | Nlrp1b     | .          | -            | protein_coding    | 71242912                   |
| ...                      | ...       | ...       | ...        | ...        | ...          | ...               | ...                        |
| chrY                     | 2170408   | 2170409   | Zfy2       | .          | -            | protein_coding    | 2170408                    |
| chrY                     | 2663657   | 2663658   | Sry        | .          | -            | protein_coding    | 2663657                    |
| chrY                     | 2720673   | 2720674   | H2al2b     | .          | -            | protein_coding    | 2720673                    |
| chrY                     | 2796204   | 2796205   | Gm4064     | .          | -            | protein_coding    | 2796204                    |
+--------------------------+-----------+-----------+------------+------------+--------------+-------------------+----------------------------+
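
One possible way to detect and repair this, sketched under assumptions: check whether every interval is 1 bp long, then replace Start/End with gene-level coordinates taken from a separate table while keeping the Transcription_Start_Site column. The helper names and the genes table are hypothetical; gene-level coordinates would have to come from another annotation source (for example a GTF filtered to gene features), and the gene body coordinates in the toy data below are invented for illustration.

```python
import pandas as pd


def has_only_unit_intervals(df: pd.DataFrame) -> bool:
    """True when every row spans exactly one base (End - Start == 1)."""
    return bool(((df["End"] - df["Start"]) == 1).all())


def add_gene_body(tss: pd.DataFrame, genes: pd.DataFrame) -> pd.DataFrame:
    """Swap the 1-bp TSS intervals for gene-body Start/End, matched on Chromosome and Gene."""
    merged = tss.drop(columns=["Start", "End"]).merge(
        genes[["Chromosome", "Gene", "Start", "End"]],
        on=["Chromosome", "Gene"],
        how="inner",  # genes missing from the gene-level table are dropped
    )
    front = ["Chromosome", "Start", "End"]
    rest = [c for c in merged.columns if c not in front]
    return merged[front + rest]


# TSS rows as produced by the tss.bed approach (values taken from the table above).
tss = pd.DataFrame(
    {
        "Chromosome": ["chrY", "chrY"],
        "Start": [2663657, 2170408],
        "End": [2663658, 2170409],
        "Gene": ["Sry", "Zfy2"],
        "Strand": ["-", "-"],
        "Transcription_Start_Site": [2663657, 2170408],
    }
)
# Hypothetical gene-level table; these Start values are invented.
genes = pd.DataFrame(
    {
        "Chromosome": ["chrY", "chrY"],
        "Gene": ["Sry", "Zfy2"],
        "Start": [2662470, 2106014],
        "End": [2663658, 2170409],
    }
)

print(has_only_unit_intervals(tss))  # True
fixed = add_gene_body(tss, genes)
print(has_only_unit_intervals(fixed))  # False
```

The resulting data frame can be wrapped in pr.PyRanges and written with to_csv in the same way as before.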

aertslab / scenicplus

Error in download_genome_annotations: "Could not find IdList" / "Could not find DocSum" #476