Mouse GRCm38 genome is more helpful

aertslab / scenicplus

SCENIC+ is a Python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Mouse GRCm38 genome is more helpful #389

Closed lifei176 closed 6 months ago

lifei176 commented 6 months ago

In the "download_genome_annotations" step, "GRCm39" is the target genome for downstream analyses when the species is mouse. However, "GRCm38" (mm10) is much more widely used than "GRCm39". Could the code be changed so that "GRCm38" can be the target genome? This would help the community a lot, especially those exploring mouse data. Thank you in advance.

Raghav1881 commented 6 months ago

Go to the config.yaml and change the biomart_host to "http://nov2020.archive.ensembl.org"
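For anyone who wants to script this change, here is a minimal sketch using only the Python standard library. It assumes that config.yaml contains a line of the form `biomart_host: "..."`; the helper name `set_biomart_host` and the file path in the usage comment are hypothetical and not part of SCENIC+.

```python
import re

# Ensembl archive host suggested above for GRCm38 (mm10).
ARCHIVE_HOST = "http://nov2020.archive.ensembl.org"


def set_biomart_host(config_text: str, host: str = ARCHIVE_HOST) -> str:
    """Return config_text with the value of every biomart_host key replaced.

    Hypothetical helper: it edits the text line by line and does not
    parse YAML, so indentation and all other keys are left untouched.
    """
    pattern = re.compile(r"^([ \t]*biomart_host:[ \t]*).*$", flags=re.MULTILINE)
    new_text, n_replaced = pattern.subn(
        lambda m: f'{m.group(1)}"{host}"', config_text
    )
    if n_replaced == 0:
        raise ValueError("no biomart_host key found in the config text")
    return new_text


# Usage (the path is a placeholder for your own Snakemake config):
# from pathlib import Path
# path = Path("config.yaml")
# path.write_text(set_biomart_host(path.read_text()))
```

Editing the line by hand in a text editor is equivalent; the script is only convenient when several configs need the same change.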

SeppeDeWinter commented 6 months ago

@lifei176

Indeed, the answer of @Raghav1881 is correct.

Thank you :)

Best,

yahbish commented 3 months ago

Quick question on this: when I replaced the BioMart URL with the archive for mm10 (nov2020), the download_genome_annotations job retrieved GRCm38.p6 as expected, but the NCBI assembly information gathered afterwards appears to be associated with mm39. Will this lead to issues?

Example output:

2024-08-01 11:59:07,085 Download gene annotation INFO Using genome: GRCm38.p6

2024-08-01 11:59:07,099 Download gene annotation INFO Found corresponding genome Id 52 on NCBI

2024-08-01 11:59:07,616 Download gene annotation INFO Found corresponding assembly Id 7358741 on NCBI

2024-08-01 11:59:08,133 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt
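The mismatch in this log can also be detected programmatically. The sketch below is only an illustration: it assumes the log contains a `Using genome:` line and an NCBI `..._assembly_report.txt` URL in the format shown above, and both function names are hypothetical (they are not part of SCENIC+).

```python
import re


def assembly_names_from_log(log_text: str):
    """Return (Ensembl genome name, assembly name in the NCBI report URL).

    Either element is None if the corresponding line is not found.
    """
    genome = re.search(r"Using genome:\s*(\S+)", log_text)
    # The report file name looks like GCF_000001635.27_GRCm39_assembly_report.txt
    report = re.search(
        r"GC[AF]_\d+\.\d+_([^/\s]+)_assembly_report\.txt", log_text
    )
    return (
        genome.group(1) if genome else None,
        report.group(1) if report else None,
    )


def same_major_assembly(genome: str, ncbi: str) -> bool:
    """Compare assembly names while ignoring a patch suffix such as '.p6'."""
    return genome.split(".")[0] == ncbi.split(".")[0]


# Usage (the log file name is a placeholder):
# genome, ncbi = assembly_names_from_log(open("download.log").read())
# if genome and ncbi and not same_major_assembly(genome, ncbi):
#     print(f"Assembly mismatch: Ensembl {genome} vs NCBI {ncbi}")
```

For the output above this would report GRCm38.p6 against GRCm39, which is the mismatch in question.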

yahbish commented 3 months ago

^ A quick selection of a few loci from the genome_annotations.tsv match the mm10 annotation, so I feel it may be fine, but I just want to make sure nothing downstream is complicated by the mismatch

PhrenoVermouth commented 3 weeks ago

Same issue here; it is not about the BioMart version. I'm not sure whether the mismatch causes problems downstream. Any comments appreciated.

PhrenoVermouth commented 3 weeks ago

P.S. I've tried my best to make it download from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_assembly_report.txt, but I can't figure out where it generates genome_annotation.tsv and chromsizes.tsv. Reading the source code is a real headache...

PhrenoVermouth commented 3 weeks ago

> ^ A quick selection of a few loci from the genome_annotations.tsv match the mm10 annotation, so I feel it may be fine, but I just want to make sure nothing downstream is complicated by the mismatch

Sorry to say, @yahbish, but I also checked this and it matters a lot. First, to be clear, the genome_annotation.tsv is the mm39 version. Second, for a given gene, the coordinates in the 38 and 39 versions can be very different. A mismatch on the order of a million bp could make the analysis meaningless.

SeppeDeWinter commented 1 week ago

Hi @PhrenoVermouth and @yahbish

Yes, indeed, it's important that you use the correct genome annotation file! However, in your case I'm quite sure it downloaded the correct genome annotation file. The chromsizes file might be wrong, though; I would download it manually from https://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/
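One way to check whether an existing chromsizes file matches mm10 is to compare it against a reference chromosome-size table. The following is a minimal sketch under stated assumptions: each data line starts with the chromosome name and ends with its size (this covers a two-column `name size` table as well as a `Chromosome Start End` table with a header), both files use the same chromosome naming (for example `chr1` rather than `1`), and the function and file names are hypothetical.

```python
def parse_chrom_sizes(text: str) -> dict:
    """Parse whitespace-separated lines into {chromosome: size}.

    The first field is taken as the name and the last field as the size.
    Lines whose last field is not an integer (e.g. a header) are skipped.
    """
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-1].isdigit():
            sizes[fields[0]] = int(fields[-1])
    return sizes


def mismatched_chromosomes(observed: dict, reference: dict) -> dict:
    """Return {chrom: (observed, reference)} for shared chromosomes that differ."""
    return {
        chrom: (observed[chrom], reference[chrom])
        for chrom in observed.keys() & reference.keys()
        if observed[chrom] != reference[chrom]
    }


# Usage (both file names are placeholders):
# observed = parse_chrom_sizes(open("chromsizes.tsv").read())
# reference = parse_chrom_sizes(open("mm10.chrom.sizes").read())
# print(mismatched_chromosomes(observed, reference))
```

An empty result for the shared chromosomes suggests the two files describe the same assembly; any differing sizes point to an assembly mismatch.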

Also, we now have a new command in pycisTopic to download this data; see:

pycistopic tss get_tss --help

Best,

Seppe

PhrenoVermouth commented 1 week ago

Thank you for your response! I managed to find a workaround for [issue #401](https://github.com/aertslab/scenicplus/issues/401). Although the log output still shows mm39 under INFO, the underlying content appears to be corrected.

In reviewing the source code, I noticed an odd behavior: it initially pulls the correct information from the GRCm38 Ensembl archive (manually specified), but then, when it detects the default version in NCBI as mm39, the Ensembl results aren’t output in the final data. Given my limited coding experience, I wasn't able to investigate further, but I hope this insight helps.