Closed lifei176 closed 6 months ago
Go to the config.yaml and change the biomart_host to "http://nov2020.archive.ensembl.org"
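If you want to script this change rather than edit the file by hand, here is a minimal Python sketch. It assumes `biomart_host` sits on its own line in config.yaml as a plain `key: value` pair. The helper name `set_biomart_host` is hypothetical and not part of SCENIC+.

```python
import re


def set_biomart_host(config_text: str, host: str) -> str:
    """Return config_text with the value of the biomart_host key replaced.

    Assumption: the key appears on its own line as `biomart_host: <value>`.
    Indentation is preserved; everything after the key on that line is replaced.
    """
    pattern = re.compile(r"^([ \t]*biomart_host:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids any backslash escaping issues in `host`.
    new_text, n_replaced = pattern.subn(
        lambda m: f'{m.group(1)}"{host}"', config_text
    )
    if n_replaced == 0:
        raise KeyError("biomart_host not found in config text")
    return new_text
```

Read config.yaml into a string, pass it through `set_biomart_host(text, "http://nov2020.archive.ensembl.org")`, and write the result back.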
@lifei176
Indeed, the answer of @Raghav1881 is correct.
Thank you :)
Best,
S
Quick question on this - when I replaced the biomart url with the archive for mm10 (nov2020), the download_genome_annotations job retrieved GRCm38.p6 as expected, but the NCBI assembly information gathered after appears to be associated with mm39 - will this lead to issues?
Example output:
2024-08-01 11:59:07,085 Download gene annotation INFO Using genome: GRCm38.p6
2024-08-01 11:59:07,099 Download gene annotation INFO Found corresponding genome Id 52 on NCBI
2024-08-01 11:59:07,616 Download gene annotation INFO Found corresponding assembly Id 7358741 on NCBI
2024-08-01 11:59:08,133 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt
^ A quick spot check of a few loci in genome_annotations.tsv matches the mm10 annotation, so I think it may be fine, but I want to make sure nothing downstream is complicated by the mismatch.
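To flag this kind of mismatch automatically, one option is to compare the assembly named in the "Using genome" log line with the one embedded in the NCBI assembly report URL. The sketch below is only an illustration: the log wording is copied from the output above, and the function names are hypothetical, not part of SCENIC+.

```python
import re

# Matches assembly names such as GRCm38, GRCm39 or GRCh38 (patch suffix ignored).
ASSEMBLY_RE = re.compile(r"GRC[a-z]\d+")


def _major_assembly(match):
    """Return the major assembly name found in a regex match, or None."""
    if match is None:
        return None
    found = ASSEMBLY_RE.search(match.group(1))
    return found.group(0) if found else None


def assemblies_in_log(log_text: str) -> dict:
    """Extract the Ensembl genome and the NCBI assembly named in the log."""
    ensembl = re.search(r"Using genome:\s*(\S+)", log_text)
    ncbi = re.search(r"Downloading assembly information from:\s*(\S+)", log_text)
    return {"ensembl": _major_assembly(ensembl), "ncbi": _major_assembly(ncbi)}


def is_mismatched(log_text: str) -> bool:
    """True when both assemblies were found and they differ."""
    names = assemblies_in_log(log_text)
    return None not in names.values() and names["ensembl"] != names["ncbi"]
```

Run on the log above, this reports GRCm38 for the Ensembl side and GRCm39 for the NCBI side. It only compares names; it says nothing about which output files are affected.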
Same issue here, and it is not about the BioMart version. I'm not sure whether the mismatch causes problems downstream. Any comments appreciated.
P.S. I tried hard to make it download from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_assembly_report.txt instead, but I can't figure out where it generates genome_annotation.tsv and chromsizes.tsv. Reading the source code is a real headache...
Sorry to say, @yahbish, I also checked this, and it matters a lot. First, to be clear, the genome_annotation.tsv is the mm39 version. Second, for some genes the coordinates in the 38 and 39 versions can be very different. A mismatch on the order of a million bp could make the analysis meaningless.
Hi @PhrenoVermouth and @yahbish
Yes, it is indeed important that you use the correct genome annotation file! However, in your case I'm quite sure it downloaded the correct genome annotation file. The chromsizes file might be wrong, though; I would download that one manually from: https://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/
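For the manual route, the `mm10.chrom.sizes` file in that UCSC directory is a two-column, tab-separated table (chromosome name, length in bp). Below is a minimal Python sketch for converting it into a TSV. It assumes the pipeline expects `Chromosome`, `Start` and `End` columns in chromsizes.tsv, which you should check against a file the pipeline produced. The function name is hypothetical.

```python
import csv
import io


def chrom_sizes_to_tsv(chrom_sizes_text: str) -> str:
    """Convert UCSC chrom.sizes text (name<TAB>length) into a TSV string.

    Output columns: Chromosome, Start, End, with Start set to 0 and End set
    to the chromosome length. The column layout is an assumption.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Chromosome", "Start", "End"])
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        writer.writerow([name, 0, int(length)])
    return out.getvalue()
```

Read the downloaded file into a string, pass it through `chrom_sizes_to_tsv`, and write the result to chromsizes.tsv. Chromosome names are kept exactly as UCSC provides them.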
Also, we now have a new command in pycisTopic to download this data; see:
pycistopic tss get_tss --help
Best,
Seppe
Thank you for your response! I managed to find a workaround for [issue #401](https://github.com/aertslab/scenicplus/issues/401). Although the log output still shows mm39 under INFO, the underlying content appears to be corrected.
In reviewing the source code, I noticed an odd behavior: it initially pulls the correct information from the manually specified GRCm38 (mm10) Ensembl archive, but then, when it detects that the default version in NCBI is GRCm39 (mm39), the Ensembl results aren’t written to the final data. Given my limited coding experience, I wasn't able to investigate further, but I hope this insight helps.
In the "download_genome_annotations" step, "GRCm39" is the target genome for downstream analyses when the species is mouse. However, "GRCm38" (mm10) is much more widely used than "GRCm39". Could the code be changed so that "GRCm38" can be the target genome? This would help the community a lot, especially those working with mouse data. Thank you in advance.