aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Mouse GRCm38 genome is more helpful #389

Closed lifei176 closed 4 months ago

lifei176 commented 4 months ago

In the "download_genome_annotations" step, "GRCm39" is the target genome for downstream analyses when the species is mouse. However, "GRCm38" (mm10) is much more widely used than "GRCm39". Could the code be changed so that "GRCm38" can be the target genome? This would help the community a lot, especially those exploring mouse data. Thank you in advance.

Raghav1881 commented 4 months ago

Go to the config.yaml and change the biomart_host to "http://nov2020.archive.ensembl.org"
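If you would rather script that change than edit the file by hand, here is a minimal Python sketch. It assumes `biomart_host` sits on its own line in config.yaml; the rest of the file layout is not shown in this thread, so the sample text and the `species` key below are hypothetical, and the helper name `set_biomart_host` is made up for illustration. It edits the raw text so comments and indentation are left alone.

```python
import re


def set_biomart_host(config_text: str, new_host: str) -> str:
    """Return config_text with the value of the biomart_host key replaced.

    Only the biomart_host line is rewritten; everything else is kept as-is.
    Raises ValueError if the key is not present.
    """
    # Capture the indentation and "biomart_host:" prefix, replace the rest of the line.
    pattern = re.compile(r"^([ \t]*biomart_host[ \t]*:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{new_host}"', config_text
    )
    if count == 0:
        raise ValueError("no biomart_host key found in config text")
    return new_text


# Hypothetical excerpt; the real config.yaml contains many more keys.
sample = (
    'species: "mmusculus"\n'
    '  biomart_host: "http://www.ensembl.org"\n'
)

updated = set_biomart_host(sample, "http://nov2020.archive.ensembl.org")
print(updated)
```

To apply it to a real file, read config.yaml into a string, pass it through the helper, and write the result back.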

SeppeDeWinter commented 4 months ago

@lifei176

Indeed, the answer of @Raghav1881 is correct.

Thank you :)

Best,

S

yahbish commented 1 month ago

Quick question on this: when I replaced the biomart URL with the archive for mm10 (nov2020), the download_genome_annotations job retrieved GRCm38.p6 as expected, but the NCBI assembly information gathered afterwards appears to be associated with mm39 (GRCm39). Will this lead to issues?

Example output:

2024-08-01 11:59:07,085 Download gene annotation INFO Using genome: GRCm38.p6

2024-08-01 11:59:07,099 Download gene annotation INFO Found corresponding genome Id 52 on NCBI

2024-08-01 11:59:07,616 Download gene annotation INFO Found corresponding assembly Id 7358741 on NCBI

2024-08-01 11:59:08,133 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt

yahbish commented 1 month ago

^ A quick check of a few loci from genome_annotations.tsv shows they match the mm10 annotation, so I feel it may be fine, but I just want to make sure nothing downstream is complicated by the mismatch.
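For anyone wanting to catch this automatically, here is a rough Python sketch that pulls the two assembly names out of log text like the output above and compares them. The function name `assembly_names` is hypothetical and the regular expressions are assumptions based only on the log lines pasted in this thread. It only flags the mismatch; it does not tell you whether downstream steps are affected.

```python
import re


def assembly_names(log_text: str) -> tuple[str, str]:
    """Return (annotation assembly, NCBI report assembly) without patch suffixes."""
    # e.g. "Using genome: GRCm38.p6"
    genome = re.search(r"Using genome:\s*(\S+)", log_text)
    # e.g. ".../GCF_000001635.27_GRCm39_assembly_report.txt"
    report = re.search(
        r"GC[AF]_\d+\.\d+_([^/_]+)_assembly_report\.txt", log_text
    )
    if genome is None or report is None:
        raise ValueError("expected log lines not found")
    # Drop patch levels such as ".p6" so GRCm38.p6 compares as GRCm38.
    return genome.group(1).split(".")[0], report.group(1).split(".")[0]


# Log lines copied from the example output in this thread.
log = """\
2024-08-01 11:59:07,085 Download gene annotation INFO Using genome: GRCm38.p6
2024-08-01 11:59:08,133 Download gene annotation INFO Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt
"""

annotation, ncbi_report = assembly_names(log)
if annotation != ncbi_report:
    print(f"Assembly mismatch: annotation is {annotation}, NCBI report is {ncbi_report}")
```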