Open franciskim-yonsei opened 1 month ago
Hi @franciskim-yonsei
Thank you for the detailed description of your concerns; this is very helpful. Indeed, modifying the biomart host is the correct way to get older assemblies. This could be better documented.
I was not aware of the URL formatting issue!
The second problem you faced is very valid and this should not happen.
I have added this issue to my todo list!
Best,
Seppe
Can we also skip this error by specifying in advance the genome_annotation and chromsizes file paths in the output_data section of the snakemake config.yaml file (thereby avoiding running the download_genome_annotations rule of the Snakefile)?
Hi @davidhbrann
Yes, that's also a possibility!
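A minimal sketch of how one might check this before launching the workflow, assuming the output_data section has already been loaded into a plain dict. The helper name missing_precomputed_outputs is hypothetical and not part of scenicplus. Whether Snakemake then actually skips the download_genome_annotations rule also depends on its own rerun rules, so this only verifies that the files are in place:

```python
from pathlib import Path


def missing_precomputed_outputs(output_data, keys=("genome_annotation", "chromsizes")):
    """Return the config keys whose file path is unset or absent on disk.

    `output_data` is assumed to be the `output_data` section of config.yaml
    as a dict mapping key -> file path (hypothetical helper, for illustration).
    """
    missing = []
    for key in keys:
        path = output_data.get(key)
        # An empty/missing entry or a non-existent file both count as missing.
        if not path or not Path(path).is_file():
            missing.append(key)
    return missing
```

An empty list means both files exist at the configured paths; any returned key points at a file that would still have to be produced.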
All the best,
Seppe
Sometimes one needs to analyze data that has been aligned against an old genome assembly. As far as I can see, the documentation gives no explicit direction on how the workflow should be modified in such cases. I suspect that the correct way to go is to modify the parameter biomart_host. Please let me know if this isn't the standard solution. In any case, two problems arise when this route is taken.
Problem 1. ConnectionError due to peculiarities of pybiomart
To reproduce
Let us say that the assembly of interest is GRCm38.p6 for Mus musculus, and suppose that one edits the relevant line (line 57) of config.yaml as follows.
Then at the download_genome_annotations stage of the workflow, one gets the following.
Error output:
Solution
The error report suggests that URL processing within pybiomart has gone astray. In fact, to avoid this one need only remove the 'https://' part in config.yaml.
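For anyone who would rather guard against this in code than remember the rule, here is a small sketch of the same workaround. The function name strip_url_scheme and the host archive.example.org are placeholders for illustration only; nothing here is part of scenicplus or pybiomart:

```python
def strip_url_scheme(host: str) -> str:
    """Drop a leading 'https://' or 'http://' from a host string, if present."""
    host = host.strip()
    for scheme in ("https://", "http://"):
        # Compare case-insensitively, but slice the original string.
        if host.lower().startswith(scheme):
            return host[len(scheme):]
    return host


# Placeholder host, not a real archive address.
print(strip_url_scheme("https://archive.example.org"))  # archive.example.org
```

The value passed on as biomart_host would then be the scheme-free form, which is what the workaround above amounts to.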
This is a simple solution, but nowhere in the documentation pages for scenicplus, pybiomart or the Ensembl website could I find any indication that URL formatting can cause problems. I think it would benefit many users if the authors could provide more detailed instructions in the tutorial.
Problem 2. Fetching the wrong assembly report
To reproduce
Once the above solution is applied, the download_genome_annotations step seems to proceed normally, but not without some concerning output.
Logs:
Expected behavior
I think that if the user has specified the GRCm38.p6 assembly, then the assembly report that is downloaded should be the GRCm38.p6 assembly report, not the GRCm39 one. Please let me know if this doesn't make a crucial difference for the subsequent analyses. However, I have checked the resulting chromsizes.tsv file and verified that the chromosome size information coincides with that of GRCm39, not of GRCm38.p6. So I am concerned that in its current state the workflow might make inferences based on inaccurate information.
Solution
I have managed to force the workflow to fetch the desired URL by replacing lines 84-139 of data_wrangling/gene_search_space.py with the following code:
I don't know whether this hack will work for users planning to use other genome assemblies as well. I would be grateful if the authors looked into this matter and provided an optimal solution.
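The chromsizes.tsv check described above can be roughly automated. The sketch below is only a sanity check, not the fix itself. It assumes the caller has already extracted (chromosome name, length) pairs from the file, since the file's exact layout is not shown here. The chromosome 1 lengths are the values I believe are published for the two mouse assemblies and should be confirmed against the official assembly reports. All names are hypothetical.

```python
# Chromosome 1 lengths believed to correspond to each mouse assembly
# (assumption: verify against the official assembly reports before relying on it).
CHR1_LENGTH_BY_ASSEMBLY = {
    "GRCm38": 195_471_971,
    "GRCm39": 195_154_279,
}


def chr1_length(rows):
    """Return the length of chromosome 1 from (name, length) pairs, or None."""
    for name, length in rows:
        if name in ("chr1", "1"):
            return int(length)
    return None


def assemblies_matching(rows):
    """List the known assemblies whose chromosome 1 length matches the rows."""
    observed = chr1_length(rows)
    return [
        assembly
        for assembly, expected in CHR1_LENGTH_BY_ASSEMBLY.items()
        if expected == observed
    ]
```

If the configured assembly is GRCm38.p6 but assemblies_matching returns ["GRCm39"], the workflow has picked up the newer assembly report, which is the situation reported here.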
Version (please complete the following information):