Two problems in download_genome_annotations when using old genome assemblies (with makeshift solutions)

franciskim-yonsei commented 1 month ago

Sometimes one needs to analyze data that has been aligned against an old genome assembly. As far as I can see, there is no explicit direction in the documentation as to how the workflow should be modified in such occasions. I suspect that the correct way to go is to modify the parameter biomart_host. Please let me know if this isn't the standard solution. Anyway, two problems arise when this route is taken.

Problem 1. `ConnectionError` due to peculiarities of `pybiomart`

To reproduce Let us say that the assembly of interest is GRCm38.p6 for Mus musculus, so suppose that one edits the relevant line 57 of config.yaml as follows.

biomart_host: "https://nov2020.archive.ensembl.org"

Then at the download_genome_annotations stage of the workflow, one gets the following

Error output:

localrule download_genome_annotations:
    output: genome_annotation.tsv, chromsizes.tsv
    jobid: 8
    reason: Params have changed since last execution
    resources: tmpdir=/tmp

Traceback (most recent call last):
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connection.py", line 400, in request
    self.endheaders()
  File "/usr/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.11/http/client.py", line 1058, in _send_output
    self.send(msg)
  File "/usr/lib/python3.11/http/client.py", line 996, in send
    self.connect()
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connection.py", line 238, in connect
    self.sock = self._new_conn()
                ^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPConnection object at 0x72757301e210>: Failed to resolve 'https' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='https', port=80): Max retries exceeded with url: //nov2020.archive.ensembl.org:80/biomart/martservice?type=registry (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x72757301e210>: Failed to resolve 'https' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/envs/scenicplus3.11/bin/scenicplus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
    args.func(args)
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 169, in download_command
    download_gene_annotation_chromsizes(
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 600, in download_gene_annotation_chromsizes
    result = download_gene_annotation_and_chromsizes(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/scenicplus/data_wrangling/gene_search_space.py", line 59, in download_gene_annotation_and_chromsizes
    mart = server["ENSEMBL_MART_ENSEMBL"]
           ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/pybiomart/server.py", line 55, in __getitem__
    return self.marts[name]
           ^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/pybiomart/server.py", line 61, in marts
    self._marts = self._fetch_marts()
                  ^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/pybiomart/server.py", line 79, in _fetch_marts
    response = self.get(type='registry')
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/pybiomart/base.py", line 110, in get
    r = requests.get(self.url, params=params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/envs/scenicplus3.11/lib/python3.11/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='https', port=80): Max retries exceeded with url: //nov2020.archive.ensembl.org:80/biomart/martservice?type=registry (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x72757301e210>: Failed to resolve 'https' ([Errno -3] Temporary failure in name resolution)"))

Solution The error report suggests that URL processing within pybiomart must have gone astray. In fact, to avoid this one need only remove the 'https://' part in config.yaml.

biomart_host: "nov2020.archive.ensembl.org"

This is a simple solution, but nowhere in the documentation pages for scenicplus, pybiomart or Ensembl website could I find any indication that URL formatting can cause problems. I think it would benefit many users if the authors could provide more detailed instructions in the tutorial.

Problem 2. Fetching the wrong assembly report

To reproduce Once the above solution is applied, the download_genome_annotations step seems to proceed normally, but not without some concerning

Logs:

2024-05-22 14:26:19,434 Download gene annotation INFO     Using genome: GRCm38.p6
2024-05-22 14:26:21,223 Download gene annotation INFO     Found corresponding genome Id 52 on NCBI
2024-05-22 14:26:23,312 Download gene annotation INFO     Found corresponding assembly Id 7358741 on NCBI
2024-05-22 14:26:25,768 Download gene annotation INFO     Downloading assembly information from: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt

Expected behavior I think that if the user has specified to use the GRCm38.p6 assembly, then the assembly report that should be downloaded is the GRCm38.p6 assembly report, not the GRCm39 assembly report. Please let me know if this doesn't make a crucial difference for the subsequent analyses. But I have checked the resulting chromsizes.tsv file and verified that the chromosome size information actually coincides with that of GRCm39, not of GRCm38.p6. So I am concerned that at the current state the workflow might make inferences based on inaccurate information.

Solution I have managed to force the workflow to fetch the desired URL by replacing lines 84-139 of data_wrangling/gene_search_space.py with the following code:

ncbi_search_genome_id_response = requests.get(
    f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&term={ncbi_search_term}")
while not ncbi_search_genome_id_response.ok and ncbi_tries < _NCBI_MAX_RETRIES:
    time.sleep(0.5) #sleep to not get blocked by NCBI (max 3 requests per second)
    ncbi_search_genome_id_response = requests.get(
        f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&term={ncbi_search_term}")
    ncbi_tries = ncbi_tries + 1
if (ncbi_tries == _NCBI_MAX_RETRIES) and not ncbi_search_genome_id_response.ok:
    raise MaxNCBIRetriesReached
_IdList_element = xml_tree.fromstring(ncbi_search_genome_id_response.content) \
    .find('IdList')
if _IdList_element is None:
    raise NCBISearchNotFound(
        search_term = "IdList",
        url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&term={ncbi_search_term}")
_Id_element = _IdList_element.find('Id')
if _Id_element is None:
    raise NCBISearchNotFound(
        search_term = "Id",
        url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&term={ncbi_search_term}")
ncbi_assembly_id = _Id_element.text

I don't know this hack will work for users planning to use other genome assemblies as well. I would be grateful if the authors looked into this matter and provided optimal solutions.

Version (please complete the following information):

Python: Python 3.11.9
SCENIC+: 1.0a1
pybiomart: 0.2.0

SeppeDeWinter commented 2 weeks ago

Hi @franciskim-yonsei

Thank you for your detailed descriptions of your concerns, this is very helpful. Indeed, modifying the biomart host is the correct way to get older assemblies. This could be better documented.

I was not aware of the url formatting issue!

The second problem you faced is very valid and this should not happen.

I have added this issue to my todo list!

Best,

Seppe

davidhbrann commented 1 week ago

Can we also skip this error by specifying in advance the genome_annotation and chromsizes file paths in the output_data section of the snakemake config.yaml file (thereby avoiding running the download_genome_annotations rule of the Snakefile)?

SeppeDeWinter commented 1 week ago

Hi @davidhbrann

Yes, that's also a possibility!

All the best,

Seppe

aertslab / scenicplus