AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Bad metadata URLs in gtt-check-or-setup-GTDB-files #81

Closed by jmtsuji 8 months ago

jmtsuji commented 8 months ago

Hello @AstrobioMike, I found what seems to be an easy-to-solve issue with the URLs GToTree uses for GTDB metadata. See below:

System environment

Mac OS, GToTree version 1.8.2

Problem description

When gtt-check-or-setup-GTDB-files is run in a fresh install of GToTree, the following HTTP 404 error occurs:

$ gtt-check-or-setup-GTDB-files

  Downloading and parsing archaeal and bacterial metadata tables from
  GTDB (only needs to be done once)...

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/bin/gtt-check-or-setup-GTDB-files.backup", line 161, in <module>
    main()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/bin/gtt-check-or-setup-GTDB-files.backup", line 31, in main
    check_and_or_get_gtdb_files(os.environ["GTDB_dir"])
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/bin/gtt-check-or-setup-GTDB-files.backup", line 157, in check_and_or_get_gtdb_files
    gen_gtdb_tab(GTDB_dir)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/bin/gtt-check-or-setup-GTDB-files.backup", line 92, in gen_gtdb_tab
    arc_tar_gz = urllib.request.urlopen("https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz")
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*args)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/gtotree_1.8.2/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Proposed solution

The URLs in the gen_gtdb_tab function of gtt-check-or-setup-GTDB-files no longer seem to match the URLs of the metadata files in the latest release of the GTDB (r214).

Current URLs used in GToTree

https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz
https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz

Actual URLs in GTDB release 214

https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tsv.gz
https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz

Changing tar to tsv in both URLs and then re-running gtt-check-or-setup-GTDB-files worked on my end.
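For anyone who wants to confirm which filenames currently resolve before editing the script, a minimal sketch of a probe is below. It is not part of GToTree: url_exists is a hypothetical helper, and it assumes the GTDB server answers HEAD requests the same way it answers GET requests. The opener parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def url_exists(url, opener=urllib.request.urlopen):
    """Return True if a HEAD request for `url` succeeds, False on HTTP 404.

    Hypothetical helper for illustration; any other HTTP error is re-raised
    so that server-side problems are not mistaken for a missing file.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Calling url_exists on each of the four URLs listed above should show which suffix the current release uses, without downloading the full metadata files.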

Revised code:

    # getting archaea
    arc_tsv_gz = urllib.request.urlopen("https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tsv.gz")
    arc_tab = pd.read_csv(arc_tsv_gz, sep="\t", compression="gzip", on_bad_lines = 'skip', header=0, low_memory=False)
    arc_tab.rename(columns={arc_tab.columns[0]:"accession"}, inplace=True)
    arc_tab.dropna(inplace=True, how="all")

    # getting bacteria
    bac_tsv_gz = urllib.request.urlopen("https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz")
    bac_tab = pd.read_csv(bac_tsv_gz, sep="\t", compression="gzip", on_bad_lines = 'skip', header=0, low_memory=False)
    bac_tab.rename(columns={bac_tab.columns[0]:"accession"}, inplace=True)
    bac_tab.dropna(inplace=True, how="all")

gtt-test.sh finishes without errors after downloading the GTDB metadata via the revised URLs above.
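If the filename suffix changes again in a future GTDB release, one possible way to make the download step more tolerant is to try several candidate URLs in order and use the first one that does not return 404. The sketch below is only an illustration of that idea, not the fix that was applied in GToTree; open_first_available and its opener parameter are hypothetical names.

```python
import urllib.error
import urllib.request


def open_first_available(urls, opener=urllib.request.urlopen):
    """Return (url, response) for the first URL that does not answer 404.

    Hypothetical helper for illustration. HTTP errors other than 404 are
    re-raised; if every candidate is missing, FileNotFoundError is raised.
    """
    last_err = None
    for url in urls:
        try:
            return url, opener(url)
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise
            last_err = err
    raise FileNotFoundError(f"none of the candidate URLs exist: {urls}") from last_err
```

With the archaeal table, the candidate list would be the ar53_metadata.tsv.gz URL followed by the ar53_metadata.tar.gz URL. Note that the two formats would still need different parsing afterwards, so the returned URL should be checked before handing the response to pd.read_csv.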

Final comments

Thanks for all your work on GToTree! It's an extremely helpful package!

AstrobioMike commented 8 months ago

Beautiful! Thanks so much for notifying me about the change and fix! Things are updated as of v1.8.3 :)

jmtsuji commented 8 months ago

Excellent, thanks!