ffdev-info / pronom-release-tools

Tools for working with PRONOM releases
https://ffdev-info.github.io/pronom-release-tools/
GNU General Public License v3.0

Download is failing because of an incorrect assumption re: container signature #10

Open ross-spencer opened 4 months ago

ross-spencer commented 4 months ago

The container signature file cannot be downloaded from PRONOM because the heuristic is wrong. It could be a UTC issue, as we're trying to download a file dated May 1 rather than April 30, but the current container signature is dated later anyway.

What we find: https://cdn.nationalarchives.gov.uk/documents/container-signature-20240501.xml

What we need: https://cdn.nationalarchives.gov.uk/documents/container-signature-20240430.xml

2024-05-09 07:28:39 INFO :: pronom_xml_export.py:67:download_and_save_puid() :: https://www.nationalarchives.gov.uk/PRONOM/x-fmt/455.xml
Traceback (most recent call last):
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_summary/pronom_summary.py", line 39, in summarize_container_xml
    tree = etree.parse(pronom_container_xml)
  File "/usr/lib/python3.10/xml/etree/ElementTree.py", line 1222, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.10/xml/etree/ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 46, column 43

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/git/ffdev/release/venv/bin/pronom-tools", line 8, in <module>
    sys.exit(main())
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_tools/pronom_tools.py", line 523, in main
    asyncio.run(pronom_tools())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_tools/pronom_tools.py", line 498, in pronom_tools
    res = await get_summary(args.clean_and_summary)
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_tools/pronom_tools.py", line 351, in get_summary
    res = await parse_pronom("pronom-export", sig_file)
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_summary/pronom_summary.py", line 149, in parse_pronom
    container_summary = summarize_container_xml(container_signature)
  File "/root/git/ffdev/release/venv/lib/python3.10/site-packages/pronom_summary/pronom_summary.py", line 41, in summarize_container_xml
    raise PRONOMException(f"cannot parse xml: {pronom_container_xml}") from err
pronom_summary.pronom_summary.PRONOMException: cannot parse xml: container-signature-20240501.xml
(venv) root@localhost:~/git/ffdev/release# curl -I https://cdn.nationalarchives.gov.uk/documents/container-signature-20240430.xml

Unfortunately this is the only thing that prevented the automatic update yesterday :(
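One possible way around the date mismatch (20240501 guessed, 20240430 needed) is to generate several candidate URLs, walking back from the guessed date, instead of committing to a single one. This is only a sketch and not the fix adopted in this repository: the URL pattern is inferred from the two URLs quoted above, and `candidate_container_urls` and `days_back` are hypothetical names. It would also not help if the published date is later than the guess.

```python
from datetime import date, timedelta

# Assumption: pattern inferred from the two URLs quoted in this issue.
BASE = (
    "https://cdn.nationalarchives.gov.uk/documents/"
    "container-signature-{:%Y%m%d}.xml"
)


def candidate_container_urls(start: date, days_back: int = 7) -> list[str]:
    """Return URLs for start, start - 1 day, ..., start - days_back (newest first)."""
    return [BASE.format(start - timedelta(days=n)) for n in range(days_back + 1)]
```

For example, `candidate_container_urls(date(2024, 5, 1), days_back=2)` yields the 20240501, 20240430 and 20240429 URLs in that order, so the caller can try each in turn.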

NB: as we fix this, move the container signature download to the front of the queue, before downloading anything else, as we don't need to download all of PRONOM just to fail.
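A minimal sketch of that fail-fast ordering: check that a candidate container signature downloads and parses as XML before any PUID records are fetched. The helper `first_well_formed` and the injected `fetch` callable are hypothetical and not part of `pronom_tools`; the real download code is async and would need adapting.

```python
import xml.etree.ElementTree as etree
from typing import Callable, Iterable


def first_well_formed(
    urls: Iterable[str], fetch: Callable[[str], bytes]
) -> tuple[str, bytes]:
    """Return (url, body) for the first URL whose body parses as XML.

    `fetch` is any callable that returns the response body as bytes and
    raises OSError (e.g. urllib.error.URLError) on a network failure.
    """
    for url in urls:
        try:
            body = fetch(url)
            etree.fromstring(body)  # raises ParseError if not well-formed
        except (OSError, etree.ParseError):
            continue  # try the next candidate
        return url, body
    raise RuntimeError("no well-formed container signature found")
```

Calling this at the start of the run, and aborting if it raises, means a bad container signature guess stops the run before the rest of PRONOM is downloaded.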

ross-spencer commented 4 months ago

Tyler's screenshot shows the same issue:

[screenshot]

I guess this doesn't impact DROID. Will need to find another workaround.