bio-guoda / preston

a biodiversity dataset tracker
MIT License

illegal URL when resolving content using preston cat --remote https://zenodo.org 'hash://md5/58fd5af87a78f16c995c987ea4ab390e' > document.pdf #279

Closed jhpoelen closed 4 months ago

jhpoelen commented 4 months ago
preston cat --remote https://zenodo.org 'hash://md5/58fd5af87a78f16c995c987ea4ab390e'\
 > document.pdf

produced an unexpected exception:

Caused by: java.net.URISyntaxException: Illegal character in path at index 54: https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content

expected the PDF to be retrieved instead; full stack trace:

Caused by: java.net.URISyntaxException: Illegal character in path at index 54: https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content
    at java.net.URI$Parser.fail(URI.java:2847)
    at java.net.URI$Parser.checkChars(URI.java:3020)
    at java.net.URI$Parser.parseHierarchical(URI.java:3104)
    at java.net.URI$Parser.parse(URI.java:3052)
    at java.net.URI.<init>(URI.java:588)
    at java.net.URI.create(URI.java:850)
    ... 33 more
java.lang.RuntimeException: java.io.IOException: problem retrieving [hash://md5/58fd5af87a78f16c995c987ea4ab390e]
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:52)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:44)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:99)
    at bio.guoda.preston.Preston.main(Preston.java:90)
Caused by: java.io.IOException: problem retrieving [hash://md5/58fd5af87a78f16c995c987ea4ab390e]
    at bio.guoda.preston.cmd.ContentQueryUtil.getContent(ContentQueryUtil.java:58)
    at bio.guoda.preston.cmd.ContentQueryUtil.copyContent(ContentQueryUtil.java:32)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:69)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:49)
    ... 11 more
Caused by: bio.guoda.preston.store.DereferenceException: failed to dereference [hash://md5/58fd5af87a78f16c995c987ea4ab390e]
    at bio.guoda.preston.store.AliasDereferencer.dereferenceAliasedHash(AliasDereferencer.java:94)
    at bio.guoda.preston.store.AliasDereferencer.get(AliasDereferencer.java:46)
    at bio.guoda.preston.store.AliasDereferencer.get(AliasDereferencer.java:18)
    at bio.guoda.preston.cmd.ContentQueryUtil.getContent(ContentQueryUtil.java:50)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 54: https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content
    at java.net.URI.create(URI.java:852)
    at bio.guoda.preston.store.KeyTo1LevelZenodoPath.findFirstHit(KeyTo1LevelZenodoPath.java:88)
    at bio.guoda.preston.store.KeyTo1LevelZenodoPath.toPath(KeyTo1LevelZenodoPath.java:49)
    at bio.guoda.preston.store.KeyTo1LevelZenodoBucket.toPath(KeyTo1LevelZenodoBucket.java:24)
    at bio.guoda.preston.store.KeyValueStoreWithDereferencing.get(KeyValueStoreWithDereferencing.java:26)
    at bio.guoda.preston.store.KeyValueStoreWithDereferencing.get(KeyValueStoreWithDereferencing.java:11)
    at bio.guoda.preston.store.KeyValueStoreWithValidation.get(KeyValueStoreWithValidation.java:59)
    at bio.guoda.preston.store.KeyValueStoreWithValidation.get(KeyValueStoreWithValidation.java:14)
    at bio.guoda.preston.store.KeyValueStoreStickyFailover.get(KeyValueStoreStickyFailover.java:42)
    at bio.guoda.preston.store.KeyValueStoreStickyFailover.get(KeyValueStoreStickyFailover.java:13)
    at bio.guoda.preston.store.KeyValueStoreCopying.get(KeyValueStoreCopying.java:31)
    at bio.guoda.preston.store.KeyValueStoreCopying.get(KeyValueStoreCopying.java:8)
    at bio.guoda.preston.store.BlobStoreAppendOnly.get(BlobStoreAppendOnly.java:44)
    at bio.guoda.preston.store.BlobStoreAppendOnly.get(BlobStoreAppendOnly.java:11)
    at bio.guoda.preston.store.ContentHashDereferencer.get(ContentHashDereferencer.java:22)
    ... 19 more
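
The innermost cause in the trace can be reproduced without Preston. The following is a minimal sketch, assuming only the JDK's `java.net.URI`; the class and method names (`SelfLinkCheck`, `firstIllegalIndex`) are hypothetical and not part of Preston.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not Preston code: reports where java.net.URI rejects a link.
final class SelfLinkCheck {

    /** Returns the index of the first illegal character, or -1 if the link parses as a URI. */
    static int firstIllegalIndex(String link) {
        try {
            new URI(link);
            return -1;
        } catch (URISyntaxException e) {
            // URI.create(...) wraps this same exception in an IllegalArgumentException.
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String self = "https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content";
        int index = firstIllegalIndex(self);
        System.out.println("first illegal character at index " + index);
        if (index >= 0) {
            System.out.println("character: '" + self.charAt(index) + "'");
        }
    }
}
```

For the link in the trace, the reported index points at the first space in the file name.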

jhpoelen commented 4 months ago

as seen in demo with @slint and @myrmoteras et al. at the Arcadia Sprint #1 at CERN on 12 March 2024 https://github.com/plazi/arcadia-project

jhpoelen commented 4 months ago

associated zenodo query was:

https://zenodo.org/api/records?q=files.entries.checksum:%22md5:58fd5af87a78f16c995c987ea4ab390e%22&allversions=1

with result:

{
  "hits": {
    "hits": [
      {
        "created": "2024-03-04T17:00:52.979185+00:00",
        "modified": "2024-03-04T17:00:53.809349+00:00",
        "id": 10778598,
        "conceptrecid": "10778597",
        "doi": "10.5281/zenodo.10778598",
        "conceptdoi": "10.5281/zenodo.10778597",
        "doi_url": "https://doi.org/10.5281/zenodo.10778598",
        "metadata": {
          "title": "Host selection behaviour of the parasitoid Leptopilina clavipes, in relation to survival in hosts.",
          "doi": "10.5281/zenodo.10778598",
          "publication_date": "1991",
          "description": "Uploaded by Plazi for TaxoDros. We do not have abstracts.",
          "access_right": "open",
          "creators": [
            {
              "name": "Driessen, G.",
              "affiliation": null
            },
            {
              "name": "Hemerik, L.",
              "affiliation": null
            },
            {
              "name": "Boonstra, B.",
              "affiliation": null
            }
          ],
          "keywords": [
            "Biodiversity",
            "Taxonomy",
            "fruit flies",
            "flies",
            "Animalia",
            "Arthropoda",
            "Insecta",
            "Diptera"
          ],
          "related_identifiers": [
            {
              "identifier": "https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L45510-L45519",
              "relation": "isDerivedFrom",
              "scheme": "url"
            },
            {
              "identifier": "10.5281/zenodo.10723540",
              "relation": "isDerivedFrom",
              "scheme": "doi"
            },
            {
              "identifier": "https://www.taxodros.uzh.ch",
              "relation": "isPartOf",
              "scheme": "url"
            }
          ],
          "references": [
            "Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/26a67012dde325cf2a3a058cc2f9c1b8 hash://sha256/ca86d74b318a334bddbc7c6a387a09530a083b8617718f5369ad548744c602d3 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10723540"
          ],
          "custom": {
            "dwc:class": [
              "Insecta"
            ],
            "dwc:kingdom": [
              "Animalia"
            ],
            "dwc:order": [
              "Diptera"
            ],
            "dwc:phylum": [
              "Arthropoda"
            ]
          },
          "resource_type": {
            "title": "Journal article",
            "type": "publication",
            "subtype": "article"
          },
          "journal": {
            "pages": "99-111",
            "title": "Netherlands Journal of Zoology",
            "volume": "41"
          },
          "alternate_identifiers": [
            {
              "identifier": "urn:lsid:taxodros.uzh.ch:id:driessen%20et%20al.%2C%201991"
            },
            {
              "identifier": "hash://md5/58fd5af87a78f16c995c987ea4ab390e"
            }
          ],
          "license": {
            "id": "cc-by-4.0"
          },
          "communities": [
            {
              "id": "biosyslit"
            },
            {
              "id": "taxodros"
            }
          ],
          "relations": {
            "version": [
              {
                "index": 0,
                "is_last": true,
                "parent": {
                  "pid_type": "recid",
                  "pid_value": "10778597"
                }
              }
            ]
          }
        },
        "title": "Host selection behaviour of the parasitoid Leptopilina clavipes, in relation to survival in hosts.",
        "links": {
          "self": "https://zenodo.org/api/records/10778598",
          "self_html": "https://zenodo.org/records/10778598",
          "self_doi": "https://zenodo.org/doi/10.5281/zenodo.10778598",
          "doi": "https://doi.org/10.5281/zenodo.10778598",
          "parent": "https://zenodo.org/api/records/10778597",
          "parent_html": "https://zenodo.org/records/10778597",
          "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.10778597",
          "self_iiif_manifest": "https://zenodo.org/api/iiif/record:10778598/manifest",
          "self_iiif_sequence": "https://zenodo.org/api/iiif/record:10778598/sequence/default",
          "files": "https://zenodo.org/api/records/10778598/files",
          "media_files": "https://zenodo.org/api/records/10778598/media-files",
          "archive": "https://zenodo.org/api/records/10778598/files-archive",
          "archive_media": "https://zenodo.org/api/records/10778598/media-files-archive",
          "latest": "https://zenodo.org/api/records/10778598/versions/latest",
          "latest_html": "https://zenodo.org/records/10778598/latest",
          "draft": "https://zenodo.org/api/records/10778598/draft",
          "versions": "https://zenodo.org/api/records/10778598/versions",
          "access_links": "https://zenodo.org/api/records/10778598/access/links",
          "access_users": "https://zenodo.org/api/records/10778598/access/users",
          "access_request": "https://zenodo.org/api/records/10778598/access/request",
          "access": "https://zenodo.org/api/records/10778598/access",
          "reserve_doi": "https://zenodo.org/api/records/10778598/draft/pids/doi",
          "communities": "https://zenodo.org/api/records/10778598/communities",
          "communities-suggestions": "https://zenodo.org/api/records/10778598/communities-suggestions",
          "requests": "https://zenodo.org/api/records/10778598/requests"
        },
        "updated": "2024-03-04T17:00:53.809349+00:00",
        "recid": "10778598",
        "revision": 5,
        "files": [
          {
            "id": "b147e088-36a0-4ae3-96e5-ccf7139db5b9",
            "key": "Driessen et al., 1991.pdf",
            "size": 637155,
            "checksum": "md5:58fd5af87a78f16c995c987ea4ab390e",
            "links": {
              "self": "https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content"
            }
          }
        ],
        "owners": [
          {
            "id": 7292
          }
        ],
        "status": "published",
        "stats": {
          "downloads": 0,
          "unique_downloads": 0,
          "views": 0,
          "unique_views": 0,
          "version_downloads": 0,
          "version_unique_downloads": 0,
          "version_unique_views": 0,
          "version_views": 0
        },
        "state": "done",
        "submitted": true
      }
    ],
    "total": 1
  },
  "aggregations": {
    "access_status": {
      "buckets": [
        {
          "key": "open",
          "doc_count": 1,
          "label": "Open",
          "is_selected": false
        }
      ],
      "label": "Access status"
    },
    "resource_type": {
      "buckets": [
        {
          "key": "publication",
          "doc_count": 1,
          "label": "Publication",
          "is_selected": false,
          "inner": {
            "buckets": [
              {
                "key": "publication-article",
                "doc_count": 1,
                "label": "Journal article",
                "is_selected": false
              }
            ]
          }
        }
      ],
      "label": "Resource types"
    },
    "subject": {
      "buckets": [
        {
          "key": "Animalia",
          "doc_count": 1,
          "label": "Animalia",
          "is_selected": false
        },
        {
          "key": "Arthropoda",
          "doc_count": 1,
          "label": "Arthropoda",
          "is_selected": false
        },
        {
          "key": "Biodiversity",
          "doc_count": 1,
          "label": "Biodiversity",
          "is_selected": false
        },
        {
          "key": "Diptera",
          "doc_count": 1,
          "label": "Diptera",
          "is_selected": false
        },
        {
          "key": "Insecta",
          "doc_count": 1,
          "label": "Insecta",
          "is_selected": false
        },
        {
          "key": "Taxonomy",
          "doc_count": 1,
          "label": "Taxonomy",
          "is_selected": false
        },
        {
          "key": "flies",
          "doc_count": 1,
          "label": "flies",
          "is_selected": false
        },
        {
          "key": "fruit flies",
          "doc_count": 1,
          "label": "fruit flies",
          "is_selected": false
        }
      ],
      "label": "Subjects"
    },
    "file_type": {
      "buckets": [
        {
          "key": "pdf",
          "doc_count": 1,
          "label": "PDF",
          "is_selected": false
        }
      ],
      "label": "File type"
    }
  },
  "links": {
    "self": "https://zenodo.org/api/records?allversions=True&page=1&q=files.entries.checksum%3A%22md5%3A58fd5af87a78f16c995c987ea4ab390e%22&size=25&sort=bestmatch"
  }
}
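
The checksum query shown above can be derived mechanically from the hash URI. This is a rough sketch of that mapping, assuming md5 hash URIs only and the query form quoted in this comment; `ZenodoChecksumQuery` and `queryUrlFor` are hypothetical names, not Preston's API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Preston code: maps an md5 hash URI to a Zenodo checksum query.
final class ZenodoChecksumQuery {

    // Matches hash URIs such as hash://md5/58fd5af87a78f16c995c987ea4ab390e
    private static final Pattern MD5_HASH_URI = Pattern.compile("hash://md5/([0-9a-f]{32})");

    static String queryUrlFor(String hashUri) {
        Matcher matcher = MD5_HASH_URI.matcher(hashUri);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("not an md5 hash URI: " + hashUri);
        }
        // %22 is the percent-encoded double quote around the checksum value.
        return "https://zenodo.org/api/records?q=files.entries.checksum:%22md5:"
                + matcher.group(1) + "%22&allversions=1";
    }

    public static void main(String[] args) {
        System.out.println(queryUrlFor("hash://md5/58fd5af87a78f16c995c987ea4ab390e"));
    }
}
```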

jhpoelen commented 4 months ago

note that the "self" link produced by Zenodo is not a valid URI because it contains unescaped whitespace.

          "self": "https://zenodo.org/api/records/10778598/files/Driessen et al., 1991.pdf/content"

jhpoelen commented 4 months ago

After "manually" URL-encoding the self URL in the Preston code, the PDF could be retrieved:

preston cat --remote https://zenodo.org 'hash://md5/58fd5af87a78f16c995c987ea4ab390e' > document.pdf
[https://zenodo.org/api/r...0e%22&all_versions=true] 100.0% of 5 kB at 0.05 MB/s completed in < 1 minute
[https://zenodo.org/api/r....%2C%201991.pdf/content] 100.0% of 622 kB at 1.10 MB/s completed in < 1 minute

Note the "%20" encoding introduced by Preston.
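
One way to produce such an encoding is to percent-encode the file key before building the content URL. The sketch below is an assumption about how this could be done, not necessarily the change made in Preston; `FileKeyEncoding`, `encodeSegment` and `contentUri` are hypothetical names, and `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Preston code: builds a valid content URI from a raw file key.
final class FileKeyEncoding {

    /** Percent-encodes a single path segment; spaces become %20 rather than '+'. */
    static String encodeSegment(String segment) {
        // URLEncoder emits '+' only for spaces (a literal '+' becomes %2B),
        // so replacing '+' afterwards is safe.
        return URLEncoder.encode(segment, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Assumes the link layout seen in the Zenodo response: /api/records/{id}/files/{key}/content */
    static URI contentUri(String recordId, String fileKey) {
        return URI.create("https://zenodo.org/api/records/" + recordId
                + "/files/" + encodeSegment(fileKey) + "/content");
    }

    public static void main(String[] args) {
        System.out.println(contentUri("10778598", "Driessen et al., 1991.pdf"));
    }
}
```

Encoding only the file key, rather than the whole link, leaves the slashes that separate path segments untouched.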

The md5 signature of the retrieved file was verified to match:

$ cat document.pdf | md5sum
58fd5af87a78f16c995c987ea4ab390e  -

as expected.
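
The same check can be done programmatically. This is a minimal sketch using the JDK's `MessageDigest`; `Md5Check`, `md5HashUri` and `matches` are hypothetical names and do not reflect how Preston validates content internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not Preston code: compares content against an md5 hash URI.
final class Md5Check {

    /** Computes the hash://md5/... identifier for the given bytes. */
    static String md5HashUri(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hashUri = new StringBuilder("hash://md5/");
            for (byte b : digest) {
                hashUri.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return hashUri.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static boolean matches(String expectedHashUri, byte[] content) {
        return expectedHashUri.equals(md5HashUri(content));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Md5Check <hash-uri> <file>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(matches(args[0], content) ? "match" : "MISMATCH");
    }
}
```

Invoked with the hash URI and `document.pdf` as arguments, it prints whether the file content matches the requested hash.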

jhpoelen commented 4 months ago

after installing preston v0.8.4 on linker.bio, the expected pdf was loaded via:

https://linker.bio/hash://md5/58fd5af87a78f16c995c987ea4ab390e
