earthcube / earthcube_utilities

crawl and assert data-repository metadata for search

Changes to Summary Queries to eliminate prov #97

Open valentinedwv opened 1 year ago

valentinedwv commented 1 year ago

There are duplicates because of differences in how the identifiers were loaded (old gleaner version):

  "sitemap_count": 650,
  "summoned_count": 1280,

The prov affects the count, too: "graph_urn_count": 2562. Just adding this to start figuring out how to handle these.
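
To make the size of the discrepancy concrete, the three counts quoted above can be compared directly. This is only a minimal Python sketch; the function name `count_gaps` is hypothetical and is not part of earthcube_utilities.

```python
def count_gaps(report):
    """Compare the sitemap, summoned, and graph counts from a summary report."""
    sitemap = report["sitemap_count"]
    summoned = report["summoned_count"]
    graph = report["graph_urn_count"]
    return {
        # Records summoned beyond what the sitemap lists (likely duplicates).
        "summoned_minus_sitemap": summoned - sitemap,
        # Graph URNs beyond the summoned records.
        "graph_minus_summoned": graph - summoned,
        # Average number of graph URNs per summoned record.
        "graph_per_summoned": round(graph / summoned, 2),
    }


# Values copied from the report in this issue.
report = {"sitemap_count": 650, "summoned_count": 1280, "graph_urn_count": 2562}
gaps = count_gaps(report)
print(gaps)
```

With these figures the graph holds roughly two URNs per summoned record. That would be consistent with one prov graph accompanying each data graph, though the report alone does not confirm it.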

{
  "source": "earthchem",
  "graph": "https://graph.geocodes-aws-dev.earthcube.org/blazegraph/namespace/test/sparql",
  "sitemap": "https://ecl.earthchem.org/sitemap.xml",
  "date": "2023-07-13",
  "bucket": "gleaner-wf",
  "s3store": "oss.geocodes-aws.earthcube.org",
  "sitemap_geturls_time": 1.1130366325378418,
  "s3_geturls_time": 4.255265474319458,
  "sitemap_count": 650,
  "summoned_count": 1280,
  "missing_sitemap_summon_count": 9,
  "missing_sitemap_summon": [
    "https://ecl.earthchem.org/view.php?id=65",
    "https://ecl.earthchem.org/view.php?id=67",
    "https://ecl.earthchem.org/view.php?id=68",
    "https://ecl.earthchem.org/view.php?id=69",
    "https://ecl.earthchem.org/view.php?id=71",
    "https://ecl.earthchem.org/view.php?id=73",
    "https://ecl.earthchem.org/view.php?id=74",
    "https://ecl.earthchem.org/view.php?id=239",
    "https://ecl.earthchem.org/view.php?id=240"
  ],
  "summon_list_s3_sha_time": 0.1945648193359375,
  "graph_sha_urn_time": 800.1752767562866,
  "graph_urn_count": 2562,
  "missing_summon_graph_count": 0,
  "missing_summon_graph": [],
  "milled_sha_time": 27.87134289741516,
  "milled_count": 1280,
  "missing_summon_milled": []
}
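
One possible way to keep prov graphs out of `graph_urn_count` would be to filter the URN list before counting. The sketch below assumes that prov graph URNs contain a `:prov:` segment; that pattern, the sample URNs, and the function name `non_prov_urns` are assumptions for illustration only, not the naming gleaner actually uses.

```python
def non_prov_urns(urns, prov_marker=":prov:"):
    """Return the URNs that do not contain the (assumed) prov marker."""
    return [u for u in urns if prov_marker not in u]


# Made-up sample URNs, for illustration only.
sample = [
    "urn:example:summoned:earthchem:aaa",
    "urn:example:prov:earthchem:aaa",
    "urn:example:summoned:earthchem:bbb",
    "urn:example:prov:earthchem:bbb",
]
kept = non_prov_urns(sample)
print(len(sample), len(kept))
```

If the real naming pattern differs, only `prov_marker` would need to change; the same filter could equally be expressed in the summary query itself.
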
valentinedwv commented 1 year ago

I think this will change with a fix to the new pattern naming.