get db metadata to check for completeness

lidiabressan commented 1 year ago

Hi,

I would like to check whether my db is complete, or how much of it is complete and which fields are missing. The db contains GRIB data covering a couple of years, with selected model levels and variables.

As far as I understand, my options are: 1) dump all metadata with arki-query --dump and then process the text (several lines for each datum, so a very long text for a db spanning a couple of years), or 2) loop over products, origins, reftimes and timeranges with arki-query --dump --summary --summary-restrict levels and check each result (shorter text, but the same procedure repeated many times).

Is there a way to get an output with only the unique metadata values, one per row?

Do you have any advice?

Thanks

Lidia

spanezz commented 1 year ago

I understand that you need to check whether, in an arkimet dataset, you have data for a whole set of products and levels for each day (or every 6 hours, or whatever the actual model output interval is). Is that understanding correct?

lidiabressan commented 1 year ago

Yes: hourly analysis and forecast, with different variables, some at the surface and some on different levels (and we go back to 2018, which is still the reference year for aq). I know the variables and their levels in advance, and the origin discriminates between analysis and forecast.

spanezz commented 1 year ago

I cannot think of any existing functionality that does something that specific out of the box, but it should be reasonably doable with a bit of Python.

This is an example script that queries a dataset at regular intervals and checks that there is some data for all intervals. It could be a good base from which to build to check your required combinations of metadata:

#!/usr/bin/python3
import argparse
import datetime
from collections import defaultdict

import arkimet as arki

class Instant:
    def __init__(self):
        self.levels = set()
        self.products = set()

class Checker:
    def __init__(self):
        self.instants = defaultdict(Instant)

    def on_metadata(self, md):
        reftime = md.to_python("reftime")["time"]
        instant = self.instants[reftime]
        try:
            instant.levels.add(md["level"])
        except KeyError:
            pass
        try:
            instant.products.add(md["product"])
        except KeyError:
            pass

    def report(self):
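        # min()/max() raise ValueError on an empty mapping, so bail out early
        if not self.instants:
            print("no data found")
            return
        begin = min(self.instants)
        until = max(self.instants)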

        cur = begin
        while cur <= until:
            try:
                instant = self.instants.get(cur)
                if instant is None:
                    print("data missing for reftime", cur)
                    continue
                if not instant.levels:
                    print("levels missing at reftime", cur)
                if not instant.products:
                    print("products missing at reftime", cur)
            finally:
                cur = cur + datetime.timedelta(hours=1)

def main():
    parser = argparse.ArgumentParser(description="check a dataset for completeness")
    parser.add_argument("dataset", action="store", help="Path to the dataset")
    args = parser.parse_args()

    checker = Checker()

    with arki.dataset.Session() as session:
        cfg = arki.dataset.read_config(args.dataset)
        with session.dataset_reader(cfg=cfg) as ds:
            ds.query_data("reftime:every 1 hour", on_metadata=checker.on_metadata)

    checker.report()

if __name__ == "__main__":
    main()
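
The script above only reports hours with no levels or no products at all. A minimal sketch of how the check could be extended to specific combinations, assuming the (reftime, product, level) triples have already been collected in on_metadata: the function name find_missing and the product and level strings below are hypothetical, chosen only for illustration. Origin could be added to the tuples in the same way to separate analysis from forecast.

```python
import datetime
from collections import defaultdict


def find_missing(observed, expected, begin, until,
                 step=datetime.timedelta(hours=1)):
    """Return (reftime, product, level) triples that are expected but absent.

    observed: iterable of (reftime, product, level) tuples from the metadata
    expected: set of (product, level) pairs that should exist at every step
    """
    seen = defaultdict(set)
    for reftime, product, level in observed:
        seen[reftime].add((product, level))

    missing = []
    cur = begin
    while cur <= until:
        # .get() avoids creating empty entries in the defaultdict
        present = seen.get(cur, set())
        for product, level in sorted(expected - present):
            missing.append((cur, product, level))
        cur += step
    return missing


# Hypothetical product/level strings, for illustration only
expected = {("t", "surface"), ("t", "level 1")}
t0 = datetime.datetime(2018, 1, 1, 0)
t1 = datetime.datetime(2018, 1, 1, 1)
observed = [
    (t0, "t", "surface"),
    (t0, "t", "level 1"),
    (t1, "t", "surface"),
]
for reftime, product, level in find_missing(observed, expected, t0, t1):
    print("missing", product, level, "at", reftime)
```

In a real run, the strings would be whatever md["product"] and md["level"] return for the dataset, so the expected set has to be written in that same form.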

lidiabressan commented 11 months ago

Thanks !

I tried to adapt it to my case, but I struggled with the documentation about the metadata and I have a couple of questions:

From various prints (print(dir(md))), I discovered:

md.to_python gives a dictionary. What are the dictionary keys for getting the values? I could not find them.
md.to_string gives a string, which however differs from the string used for queries: query strings, also in Python, follow the arkimet command line form "GRIB1,x,x,x", while the Python API returns strings like "GRIB1(00x, 00x, 00x)". Could these be used for queries too?
Are md["level"] and md.to_string("level") the same?
Can I get the values directly, or should I extract them from the dictionary? Is this documented, and where could I find it?

About the dataset:

With arki-query I can also get information about a GRIB file. Can I use this script with dataset = grib: file.grib, as on the command line? I tried but could not make it work.

thanks

spanezz commented 11 months ago

As a general pointer, which doesn't answer your questions at the moment, the existing documentation for the Metadata class in Python can be found here: https://arpa-simc.github.io/arkimet/python/arkimet.html#arkimet.Metadata

The dictionary keys for to_python are different for each metadata type (origin, level, ...) and for each style of metadata type (grib product, bufr product, ...). There is no detailed documentation of the representation as a dictionary, and print() is currently the best way to explore their layout. Some general documentation of metadata types and styles can be found at https://arpa-simc.github.io/arkimet/metadata.html
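
Since printing is the suggested way to explore these dictionaries, a small helper can make the layout easier to read. This is only a sketch: describe is a hypothetical name, and the sample dictionary is a made-up stand-in for the result of md.to_python("reftime"), where only the "time" key is taken from the script earlier in this thread.

```python
import datetime


def describe(value, indent=0):
    """Return lines listing the keys and value types of a nested dict."""
    lines = []
    pad = " " * indent
    if isinstance(value, dict):
        for key in sorted(value):
            item = value[key]
            if isinstance(item, dict):
                # Nested dictionaries are shown indented under their key
                lines.append(f"{pad}{key}:")
                lines.extend(describe(item, indent + 2))
            else:
                lines.append(f"{pad}{key}: {type(item).__name__} = {item!r}")
    else:
        lines.append(f"{pad}{type(value).__name__} = {value!r}")
    return lines


# Made-up stand-in for a to_python() result; the "style" value is invented
sample = {"time": datetime.datetime(2018, 1, 1, 0, 0), "style": "EXAMPLE"}
print("\n".join(describe(sample)))
```

Calling describe(md.to_python("level")) on a few metadata of each type should be enough to see which keys are available for that style.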

md.to_string gives the strings one sees in arki-query --yaml, which are indeed different from what one could use for queries, although there is much in common, since queries need to match the data that one sees in arki-query --yaml. The syntax of queries is documented here: https://arpa-simc.github.io/arkimet/matcher.html
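
For the simple case mentioned in the question, where "GRIB1(00x, 00x, 00x)" corresponds to "GRIB1,x,x,x", the conversion could be sketched as below. This is not part of arkimet: to_query_style is a hypothetical helper, it only handles a style name followed by plain integer fields, and whether the result is a valid expression for a given metadata type still has to be checked against the matcher documentation.

```python
import re

_STYLE_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def to_query_style(text):
    """Turn 'GRIB1(200, 002, 011)' into 'GRIB1,200,2,11'.

    Only a style name followed by plain integer fields is handled;
    anything else raises ValueError rather than guessing.
    """
    match = _STYLE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised metadata string: {text!r}")
    style, body = match.groups()
    fields = [part.strip() for part in body.split(",")]
    if not all(re.fullmatch(r"[0-9]+", field) for field in fields):
        raise ValueError(f"non-numeric field in {text!r}")
    # int() drops the zero padding used in the printed form
    return ",".join([style] + [str(int(field)) for field in fields])


print(to_query_style("GRIB1(200, 002, 011)"))
```

Fields that carry units or names are rejected on purpose, because their query form cannot be derived by stripping punctuation alone.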

md["level"] and md.to_string("level") are the same, yes

If you need values for levels, you can extract them from the dictionary returned by to_python, depending on what you need them for. Level information as stored by arkimet, like any other metadata type, is the bare minimum that arkimet needs to distinguish data in the datasets. It might not be a comprehensive description, although it tends to contain useful information.

You should be able to open a grib file as if it were a dataset by passing its path to read_config. For example:

import arkimet

with arkimet.dataset.Session() as session:
    cfg = arkimet.dataset.read_config("file.grib")
    ...

lidiabressan commented 11 months ago

One more question:

With the arki-query command line, I can query several datasets by listing them (arki-query '' dataset1 dataset2).

Can I put a list of datasets in args.dataset?

with arki.dataset.Session() as session:
    cfg = arki.dataset.read_config(args.dataset)
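
This question is not answered in the thread. One possible approach, which does not depend on whether read_config accepts a list, is to take several paths on the command line and query them one at a time with the same checker. A minimal sketch, where build_parser, query_all and the query_one callback are hypothetical names; in the real script, query_one would wrap the read_config, dataset_reader and query_data calls shown earlier.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="check datasets for completeness")
    # nargs="+" collects one or more positional arguments into a list
    parser.add_argument("datasets", nargs="+", help="Paths to the datasets")
    return parser


def query_all(paths, query_one):
    """Call query_one(path) for each dataset path, in order."""
    for path in paths:
        query_one(path)


# Explicit argument list for illustration; omit it to read sys.argv
args = build_parser().parse_args(["dataset1", "dataset2"])
visited = []
# visited.append stands in for the function that opens and queries a dataset
query_all(args.datasets, visited.append)
print(visited)
```

Because every dataset feeds the same checker, the final report covers the union of the data found in all of them.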

ARPA-SIMC / arkimet

get db metadata to check for completeness #313