ARPA-SIMC / arkimet

A set of tools to organize, archive and distribute data files.

`arki-check --unarchive` documentation #297

Closed brancomat closed 1 year ago

brancomat commented 1 year ago

There's no mention of the --unarchive option in https://arpa-simc.github.io/arkimet/datasets/archive.html (or in any other part of the doc).

The only mention I found is in the help output and in the man page of arki-check:

   --unarchive pathname  Given a pathname relative to .archive/last, move it
                         out of the archive and back to the main dataset

This is a bit misleading since by trial and error it seems that it accepts only specific filenames (no paths, no wildcards). Is this correct?

spanezz commented 1 year ago

Let's work out how it works first, then where to document it.

In theory, if you have something like datasets/lami123/.archive/last/2022/2022-12.grib, you can do this:

arki-check --unarchive 2022/2022-12.grib datasets/lami123

And this should move that segment into datasets/lami123/2022/2022-12.grib, and index it as part of the online dataset.
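As a minimal sketch of the path mapping described above (not arkimet's actual code), the relative pathname is resolved once under .archive/last and once under the dataset root. The function name `unarchive_paths` is hypothetical, and the sketch only computes paths: it does not move files or touch any index.

```python
from pathlib import Path


def unarchive_paths(dataset: str, relpath: str) -> tuple[Path, Path]:
    """Return (source, destination) for a segment named relative to .archive/last.

    Hypothetical helper for illustration only; arkimet also reindexes the
    segment, which this sketch does not attempt.
    """
    root = Path(dataset)
    rel = Path(relpath)
    # Reject names that would escape the dataset directory.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"{relpath!r} must be relative to .archive/last")
    source = root / ".archive" / "last" / rel
    destination = root / rel
    return source, destination


src, dst = unarchive_paths("datasets/lami123", "2022/2022-12.grib")
print(src.as_posix())  # datasets/lami123/.archive/last/2022/2022-12.grib
print(dst.as_posix())  # datasets/lami123/2022/2022-12.grib
```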

Does this match the behaviour you observe?

brancomat commented 1 year ago

> Does this match the behaviour you observe?

Yes. My question is whether the current implementation allows specifying more than one file (directories or wildcards).

Side note: I tried a couple of things (admittedly, not very clever) that had an unexpected side effect: lock files were created in the $dataset/$year directory (in this example: cosmo/2022). I don't know if this could be considered a bug:

$ ls cosmo/2022/ 
$ arki-check --unarchive 2022/\*.grib cosmo/
Traceback (most recent call last):
  File "/usr/bin/arki-check", line 11, in <module>
    main()
  File "/usr/bin/arki-check", line 7, in main
    sys.exit(Check.main())
  File "/usr/lib/python3.10/site-packages/arkimet/cmdline/base.py", line 83, in main
    return cmd.run()
  File "/usr/lib/python3.10/site-packages/arkimet/cmdline/check.py", line 133, in run
    arki_check.unarchive(pathname=self.args.unarchive)
RuntimeError: cannot rename /home/dbranchini@ARPA.EMR.NET/Scaricati/arkitest/cosmo/.archive/last/2022/*.grib to /home/dbranchini@ARPA.EMR.NET/Scaricati/arkitest/cosmo/2022/*.grib: No such file or directory
$ ls cosmo/2022/
'*.grib.lock'
$ arki-check --unarchive 2022/* cosmo/
Traceback (most recent call last):
  File "/usr/bin/arki-check", line 11, in <module>
    main()
  File "/usr/bin/arki-check", line 7, in main
    sys.exit(Check.main())
  File "/usr/lib/python3.10/site-packages/arkimet/cmdline/base.py", line 83, in main
    return cmd.run()
  File "/usr/lib/python3.10/site-packages/arkimet/cmdline/check.py", line 133, in run
    arki_check.unarchive(pathname=self.args.unarchive)
RuntimeError: cannot auto-detect format from file name 2022/*: file extension not recognised
$ ls cosmo/2022/
'*.grib.lock'  '*.lock'
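
In both attempts above, the wildcard reaches arki-check unexpanded and is treated as a literal file name. Until multiple segments are supported, one possible workaround is to expand the pattern under .archive/last yourself and run one `arki-check --unarchive` per segment, using only the invocation form shown earlier in this thread. The sketch below assumes that form; the helper name `unarchive_commands` is hypothetical, and it only builds the command lines without running them.

```python
import glob
import os


def unarchive_commands(dataset: str, pattern: str) -> list[list[str]]:
    """Build one arki-check command per archived segment matching `pattern`.

    `pattern` is a glob relative to <dataset>/.archive/last, e.g. "2022/*.grib".
    Hypothetical helper: it returns argument lists and executes nothing.
    """
    archive_root = os.path.join(dataset, ".archive", "last")
    matches = sorted(glob.glob(os.path.join(archive_root, pattern)))
    return [
        # Pass each segment as a path relative to .archive/last.
        ["arki-check", "--unarchive", os.path.relpath(match, archive_root), dataset]
        for match in matches
    ]


if __name__ == "__main__":
    for command in unarchive_commands("cosmo", "2022/*.grib"):
        print(" ".join(command))
```

Each returned list could then be passed to `subprocess.run(command, check=True)` after reviewing the printed commands.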

spanezz commented 1 year ago

Right, yes, I see I have work to do to make it not just work, but also be usable.

It makes sense to make it take segment names, and infer datasets from them.

I'll work on this.

spanezz commented 1 year ago

In the issue297 branch there's a version of arkimet that adds the arki-maint command. arki-maint allows subcommands, and it currently only has the unarchive subcommand, which works like this:

arki-maint unarchive dataset/.archive/last/2022-*.grib

It will look for .archive/last in each of its arguments, infer the dataset directories and the segment names from that, and do the equivalent of running arki-check on each dataset and on each segment.
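
The inference step described above can be sketched roughly as follows. This is an illustration only, not the code in the issue297 branch: the function names `split_archived_path` and `group_by_dataset` are hypothetical, and the sketch assumes POSIX-style paths that the shell has already expanded.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_archived_path(path: str) -> tuple[str, str]:
    """Split 'DATASET/.archive/last/SEGMENT' into (DATASET, SEGMENT).

    Hypothetical helper; raises ValueError if '.archive/last' is not found
    or if nothing follows it.
    """
    parts = PurePosixPath(path).parts
    for i in range(len(parts) - 1):
        if parts[i] == ".archive" and parts[i + 1] == "last":
            segment_parts = parts[i + 2:]
            if not segment_parts:
                raise ValueError(f"no segment name in {path!r}")
            # An empty prefix means the current directory is the dataset.
            dataset = str(PurePosixPath(*parts[:i])) if i > 0 else "."
            return dataset, str(PurePosixPath(*segment_parts))
    raise ValueError(f"{path!r} does not contain .archive/last")


def group_by_dataset(paths: list[str]) -> dict[str, list[str]]:
    """Group segment names by the dataset directory inferred from each path."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        dataset, segment = split_archived_path(path)
        grouped[dataset].append(segment)
    return dict(grouped)


print(split_archived_path("dataset/.archive/last/2022-12.grib"))
# ('dataset', '2022-12.grib')
```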

I did quite a bit of refactoring in the command-line parsing code to share code between normal commands and commands with subcommands, which is why I'm pushing to a separate branch and not to master.