earthcube / earthcube_utilities

crawl and assert data-repository metadata for search

s3 tools #81

Closed valentinedwv closed 1 year ago

valentinedwv commented 1 year ago

Tools:

We might consider some environment variables... these should try to match Gleaner's.

**params**:

`path_to_source`

Methods:

**count** (method: count-path)

bucketutil count path_to_source
bucketutil count --cfg --source source

return count for {path}: 100
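A minimal sketch of the count method. The S3 listing call itself (e.g. MinIO's `list_objects`) is out of scope here, so a plain list of keys stands in for it; the `count_path` name and the prefix-based key layout are assumptions, not part of the issue:

```python
def count_path(keys, path_to_source):
    """Count objects whose key falls under the given source path.

    `keys` stands in for an S3 bucket listing; the '<stage>/<source>/'
    prefix convention is an assumption.
    """
    prefix = path_to_source.rstrip("/") + "/"
    return sum(1 for key in keys if key.startswith(prefix))

keys = [
    "summoned/sourceA/sha1.jsonld",
    "summoned/sourceA/sha2.jsonld",
    "summoned/sourceB/sha3.jsonld",
]
print(f"count for summoned/sourceA: {count_path(keys, 'summoned/sourceA')}")
```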

**stats** (method: stats)

bucketutil stats bucket
bucketutil stats --cfg
bucketutil stats bucket --source name
bucketutil stats --cfg --source name

return

stats for {s3} {bucket}
milled: (total count N)
   sourceA: n1
   sourceB: n2
summoned: (total count S)
   sourceA: s1
   sourceB: s2
...
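The per-stage, per-source tally above could be computed from a key listing alone. A sketch, assuming the gleaner-style `<stage>/<source>/<object>` key layout (the layout is an assumption; a real implementation would iterate a MinIO/boto3 listing instead of a plain list):

```python
from collections import defaultdict

def bucket_stats(keys):
    """Tally object counts per top-level stage (milled/summoned) and source.

    Assumes keys follow a '<stage>/<source>/<object>' layout; keys that
    do not match the pattern are skipped.
    """
    stats = defaultdict(lambda: defaultdict(int))
    for key in keys:
        parts = key.split("/")
        if len(parts) >= 3:
            stage, source = parts[0], parts[1]
            stats[stage][source] += 1
    return stats

keys = [
    "milled/sourceA/a.jsonld",
    "milled/sourceB/b.jsonld",
    "summoned/sourceA/c.jsonld",
]
for stage, sources in bucket_stats(keys).items():
    print(f"{stage}: (total count {sum(sources.values())})")
    for source, n in sorted(sources.items()):
        print(f"   {source}: {n}")
```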

**urls** — which URLs were downloaded and are in S3 for a source (method: listSummonedUrls)

bucketutil urls path_to_source
bucketutil urls --cfg --source source

return

SHA       URL
somesha   someURL
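A sketch of how the SHA/URL report could be assembled. Here a dict of object names to metadata stands in for a MinIO listing plus per-object stat calls; the `X-Amz-Meta-Url` header name comes from this issue, while the "object name ends in `<sha>.jsonld`" layout is an assumption:

```python
def list_summoned_urls(objects):
    """Produce (sha, url) rows for a source's summoned objects.

    `objects` maps object names to their metadata dicts, standing in
    for an S3 listing; the SHA-from-filename convention is assumed.
    """
    rows = []
    for name, meta in sorted(objects.items()):
        sha = name.rsplit("/", 1)[-1].removesuffix(".jsonld")
        rows.append((sha, meta.get("X-Amz-Meta-Url", "")))
    return rows

objects = {
    "summoned/sourceA/abc123.jsonld": {"X-Amz-Meta-Url": "http://example.org/a"},
    "summoned/sourceA/def456.jsonld": {"X-Amz-Meta-Url": "http://example.org/b"},
}
print("SHA       URL")
for sha, url in list_summoned_urls(objects):
    print(sha, " ", url)
```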

**download** — download one or more JSON-LD files (e.g. urn.jsonld) from a list of URNs, optionally taken from a missing report

bucketutil download urn [..urn]

return
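Gleaner-style URNs are typically colon-delimited (e.g. `urn:gleaner:summoned:sourceA:<sha>`), so downloading by URN reduces to mapping each URN onto a bucket object path. A sketch under that assumed URN shape; `parts_from_urn` and `object_path_from_urn` here are illustrative helpers, not confirmed names, and the exact URN form in this bucket is unconfirmed:

```python
def parts_from_urn(urn):
    """Split a colon-delimited URN into (stage, source, sha).

    Assumes the gleaner-style form 'urn:gleaner:<stage>:<source>:<sha>';
    raises ValueError for anything else so bad input fails loudly.
    """
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn":
        raise ValueError(f"unexpected urn format: {urn!r}")
    _, _, stage, source, sha = parts
    return stage, source, sha

def object_path_from_urn(urn):
    """Map a URN to the bucket key of its JSON-LD object (layout assumed)."""
    stage, source, sha = parts_from_urn(urn)
    return f"{stage}/{source}/{sha}.jsonld"

print(object_path_from_urn("urn:gleaner:summoned:sourceA:abc123"))
# -> summoned/sourceA/abc123.jsonld
```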

**sourceurl** — find file(s) based on the original URL, via the 'X-Amz-Meta-Url' metadata in the bucketstore

bucketutil sourceurl url

return: flagged duplicates

URL              flaggedDuplicate  SHA  Date  Path
X-Amz-Meta-Url   (true|false)      SHA  DATE  OBJECT_NAME
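Flagging duplicates is a matter of counting how many objects share the same `X-Amz-Meta-Url` value. A sketch over stand-in rows mirroring the report columns above (the tuple layout is an assumption; a real implementation would read the metadata from S3):

```python
from collections import Counter

def flag_duplicates(rows):
    """Mark rows whose X-Amz-Meta-Url value appears more than once.

    Each row is (url, sha, date, object_name), mirroring the report
    columns; returns rows extended with a flaggedDuplicate boolean.
    """
    counts = Counter(url for url, *_ in rows)
    return [
        (url, counts[url] > 1, sha, date, name)
        for url, sha, date, name in rows
    ]

rows = [
    ("http://example.org/a", "sha1", "2022-06-01", "summoned/sourceA/sha1.jsonld"),
    ("http://example.org/a", "sha2", "2022-06-02", "summoned/sourceA/sha2.jsonld"),
    ("http://example.org/b", "sha3", "2022-06-02", "summoned/sourceA/sha3.jsonld"),
]
for url, dup, sha, date, name in flag_duplicates(rows):
    print(url, dup, sha, date, name)
```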

**cull** — cull old duplicates

bucketutil cull path_to_source

If a duplicate is older than 7 days (configurable), remove it; keep only the most recent ones from the same day.
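One reading of the cull rule: for each URL keep the newest object, and return older duplicates as delete candidates only once they pass the age threshold. A sketch under that interpretation; the data shape is an assumption, and a real implementation would read the listing and `X-Amz-Meta-Url` metadata from S3:

```python
from datetime import datetime, timedelta

def cull_candidates(objects, now, max_age_days=7):
    """Pick duplicate objects to delete.

    `objects` maps an object name to (url, last_modified). For each URL
    the newest object is kept; older duplicates are returned as cull
    candidates once they are more than `max_age_days` old.
    """
    newest = {}
    for name, (url, ts) in objects.items():
        if url not in newest or ts > objects[newest[url]][1]:
            newest[url] = name
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name
        for name, (url, ts) in objects.items()
        if name != newest[url] and ts < cutoff
    )

objects = {
    "summoned/sourceA/sha1.jsonld": ("http://example.org/a", datetime(2022, 6, 1)),
    "summoned/sourceA/sha2.jsonld": ("http://example.org/a", datetime(2022, 6, 29)),
}
print(cull_candidates(objects, datetime(2022, 6, 30)))
# only the old duplicate (sha1) is a candidate; the newest copy is kept
```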

Disk usage for a path: this can be done with the MinIO client (`mc du`); a note in the documentation may be the best approach.

ylyangtw commented 1 year ago

A couple of questions:

  1. For urls, I guess we don't need to implement source-to-path for urls, since X-Amz-Meta-Url is not found under milled?
  2. For download, parts_from_urn(sha) doesn't work... the urn doesn't contain ':' in the strings. How should this method be used?