earthcube / earthcube_utilities

crawl and assert data-repository metadata for search

s3 tools #81

Closed valentinedwv closed 1 year ago

valentinedwv commented 1 year ago

Tools:

We might consider some environment variables... these should try to match Gleaner's.

**params**:

`path_to_source`

Methods:

**count** (method: count-path)

bucketutil count path_to_source
bucketutil count --cfg --source source

return count for {path}: 100
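A minimal sketch of the count method. The S3 listing call itself (e.g. MinIO's `list_objects`) is out of scope here, so a plain list of keys stands in for it; the `count_path` name and the prefix-based key layout are assumptions, not part of the issue:

```python
def count_path(keys, path_to_source):
    """Count objects whose key falls under the given source path.

    `keys` stands in for an S3 bucket listing; the '<stage>/<source>/'
    prefix convention is an assumption.
    """
    prefix = path_to_source.rstrip("/") + "/"
    return sum(1 for key in keys if key.startswith(prefix))

keys = [
    "summoned/sourceA/sha1.jsonld",
    "summoned/sourceA/sha2.jsonld",
    "summoned/sourceB/sha3.jsonld",
]
print(f"count for summoned/sourceA: {count_path(keys, 'summoned/sourceA')}")
```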

**stats** (method: stats)

bucketutil stats bucket
bucketutil stats --cfg
bucketutil stats bucket --source name
bucketutil stats --cfg --source name

return

stats for {s3} {bucket}
milled: (total count N)
   sourceA: n1
   sourceB: n2
summoned: (total count S)
   sourceA: s1
   sourceB: s2
...
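The per-stage, per-source tally above could be computed from a key listing alone. A sketch, assuming the gleaner-style `<stage>/<source>/<object>` key layout (the layout is an assumption; a real implementation would iterate a MinIO/boto3 listing instead of a plain list):

```python
from collections import defaultdict

def bucket_stats(keys):
    """Tally object counts per top-level stage (milled/summoned) and source.

    Assumes keys follow a '<stage>/<source>/<object>' layout; keys that
    do not match the pattern are skipped.
    """
    stats = defaultdict(lambda: defaultdict(int))
    for key in keys:
        parts = key.split("/")
        if len(parts) >= 3:
            stage, source = parts[0], parts[1]
            stats[stage][source] += 1
    return stats

keys = [
    "milled/sourceA/a.jsonld",
    "milled/sourceB/b.jsonld",
    "summoned/sourceA/c.jsonld",
]
for stage, sources in bucket_stats(keys).items():
    print(f"{stage}: (total count {sum(sources.values())})")
    for source, n in sorted(sources.items()):
        print(f"   {source}: {n}")
```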

**urls** — which URLs were downloaded and are in S3 for a source (method: listSummonedUrls)

bucketutil urls path_to_source
bucketutil urls --cfg --source source

return

SHA       URL
somesha   someURL
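A sketch of how the SHA/URL report could be assembled. Here a dict of object names to metadata stands in for a MinIO listing plus per-object stat calls; the `X-Amz-Meta-Url` header name comes from this issue, while the "object name ends in `<sha>.jsonld`" layout is an assumption:

```python
def list_summoned_urls(objects):
    """Produce (sha, url) rows for a source's summoned objects.

    `objects` maps object names to their metadata dicts, standing in
    for an S3 listing; the SHA-from-filename convention is assumed.
    """
    rows = []
    for name, meta in sorted(objects.items()):
        sha = name.rsplit("/", 1)[-1].removesuffix(".jsonld")
        rows.append((sha, meta.get("X-Amz-Meta-Url", "")))
    return rows

objects = {
    "summoned/sourceA/abc123.jsonld": {"X-Amz-Meta-Url": "http://example.org/a"},
    "summoned/sourceA/def456.jsonld": {"X-Amz-Meta-Url": "http://example.org/b"},
}
print("SHA       URL")
for sha, url in list_summoned_urls(objects):
    print(sha, " ", url)
```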

**download** — download one or more JSON-LD files (e.g. urn.jsonld) from a list of URNs, optionally taken from a missing report

bucketutil download urn [..urn]

return
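Gleaner-style URNs are typically colon-delimited (e.g. `urn:gleaner:summoned:sourceA:<sha>`), so downloading by URN reduces to mapping each URN onto a bucket object path. A sketch under that assumed URN shape; `parts_from_urn` and `object_path_from_urn` here are illustrative helpers, not confirmed names, and the exact URN form in this bucket is unconfirmed:

```python
def parts_from_urn(urn):
    """Split a colon-delimited URN into (stage, source, sha).

    Assumes the gleaner-style form 'urn:gleaner:<stage>:<source>:<sha>';
    raises ValueError for anything else so bad input fails loudly.
    """
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn":
        raise ValueError(f"unexpected urn format: {urn!r}")
    _, _, stage, source, sha = parts
    return stage, source, sha

def object_path_from_urn(urn):
    """Map a URN to the bucket key of its JSON-LD object (layout assumed)."""
    stage, source, sha = parts_from_urn(urn)
    return f"{stage}/{source}/{sha}.jsonld"

print(object_path_from_urn("urn:gleaner:summoned:sourceA:abc123"))
# -> summoned/sourceA/abc123.jsonld
```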

**sourceurl** — find file(s) based on the original URL, via the 'X-Amz-Meta-Url' metadata in the bucketstore

bucketutil sourceurl url

return: flagged duplicates

URL              flaggedDuplicate  SHA  Date  Path
X-Amz-Meta-Url   (true|false)      SHA  DATE  OBJECT_NAME
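Flagging duplicates is a matter of counting how many objects share the same `X-Amz-Meta-Url` value. A sketch over stand-in rows mirroring the report columns above (the tuple layout is an assumption; a real implementation would read the metadata from S3):

```python
from collections import Counter

def flag_duplicates(rows):
    """Mark rows whose X-Amz-Meta-Url value appears more than once.

    Each row is (url, sha, date, object_name), mirroring the report
    columns; returns rows extended with a flaggedDuplicate boolean.
    """
    counts = Counter(url for url, *_ in rows)
    return [
        (url, counts[url] > 1, sha, date, name)
        for url, sha, date, name in rows
    ]

rows = [
    ("http://example.org/a", "sha1", "2022-06-01", "summoned/sourceA/sha1.jsonld"),
    ("http://example.org/a", "sha2", "2022-06-02", "summoned/sourceA/sha2.jsonld"),
    ("http://example.org/b", "sha3", "2022-06-02", "summoned/sourceA/sha3.jsonld"),
]
for url, dup, sha, date, name in flag_duplicates(rows):
    print(url, dup, sha, date, name)
```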

**cull** — cull old duplicates

bucketutil cull path_to_source

If a duplicate is older than 7 days (configurable), remove it; keep only the most recent ones from the same day.
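One reading of the cull rule: for each URL keep the newest object, and return older duplicates as delete candidates only once they pass the age threshold. A sketch under that interpretation; the data shape is an assumption, and a real implementation would read the listing and `X-Amz-Meta-Url` metadata from S3:

```python
from datetime import datetime, timedelta

def cull_candidates(objects, now, max_age_days=7):
    """Pick duplicate objects to delete.

    `objects` maps an object name to (url, last_modified). For each URL
    the newest object is kept; older duplicates are returned as cull
    candidates once they are more than `max_age_days` old.
    """
    newest = {}
    for name, (url, ts) in objects.items():
        if url not in newest or ts > objects[newest[url]][1]:
            newest[url] = name
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name
        for name, (url, ts) in objects.items()
        if name != newest[url] and ts < cutoff
    )

objects = {
    "summoned/sourceA/sha1.jsonld": ("http://example.org/a", datetime(2022, 6, 1)),
    "summoned/sourceA/sha2.jsonld": ("http://example.org/a", datetime(2022, 6, 29)),
}
print(cull_candidates(objects, datetime(2022, 6, 30)))
# only the old duplicate (sha1) is a candidate; the newest copy is kept
```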

Disk usage for a path: this can be done with the MinIO client (`mc du`); a note in the documentation may be the best approach.

ylyangtw commented 1 year ago

A couple of questions:

  1. For urls, I guess we don't need to implement source-to-path for urls, since X-Amz-Meta-Url is not found under milled?
  2. For download, parts_from_urn(sha) doesn't work... the urn doesn't contain ':' in the strings. How should this method be used?