bio-guoda / preston-brit-2022

experimental image corpus for BRIT

thumbnail and high-quality image URLs appear not to be included #3

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

fyi @jbest

As I was working on hashing images from your colleagues in Denver (see https://github.com/bio-guoda/preston/issues/193), I noticed a bug in the image url listing selection used for compiling the BRIT image corpus.

For related fix see https://github.com/bio-guoda/preston-dbg-2022/commit/ea0ff0649c76c630938dea73eacb029beddadb84 .

It appears that only images whose URL contains resize:1250 were indexed and hashed. So, instead of indexing all of the image properties (e.g., "http://rs.tdwg.org/ac/terms/accessURI" | "http://rs.tdwg.org/ac/terms/thumbnailAccessURI" | "http://rs.tdwg.org/ac/terms/goodQualityAccessURI"), only the accessURI appears to have been selected.

Here are some of the differences in the image URLs for the first 10 images extracted.

$ diff <(zcat image-urls.tsv.new.gz | head) <(zcat image-urls.tsv.gz | head)
1,2d0
< https://bisque.cyverse.org/image_service/image/00-Bu6svkTKkNx5hdB8niSokV/resize:4000/format:jpeg
< https://bisque.cyverse.org/image_service/image/00-Bu6svkTKkNx5hdB8niSokV/thumbnail:200,200
4,5d1
< https://bisque.cyverse.org/image_service/image/00-DHTEnY6zjsfoUtwJTpbfiW/resize:4000/format:jpeg
< https://bisque.cyverse.org/image_service/image/00-DHTEnY6zjsfoUtwJTpbfiW/thumbnail:200,200
7,8d2
< https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d/resize:4000/format:jpeg
< https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d/thumbnail:200,200
10c4,10
< https://bisque.cyverse.org/image_service/image/00-SeP8n2R2TLz5RgjN8nSNLj/resize:4000/format:jpeg
---
> https://bisque.cyverse.org/image_service/image/00-SeP8n2R2TLz5RgjN8nSNLj/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-7cMoZVRZFKzJ2m9XG48KbU/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-SVyxqEkDDGPxqCor5QkAtf/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-H85ihcY8yVLA6VGhLfEVVU/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-v9QSTpZu4AfXHoeGkqHemZ/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-e5AL6bRysVXD2BrfFHtKQc/resize:1250/format:jpeg
> https://bisque.cyverse.org/image_service/image/00-2X9wj8TBTmQXecdvnEdYaY/resize:1250/format:jpeg

This may account for the lower-than-expected volume of images.
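To see at a glance which derivatives a URL listing covers, the URLs can be tallied by the first processing step that follows the image id. This is only a sketch: the function names are made up for illustration, and the path layout is assumed from the URLs in the diff above rather than from BisQue documentation.

```python
from collections import Counter

MARKER = "/image_service/image/"  # path prefix seen in the URLs above


def classify(url: str) -> str:
    """Return the first processing step after the image id, or 'original'."""
    if MARKER not in url:
        return "other"
    # Everything after the marker: "<image id>/<step>/<step>..."
    parts = url.split(MARKER, 1)[1].strip("/").split("/")
    steps = parts[1:]  # drop the image id itself
    return steps[0] if steps else "original"


def tally(urls) -> Counter:
    """Count URLs per derivative kind (e.g. 'resize:4000')."""
    return Counter(classify(u) for u in urls)


if __name__ == "__main__":
    sample = [
        "https://bisque.cyverse.org/image_service/image/00-Bu6svkTKkNx5hdB8niSokV/resize:4000/format:jpeg",
        "https://bisque.cyverse.org/image_service/image/00-Bu6svkTKkNx5hdB8niSokV/thumbnail:200,200",
        "https://bisque.cyverse.org/image_service/image/00-SeP8n2R2TLz5RgjN8nSNLj/resize:1250/format:jpeg",
    ]
    for kind, count in sorted(tally(sample).items()):
        print(kind, count)
```

A listing that only ever reports resize:1250 would point at the selection bug described above.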

Apologies for catching this late. I've fixed the issue and am happy to re-run the indexing process. And . . . if performance is the same as last time, this would take about 2 months.

Curious to hear your thoughts!

jhpoelen commented 1 year ago

image count before fix:

$ zcat image-urls.tsv.gz | wc -l
826205

image count after fix:

$ zcat image-urls.tsv.new.gz | wc -l
2478642
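A quick sanity check on these two counts: if each media record contributes three URLs (accessURI, thumbnailAccessURI, goodQualityAccessURI) and only one was selected before, the new count should be close to three times the old one. The sketch below only does that arithmetic; the one-to-three interpretation is an assumption, not something the counts prove.

```python
before = 826205   # lines in image-urls.tsv.gz
after = 2478642   # lines in image-urls.tsv.new.gz

ratio = after / before
remainder = after - 3 * before  # URLs not explained by an exact 3x factor

print(f"ratio after/before: {ratio:.4f}")
print(f"difference from exactly 3x: {remainder}")
```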

jbest commented 1 year ago

For image preservation purposes, I don't think it's critical to retrieve the thumbnailAccessURI and goodQualityAccessURI images because those are lower resolution images derived from the full resolution image. So as long as you retrieve the full resolution image, you can regenerate the lower resolutions. That said, generating those can take some time and if you're relying on the corpus as a full back up and/or method for rapid recovery if the main repository is lost, then it would be worth retrieving those.

jhpoelen commented 1 year ago

@jbest thanks for responding, and yes, I agree that the higher resolution images should have priority over thumbnails and/or lower resolution images. I notice that the URLs for the high-quality images already include some kind of pre-processing. Is there a way to get to the raw original?

jhpoelen commented 1 year ago

Would:

curl https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d \
 > image-raw.jpg 

with

$ cat image-raw.jpg | sha256sum
75c4cc43fb504af40d3d7695763f95165950000dd0a30055d0c5916c59dbed8d  -

be the unaltered original of a reformatted image at

curl https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d/resize:4000/format:jpeg\
 > image-4000.jpg

with

$ cat image-4000.jpg | sha256sum
56afdb0d084d68068d7c01e1ce92ec0f96bfeee571023c19f909eaca64b78d2f  -

?

Oh, btw - do you want your coffee and cookie back now that we realized the work was not quite completed yet?

[attached images: image-raw, image-4000]

jbest commented 1 year ago

Yes, the first URL is the unaltered image. I didn't realize the Bisque URLs we had in the image records were resized, and that probably was a big factor in how the size calculation was off. And no, the coffee and cookie are a small price to pay for working through the details of this dataset! Now that we know this, what is the next step? I think the ideal would be for the URL to be corrected in the portal so the full, unprocessed image is used; preston can then re-index and retrieve the updated images. I'd be interested to know whether it would be better to re-index on your end and then ship a drive, or whether I could start with the current corpus and run preston on my end to get the new images.

jhpoelen commented 1 year ago

@jbest if it is doable to update the high quality urls to their raw bisque locations, and push an updated dwca with these uris in it, we can re-index the entire thing at whatever location would work best for you. The neat thing about working across different locations is that you get additional peer review just by transferring the files from A to B.

But, in my mind, the first step is to figure out what to index and from where.

Can you update the brit dwca easily?

jhpoelen commented 1 year ago

Also, I think it'd be neat to repeat this exercise with another friendly herbarium collection, and perhaps even establish some kind of data review / archive protocol. But . . . I feel I am getting ahead of myself here. . .

jhpoelen commented 1 year ago

Perhaps the Belgians are interested . . . fyi @qgroom @matdillen @PietrH

jhpoelen commented 1 year ago

Note that:

curl -I https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d

produced:

HTTP/1.1 200 OK
Server: nginx/1.14.1
Date: Fri, 03 Feb 2023 18:27:36 GMT
Content-Type: application/octet-stream
Content-Length: 5115140
Last-Modified: Thu, 03 Nov 2022 00:19:17 GMT
Content-Disposition: attachment; filename="BRIT67503.jpg"
Accept-Ranges: bytes
Expires: Fri, 17 Mar 2023 18:27:36 GMT
Cache-Control: public, max-age=3628800
ETag: "63630905-4e0d04"
Accept-Ranges: bytes

whereas

curl -I https://bisque.cyverse.org/image_service/image/00-D9CTqQSfzuC553PxBwbt4d/resize:4000/format:jpeg

produced

HTTP/1.1 200 OK
Server: nginx/1.14.1
Date: Fri, 03 Feb 2023 18:28:07 GMT
Content-Type: image/jpeg
Content-Length: 4686844
Last-Modified: Wed, 09 Oct 2019 21:53:59 GMT
Content-Disposition: filename="00-D9CTqQSfzuC553PxBwbt4d.size_4000,0,BL,.jpeg.jpg"
Accept-Ranges: bytes
Expires: Fri, 17 Mar 2023 18:28:07 GMT
Cache-Control: public, max-age=3628800
ETag: "5d9e56f7-4783fc"
Accept-Ranges: bytes

The Content-Disposition header generated by the server hosting the images,

Content-Disposition: filename="00-D9CTqQSfzuC553PxBwbt4d.size_4000,0,BL,.jpeg.jpg"

persuades a browser to try to render the content in the browser.

Whereas

Content-Disposition: attachment; filename="BRIT67503.jpg"

prompts a browser to offer a file download instead of attempting to render the image in the browser.

The expectation is that, if the server configuration were changed to return:

Content-Disposition: filename="BRIT67503.jpg"

instead (note no "attachment" mentioned), a web browser would also try to render the image in place.
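The distinction between the two header values can be sketched as a small function that reads the disposition type, treating a value that starts directly with a parameter (such as filename=...) as having no explicit type and falling back to "inline". That fallback is an assumption about common browser behaviour, not a guarantee; browsers may also take the Content-Type into account, so application/octet-stream could still lead to a download.

```python
def disposition_type(header_value: str) -> str:
    """Return the disposition type of a Content-Disposition value.

    If the value has no explicit type (it starts with a parameter like
    filename="..."), assume "inline", which is how browsers commonly
    behave; this fallback is an assumption, not part of any server config.
    """
    first = header_value.split(";", 1)[0].strip().lower()
    if not first or "=" in first:
        return "inline"
    return first


if __name__ == "__main__":
    for value in (
        'attachment; filename="BRIT67503.jpg"',
        'filename="00-D9CTqQSfzuC553PxBwbt4d.size_4000,0,BL,.jpeg.jpg"',
        'filename="BRIT67503.jpg"',
    ):
        print(disposition_type(value), "<-", value)
```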

jhpoelen commented 1 year ago

Note that Bisque source code is available at https://github.com/UCSB-VRL/bisqueUCSB .

jhpoelen commented 1 year ago

Hey @jbest The cyverse folks have been quite helpful in getting me started with an alternate method to access the bisque image originals. However, for some reason, I was unable to easily find the related files in the associated sftp shared folders. I probably missed something.

I started another tracking session with estimated speeds of about 1 image / s, about 5x faster than the previous run. With about 600k bisque-hosted images in this dataset, we'd have 600k s ~ 167 hours, about a week. Not bad, right? Perhaps just in time to present something at Digital Data 2023 at ASU in June?
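The back-of-the-envelope estimate can be reproduced as follows. This is just the arithmetic, assuming a constant rate of 1 image per second and a round 600,000 images; the function names are illustrative.

```python
def eta_hours(image_count: int, images_per_second: float) -> float:
    """Hours needed to track image_count images at a constant rate."""
    return image_count / images_per_second / 3600


def eta_days(image_count: int, images_per_second: float) -> float:
    """Same estimate expressed in days."""
    return eta_hours(image_count, images_per_second) / 24


if __name__ == "__main__":
    print(f"{eta_hours(600_000, 1.0):.0f} hours")
    print(f"{eta_days(600_000, 1.0):.1f} days")
```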

However, I did find another method that appears to download bisque image blobs without doing any kind of processing.

Is there a way for you to confirm the authenticity of the attached image retrieved from bisque against your originals? Or do we have to trust bisque-hosted content? Here, a sha256 hash embedded in the DwC-A media table would be helpful.
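A check like this could be done on either end by hashing the local original and comparing it with the content id of the retrieved copy. The sketch below uses only the Python standard library; the function names and the idea of comparing against a hash://sha256/ identifier are illustrative, not an existing preston or BisQue API.

```python
import hashlib

PREFIX = "hash://sha256/"


def bytes_hash_uri(data: bytes) -> str:
    """Content id of an in-memory blob, in hash://sha256/... notation."""
    return PREFIX + hashlib.sha256(data).hexdigest()


def file_hash_uri(path: str, chunk_size: int = 1 << 20) -> str:
    """Content id of a file, read in chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return PREFIX + digest.hexdigest()


def same_content(path: str, expected_uri: str) -> bool:
    """True if the local file has exactly the expected content id."""
    return file_hash_uri(path) == expected_uri.strip().lower()
```

For example, same_content("BRIT67503.jpg", "hash://sha256/75c4cc43...") would compare a locally archived file (hypothetical path) against a hash reported in this thread; the hash is abbreviated here and would need to be given in full.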

jhpoelen commented 1 year ago

<https://bisque.cyverse.org/blob_service/00-B3BAEtVZrvdsLEXFQhKpeG> <http://purl.org/pav/hasVersion> <hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc> <urn:uuid:6fa3f369-c464-41cb-9a37-df3e323f8300>

with content retrieved from:

https://bisque.cyverse.org/blob_service/00-B3BAEtVZrvdsLEXFQhKpeG

with alternate location at:

https://linker.bio/hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc
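For readers who want to work with statements like the one above programmatically, here is a rough sketch that pulls the location and content id out of a hasVersion line and builds the alternate location. It assumes every term is an angle-bracketed IRI, as in the example; it is not a full N-Quads parser, and the function names are made up.

```python
import re

HAS_VERSION = "http://purl.org/pav/hasVersion"

# subject, predicate, object, and an optional graph label, all as <...> IRIs
_STATEMENT = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>(?:\s+<([^>]*)>)?")


def parse_has_version(line: str):
    """Return (location, content_id) for a hasVersion statement, else None."""
    m = _STATEMENT.match(line.strip())
    if m is None or m.group(2) != HAS_VERSION:
        return None
    return m.group(1), m.group(3)


def alternate_location(content_id: str, remote: str = "https://linker.bio") -> str:
    """Build a resolver URL by prefixing the content id with the remote."""
    return f"{remote}/{content_id}"
```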

Also, see the attached image, retrieved via

$ preston cat --remote https://linker.bio hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc > image.jpg
[https://linker.bio/hash:...e96ae455559095c17e75bcc] 4 MB at 1.79 MB/s completed in < 1 minute

[attached image]

jhpoelen commented 1 year ago

With associated record citing derived (processed) image locations -

preston history\
| tail -n1\
| preston cat\
| preston dwc-stream\
| grep 00-B3BAEtVZrvdsLEXFQhKpeG\
| jq .
{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/734d4cdca40b737e39ecba46b40bb3ca324bb3404170dfdb46c94102f85a9776!/multimedia.csv!/L129",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/ac/terms/Multimedia",
  "http://rs.tdwg.org/dwc/text/coreid": "6825982",
  "http://ns.adobe.com/xap/1.0/rights/UsageTerms": "CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)",
  "http://rs.tdwg.org/ac/terms/providerManagedID": "urn:uuid:b7013043-77e5-4a52-ae99-58f9003cfb11",
  "http://rs.tdwg.org/ac/terms/associatedSpecimenReference": "https://sernecportal.org/portal/collections/individual/index.php?occid=6825982",
  "http://purl.org/dc/terms/rights": "http://creativecommons.org/licenses/by-nc/3.0/",
  "http://rs.tdwg.org/ac/terms/subtype": "Photograph",
  "http://purl.org/dc/terms/identifier": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:3744/format:jpeg",
  "http://rs.tdwg.org/ac/terms/metadataLanguage": "en",
  "http://ns.adobe.com/xap/1.0/rights/Owner": "Vanderbilt University Herbarium (VDB)",
  "http://rs.tdwg.org/ac/terms/comments": null,
  "http://ns.adobe.com/xap/1.0/rights/WebStatement": null,
  "http://ns.adobe.com/xap/1.0/MetadataDate": "2018-01-16 15:22:49",
  "http://rs.tdwg.org/ac/terms/thumbnailAccessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/thumbnail:200,200",
  "http://purl.org/dc/elements/1.1/creator": null,
  "http://rs.tdwg.org/ac/terms/caption": null,
  "http://purl.org/dc/terms/type": "StillImage",
  "http://purl.org/dc/terms/format": "image/jpeg",
  "http://rs.tdwg.org/ac/terms/goodQualityAccessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:1250/format:jpeg",
  "http://rs.tdwg.org/ac/terms/accessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:3744/format:jpeg"
}
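Since the record only cites processed image_service locations, the unprocessed location has to be derived. A possible sketch: take the image id from any of the three access URIs and rewrite it to the blob_service form used earlier in this thread. The id-to-blob mapping is inferred from the examples here, not from BisQue documentation, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit

ACCESS_TERMS = (
    "http://rs.tdwg.org/ac/terms/accessURI",
    "http://rs.tdwg.org/ac/terms/thumbnailAccessURI",
    "http://rs.tdwg.org/ac/terms/goodQualityAccessURI",
)


def blob_url(image_service_url: str) -> str:
    """Map .../image_service/image/<id>/... to .../blob_service/<id>."""
    parts = urlsplit(image_service_url)
    segments = parts.path.strip("/").split("/")
    if len(segments) < 3 or segments[:2] != ["image_service", "image"]:
        raise ValueError(f"not an image_service URL: {image_service_url}")
    return f"{parts.scheme}://{parts.netloc}/blob_service/{segments[2]}"


def candidate_blob_urls(record: dict) -> list:
    """Distinct blob URLs derived from the access URIs present in a record."""
    urls = {blob_url(record[t]) for t in ACCESS_TERMS if record.get(t)}
    return sorted(urls)
```

For the record shown above, all three access URIs share one image id, so this would yield a single candidate location.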

jbest commented 1 year ago

@jhpoelen I can confirm that the file retrieved at https://bisque.cyverse.org/blob_service/00-B3BAEtVZrvdsLEXFQhKpeG is the same as the file we have archived locally (though the extension changed from "jpeg" to "jpg", not sure why):

388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc  BRIT67545.jpg

Note this is not the same file as attached above in this thread (BRIT67503).

jhpoelen commented 1 year ago

> Is the same as the file we have archived locally (though the extension changed from "jpeg" to "jpg", not sure why): 388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc BRIT67545.jpg

Good to hear that you were independently able to confirm that bisque is serving the unaltered file (aside from a filename difference)!

> Note this is not the same file as attached above in this thread (BRIT67503).

Yes, it is a different file; I just picked the first image that the new brit-bisque tracker picked up, which happened to be BRIT67545.jpg.

Apologies for the confusion.

jhpoelen commented 1 year ago

Note that

https://linker.bio/hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc

renders the image straight in the browser, no file download.

[attached image]

jhpoelen commented 1 year ago

So far,

brit-bisque$ preston ls -l tsv | grep hasVersion | pv -l > /dev/null
54.4k 0:00:06 [8.21k/s] [             <=>         

about 10% of all bisque related images have been tracked, with only about 10 unresponsive endpoints.

So, assuming the past is a predictor of the future, the estimate would be more like 2-3 weeks to get the images tracked at least once.

jhpoelen commented 1 year ago

Current status - about 260k images resolved:

$ preston ls | grep hasVersion | pv -l > /dev/null
 259k                                            ]

with an estimated 11 images missing or temporarily unavailable.

$ preston ls | grep hasVersion | grep well-known | pv -l > /dev/null
11.0 
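The two grep pipelines above can be mirrored in a few lines to get both numbers in one pass. The sketch simply repeats the same substring tests (hasVersion, well-known) on each statement line; treating a well-known match as "missing or temporarily unavailable" follows the usage above and is otherwise an assumption. The example lines in the demo are hypothetical.

```python
def count_versions(lines):
    """Return (total hasVersion statements, those matching 'well-known')."""
    total = unresolved = 0
    for line in lines:
        if "hasVersion" not in line:
            continue
        total += 1
        if "well-known" in line:
            unresolved += 1
    return total, unresolved


if __name__ == "__main__":
    # Hypothetical statements, for illustration only.
    demo = [
        "<https://example.org/a> <http://purl.org/pav/hasVersion> <hash://sha256/aa> .",
        "<https://example.org/b> <http://purl.org/pav/hasVersion> <https://example.org/.well-known/genid/x> .",
        "<https://example.org/c> <http://example.org/other> <https://example.org/d> .",
    ]
    total, unresolved = count_versions(demo)
    print(f"{total} tracked, {unresolved} unresolved")
```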

I had to expand my server storage to 10TB at about 20 EUR a month, adding about 10 EUR a month to my overhead. Trying to think of ways to use this example to get other collections access to resilient image storage while being able to switch to a more suitable storage solution when needed / possible. Ideally, an image storage migration would not affect the way the images are referenced in digital collections as published through formats like DwC-A.

jhpoelen commented 1 year ago

Here's a recently resolved image:

https://linker.bio/hash://sha256/c82c9907154408d28d9736fc80e767019805b2a0423c3c2449fb83ffb0577cb0

as retrieved via https://bisque.cyverse.org/blob_service/00-4VVhJoR9oagYt245JTCEG9 as documented in line 909 of

hash://sha256/6734845363255328f82a3a13b8371102f7099eef8a187f8564808e587cb3dae8

or

line:hash://sha256/6734845363255328f82a3a13b8371102f7099eef8a187f8564808e587cb3dae8!/L909
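A reference in this line: notation can be split into its content id and line number with a small helper, and the line can then be picked from the retrieved content. This sketch only handles the simple form shown above (not nested forms such as line:zip:...), and the function names are illustrative.

```python
import re

_LINE_REF = re.compile(
    r"^line:(?P<content>hash://sha256/[0-9a-f]{64})!/L(?P<line>\d+)$"
)


def parse_line_ref(ref: str):
    """Split 'line:hash://sha256/<hex>!/L<n>' into (content id, n)."""
    m = _LINE_REF.match(ref.strip())
    if m is None:
        raise ValueError(f"unsupported line reference: {ref}")
    return m.group("content"), int(m.group("line"))


def select_line(text: str, line_number: int) -> str:
    """Return the 1-based line from already retrieved content."""
    return text.splitlines()[line_number - 1]
```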

with dynamically generated thumbnail available at

https://linker.bio/thumbnail:hash://sha256/c82c9907154408d28d9736fc80e767019805b2a0423c3c2449fb83ffb0577cb0

jhpoelen commented 1 year ago

@jbest status update . . .

The current index reference (or "head") of the BRIT Bisque-hosted image indexing, obtained via

preston head

yielded:

hash://sha256/2e00e66fef844d8868de6bc0cc42528b16fefebe3fbc02fb28ea4633b2a7f128

for which the number of "hasVersion" statements, relating locations to their observed content ids, is now up to about 440k . . .

preston ls --anchor hash://sha256/2e00e66fef844d8868de6bc0cc42528b16fefebe3fbc02fb28ea4633b2a7f128\
 | grep hasVersion\
 | pv -l

yielded:

 441k 0:06:06 [1.20k/s] [                                 <=>                  ]

jhpoelen commented 1 year ago

Now that tracking of the BRIT images via the BisQue endpoint has completed, the brit-bisque corpus has current version:

preston head

of

hash://sha256/7749ce80ee397178683715b2441589f44e0fd657580b24d17ef7b50dc9847922

and listing the content tracked in this version and its dependencies

preston ls --anchor hash://sha256/7749ce80ee397178683715b2441589f44e0fd657580b24d17ef7b50dc9847922\
 | grep hasVersion\
 | pv -l\
 > /dev/null 

yielded:

 596k 0:13:52 [ 716 /s] [