bio-guoda / preston

a biodiversity dataset tracker
MIT License

extend BRIT image archival to include SERNEC TCN collections #212

Closed jhpoelen closed 4 months ago

jhpoelen commented 1 year ago

fyi @jbest

South East Regional Network of Expertise and Collections (SERNEC) Thematic Collection Network (TCN), a collaboration that is digitizing and making data accessible for over 3 million plant specimens.

jhpoelen commented 1 year ago

related to https://github.com/preston-brit-2022

also see https://sernecportal.org/portal/collections/index.php

jhpoelen commented 1 year ago

SERNEC RSS feed -

https://sernecportal.org/portal/content/dwca/rss.xml

jhpoelen commented 1 year ago

see prototype in development at https://github.com/bio-guoda/preston-sernec .

jhpoelen commented 1 year ago

I created https://github.com/bio-guoda/preston-sernec . This repo contains today's snapshot of SERNEC associated dwc-a.

With that, I was able to estimate the total number of records with bisque images using:

https://github.com/bio-guoda/preston-sernec/blob/main/list-image-urls.sh

and

./list-image-urls.sh | tee image-urls.tsv

along with

cat image-urls.tsv | grep bisque | wc -l

to be:

9.99M

with about 3.33M individual records, estimated via:

cat image-urls.tsv | grep bisque | grep accessURI | wc -l

Given that the image transfer rate of bisque is known to be 1 image per 5 seconds, it'll take:

10M * 5 / (3600 * 24) ≈ 578.7 days to migrate all the images.
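
The arithmetic above can be reproduced with a small shell helper. This is only a sketch: `estimate_days` is a hypothetical function name, and it assumes a single sequential download stream at a constant per-image rate, using the rounded figures quoted in this comment rather than fresh measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of preston): estimate how many days it takes
# to fetch a number of images sequentially at a fixed seconds-per-image rate.
#   $1 = number of images
#   $2 = seconds needed per image
estimate_days() {
  awk -v n="$1" -v s="$2" \
    'BEGIN { printf "%.1f\n", n * s / (3600 * 24) }'
}

# Rounded figures from this thread: ~10M images at ~5 seconds per image.
estimate_days 10000000 5
```

Plugging in a different rate (for example, one measured against a backend other than the public bisque URLs) only requires changing the second argument.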

jhpoelen commented 1 year ago

fyi @themerekat - I am curious to learn about your plans to migrate the images from Bisque Cyverse to alternate locations. I've also looped in @jbest .

themerekat commented 1 year ago

You'll want to loop Ed Gilbert and Greg Post into this conversation

jhpoelen commented 1 year ago

as @themerekat suggested -

Ed @egbot / Greg @GregPost-ASU - what are your plans to migrate the images from Cyverse before their contract expires? How are you planning to prevent this kind of situation in the (near) future?

I am assuming that image storage services will continue to come and go.

jbest commented 1 year ago

@GregPost-ASU, @egbot, @themerekat I'll add that the time for image download that @jhpoelen mentioned (5sec/image) is based on accessing the images using the public URL available in the SERNEC image records. Presumably there will be a much faster alternative for retrieving and copying images to a new platform.

jhpoelen commented 1 year ago

@jbest yes, the transfer rate estimates are based on measurements from the perspective of an unprivileged user using public access methods [1]. I am curious to learn more about other ways to access the referenced image content.

references

[1] Botanical Research Institute Texas (BRIT): Origins of BRIT collection records and associated images tracked in period 2022-06/2022-07. hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94 https://github.com/bio-guoda/preston-brit-2022 https://linker.bio/hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94

GregPost-ASU commented 1 year ago

@jhpoelen, @jbest We are working closely with CyVerse on how to migrate the data. We should be able to transfer directly from CyVerse's backend storage platform (vs. going through Bisque) so we expect the transfer to go pretty quickly.

jhpoelen commented 1 year ago

@GregPost-ASU great to hear alternate methods exist to access the images.

Can you elaborate on how to access the original images and bypass bisque?

Also, how are you planning to keep the various image size renderings up and running (e.g., thumbnails)?

And, how would you verify that your migration is actually complete?

And, how are you planning to redirect the referenced image urls embedded in previously published dwc-a to their new content location?

Many questions, and I am very interested in this process, as I expect this to happen over and over again as image services go belly up or get retired.

jhpoelen commented 1 year ago

I believe I have answers for all of these questions, and have solutions in place. So, to me, securing image access by performing verifiable migration (or data tracking) would be a fun and useful exercise to see how the https://github.com/bio-guoda/preston-brit-2022 example would scale up to SERNEC scale. Currently, I don't see any technical issues.
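
To make "verifiable migration" concrete, here is a minimal sketch of one way to check that a copied image tree is complete. It is not Preston's own mechanism: it only assumes GNU `find`, `sort`, `xargs` and `sha256sum`, and the function names `manifest` and `verify_migration` are made up for illustration. The idea is to compare sha256 content hashes of the source and destination files, path by path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a sorted sha256 manifest ("<hash>  ./relative/path")
# for every regular file under the directory given as $1.
manifest() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
}

# Hypothetical helper: succeed only if the source tree ($1) and the migrated
# tree ($2) contain byte-identical files under the same relative paths.
verify_migration() {
  [ "$(manifest "$1")" = "$(manifest "$2")" ]
}

# Usage (directory names are placeholders):
#   verify_migration ./images-original ./images-migrated && echo "migration verified"
```

A missing, extra, or altered file changes the manifest, so the comparison fails. For millions of images the manifests would more likely be written to files and compared incrementally, but the principle is the same.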

Curious to hear your thoughts.

jhpoelen commented 4 months ago

Closing issue until @GregPost-ASU @jbest et al. are willing/able to continue.