Closed jhpoelen closed 8 months ago
SERNEC RSS feed -
see prototype in development at https://github.com/bio-guoda/preston-sernec .
I created https://github.com/bio-guoda/preston-sernec . This repo contains today's snapshot of SERNEC associated dwc-a.
With that, I was able to estimate the total number of records with bisque images using:
https://github.com/bio-guoda/preston-sernec/blob/main/list-image-urls.sh
and
./list-image-urls.sh | tee image-urls.tsv
along with
cat image-urls.tsv | grep bisque | wc -l
to be:
9.99M
with an estimated 3.33M individual records, counted via:
cat image-urls.tsv | grep bisque | grep accessURI | wc -l
Given that the image transfer rate of Bisque is known to be 1 image per 5 seconds, it'll take:
10M * 5 / (3600 * 24) ≈ 578 days to migrate all the images.
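For anyone who wants to rerun the estimate with different numbers, here is a minimal sketch of the same arithmetic in shell. The variable names are only illustrative, and the inputs are the rounded figures from this thread (about 10M image URLs, about 5 seconds per image over public access), not fresh measurements.

```shell
# Rounded figures from the thread: ~10M image URLs, ~5 s per image via public URLs.
images=10000000
seconds_per_image=5
seconds_per_day=$((3600 * 24))

# Shell arithmetic is integer-only, so the result is truncated to whole days.
days=$((images * seconds_per_image / seconds_per_day))
echo "$days days"   # prints: 578 days
```

This assumes a single sequential download stream; any faster or parallel access path would change the result proportionally.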
fyi @themerekat - I am curious to learn about your plans to migrate the images from Bisque Cyverse to alternate locations. I've also looped in @jbest .
You'll want to loop Ed Gilbert and Greg Post into this conversation
as @themerekat suggested -
Ed @egbot / Greg @GregPost-ASU - what are your plans to migrate the images from Cyverse before their contract expires? How are you planning to prevent this kind of situation in the (near) future?
I am assuming that image storage services will continue to come and go.
@GregPost-ASU, @egbot, @themerekat I'll add that the time for image download that @jhpoelen mentioned (5sec/image) is based on accessing the images using the public URL available in the SERNEC image records. Presumably there will be a much faster alternative for retrieving and copying images to a new platform.
@jbest yes, the transfer rate estimates are based on measurements from the perspective of an unprivileged user using public access methods [1]. I am curious to learn more about other ways to access the referenced image content.
[1] Botanical Research Institute Texas (BRIT): Origins of BRIT collection records and associated images tracked in period 2022-06/2022-07. hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94 https://github.com/bio-guoda/preston-brit-2022 https://linker.bio/hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94
@jhpoelen, @jbest We are working closely with CyVerse on how to migrate the data. We should be able to transfer directly from CyVerse's backend storage platform (vs. going through Bisque) so we expect the transfer to go pretty quickly.
@GregPost-ASU great to hear alternate methods exist to access the images.
Can you elaborate on how to access the original images and bypass Bisque?
Also, how are you planning to keep the various image size renderings up and running (e.g., thumbnails)?
And, how would you verify that your migration is actually complete?
And, how are you planning to redirect the referenced image urls embedded in previously published dwc-a to their new content location?
Many questions, and I am very interested in this process, as I expect this to happen over and over again as image services go belly up or get retired.
I believe I have answers for all of these questions, and have solutions in place. So, to me, securing image access by performing verifiable migration (or data tracking) would be a fun and useful exercise to see how the https://github.com/bio-guoda/preston-brit-2022 example would scale up to SERNEC scale. Currently, I don't see any technical issues.
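To make the "verifiable migration" idea concrete, here is a rough sketch of one way to check completeness by comparing content hashes. This is not the Preston workflow from the linked repository, just a generic illustration. It assumes both the original images and the migrated copies are available as local directories, that GNU `sha256sum` is installed, and the function names are hypothetical.

```shell
# Hypothetical helper: print the sorted, de-duplicated sha256 hashes
# of every file below the given directory.
content_hashes() {
  find "$1" -type f -exec sha256sum {} + | awk '{ print $1 }' | sort -u
}

# Hypothetical check: print the hashes that occur in the source directory ($1)
# but not in the destination directory ($2).
# Empty output means every source image has a byte-identical copy.
missing_after_migration() {
  src_list=$(mktemp)
  dst_list=$(mktemp)
  content_hashes "$1" > "$src_list"
  content_hashes "$2" > "$dst_list"
  comm -23 "$src_list" "$dst_list"
  rm -f "$src_list" "$dst_list"
}
```

Calling `missing_after_migration ./original ./migrated` (directory names made up here) would list whatever still needs to be copied. Because the comparison is on content rather than on file names or URLs, it keeps working even if the new platform renames or reorganizes the images.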
Curious to hear your thoughts.
Closing issue until @GregPost-ASU @jbest et al. are willing/able to continue.
fyi @jbest
South East Regional Network of Expertise and Collections (SERNEC) Thematic Collection Network (TCN), a collaboration that is digitizing and making data accessible for over 3 million plant specimens.