bio-guoda / preston

a biodiversity dataset tracker
MIT License

show example on how to process an image corpus offline or offsite #139

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

related to a conversation with @matdillen at the TDWG 2021 conference -

samples from live chat:

Jorrit Poelen @Quentin - I've experimented with streaming approaches to processing large image datasets without having to store all of it in one location. Preliminary results suggest that with limited compute/storage resources, you can still analyze a large, well-defined corpus of images of known provenance. You might already be familiar with the small example at https://jhpoelen.nl/bees .

Mathias Dillen @Jorrit: I noted before that some work on this was done in iDigBio, e.g., this abstract: https://doi.org/10.3897/biss.2.25699 . But was there any follow-up or other documentation? I've not been able to find any.

Jorrit Poelen The Collins et al. work led up to the prototype I shared earlier. The earlier approach was described in Thessen, A.E. et al., 2018. 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration. PeerJ Computer Science.

Documentation of the current approach (decentralized, verifiable, content-based data archives) can be found via https://jhpoelen.nl/bees and https://preston.guoda.bio . Happy to share more information if needed.

Erica Krimmel @Mathias, the GUODA resource referenced in that abstract is no longer active, but, like Jorrit said, it has led to other solutions. Another, more ad-hoc option is to work with everything in the same cloud environment, like Google Colab + FigShare: super fast access to image data and cloud computing. An example of that is this workshop the Smithsonian and iDigBio did at the last Botany conference: https://github.com/richiehodel/Botany2021_DLworkshop ^ that approach does involve storing all the images in one place, although not using your personal computing resources to do so

Jorrit Poelen Note that many of the technologies used in earlier instances of the GUODA infrastructure have been adopted by GBIF and other infrastructures, e.g., Apache Spark, Parquet-formatted datasets, Jupyter Notebooks, HDFS, etc. I found these tools powerful, but very expensive to maintain... and, ironically, they don't scale that well because of it.

Jorrit Poelen @Mathias / Erica - I'd love to have a live discussion on making it easier to access and cite image corpora. This is an active field of development, and hearing your perspectives would be very useful. Also, perhaps we already have pragmatic solutions to make it easier to analyze large image corpora without having to spend $$$ on hardware/staffing/network etc.

@Jorrit and Erica: thanks for the feedback. I wonder if work continued on offering the possibility of applying ML algorithms to iDigBio images (or other image stores) without a requirement to download/stream the images elsewhere? Erica's example still required copying the images to Google Drive, for example.

Mathias Dillen @Jorrit: Yes, we should have that discussion. We have an increasing trove of images, sitting on local servers and therefore difficult to process in bulk without local access (and hardware!).

Erica Krimmel +1 Jorrit and Mathias! I would get a lot out of hearing others' thoughts on this topic.

jhpoelen commented 2 years ago

fyi @ekrimmel

jhpoelen commented 2 years ago

@qgroom @seltmann do you happen to have a command-line image analysis program that takes some image data and produces some results (e.g., suggested classifications, trait measurements)? If so, I can put together an example of how to analyze all UCSB-IZC images (or any other dwc-a or collection of dwc-a's) in a single command.

matdillen commented 2 years ago

Maybe just tesseract? Or am I misunderstanding the question?

jhpoelen commented 2 years ago

@matdillen that might work. Do you have a specific example with input, output and the tools + arguments + config (model?) that connects the two? I am a bit of a tesseract newbie, so a copy-paste example would be great!

matdillen commented 2 years ago
sudo apt-get install tesseract-ocr
tesseract filename.jpg output

You can add other arguments, such as languages and output modes, but this is the simplest method. The output is a txt file with only the captured text.

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

jhpoelen commented 2 years ago

@matdillen thanks for taking the time to share the example. Great to see that tesseract plays nice on the command-line.

More later on the example of how to build an image-corpus OCR workflow on a well-defined, citable image dataset of known provenance.

PietrH commented 2 years ago

I dug up an example of how I actually use tesseract on herbarium specimens:

tesseract RED_Roses_3_005.jpg RED_Roses_3_005 -l fra+lat -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ()"

In this case I know the language of my specimen, but I also want to capture latin (for scientific names), I only allow a fixed character set to dodge things like Cyrillic characters. In this case I wasn't interested in numbers, but you could easily do the opposite to just get numbers. Tesseract output requires cleaning, so you want to optimize your command for the information you are looking to extract. There are also a number of preprocessing parameters you can edit. For the best results you can train it on a specific typeface.

In fact, I think pre-processing (things like scaling and setting a threshold) is an important step that often precedes (or comes built into) these tools, and one that could also be applied to an entire corpus as preparation for an analysis pipeline.
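
A sketch of that idea applied corpus-wide, assuming ImageMagick's `convert` and tesseract are installed; the grayscale/threshold step, directory names, and language flags are illustrative placeholders, not a tuned recommendation:

```shell
# Pre-process every JPEG in a corpus directory, then OCR the result.
# The threshold value and directory layout are placeholders.
mkdir -p preprocessed ocr
for f in corpus/*.jpg; do
  stem=$(basename "$f" .jpg)
  # grayscale + binarization often helps tesseract on printed labels
  convert "$f" -colorspace Gray -threshold 50% "preprocessed/$stem.png"
  tesseract "preprocessed/$stem.png" "ocr/$stem" -l fra+lat
done
```

Each image ends up with a matching `ocr/<stem>.txt` holding the captured text.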

jhpoelen commented 2 years ago

@matdillen @PietrH good to know! I agree that pre-processing steps are important to include. What kind of program do you typically use for that? ImageMagick? ffmpeg? If you have some real-world examples, I'd very much like to incorporate them into the example workflow.

jhpoelen commented 2 years ago

btw - would you happen to have the catalog number, collection code, institution code, and occurrence id for the herbarium example?

PietrH commented 2 years ago

@jhpoelen I use both ffmpeg and image magick, this was quite an old example I just pulled from my gists.

Turns out, it's not a herbarium specimen after all, but a page from a book in the collection of the Royal Botanical Garden of Madrid. That explains why I didn't want any digits: I didn't want to bother with getting rid of page numbers, but just wanted to extract the scientific names and vernaculars so I could couple those to their respective pages in the book.

jhpoelen commented 2 years ago

@PietrH Thanks for clarifying. In case you have a reference to the book and specific page, I'd be happy to include it. Also, good to know that ffmpeg and ImageMagick are continuing to flex their muscles after all these years.

matdillen commented 2 years ago

I used IrfanView before, but ImageMagick should work as well and be easier to implement. The pre-processing I did was mainly reducing dimensions if the megapixel count was too high and increasing JPEG compression if the file size was too high. My thresholds depended on the model I was running (Google Vision): 70MP and, if I recall correctly, 4MB.
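
A rough sketch of those thresholds with ImageMagick, assuming `identify` and `convert` are installed; the 50% resize factor, the quality value of 75, and the filename are illustrative:

```shell
# Downscale an image above 70 megapixels and recompress if the file
# is above 4 MB, mirroring the thresholds mentioned above.
f="specimen.jpg"
w=$(identify -format "%w" "$f")   # pixel width
h=$(identify -format "%h" "$f")   # pixel height
if [ $((w * h)) -gt 70000000 ]; then
  convert "$f" -resize 50% "$f"   # halve the linear dimensions
fi
size=$(wc -c < "$f" | tr -d ' ')
if [ "$size" -gt 4000000 ]; then
  convert "$f" -quality 75 "$f"   # increase JPEG compression
fi
```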

If you want a set to work on, this one has some language metadata as well. It may be a bit tricky to scrape it all off of Zenodo though.

jhpoelen commented 2 years ago

@matdillen thanks for pointing out your publication at:

Dillen Mathias. (2018). A benchmark dataset of herbarium specimen images with label data: Summary [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3697797

In looking at Data and links.csv row 4, I found reference to specimen http://data.rbge.org.uk/herb/E00065282 and associated media.

On retrieving the original digital image associated with the specimen using

$ curl -L "http://repo.rbge.org.uk/image_server.php?kind=1500&path_base64=L2hlcmJhcml1bV9zcGVjaW1lbl9zY2Fucy9FMDAvMDY1LzI4Mi81NzIyMzcuanBn" > E00065282_orig.jpg

and comparing it to the content from your associated publication via:

$ curl -L https://zenodo.org/record/1483641/files/E00065282.jpg > E00065282_zenodo.jpg

I found that the two images are different:

$ ls -1 E00065282* | xargs -L1 sha256sum
5db4e3a422df8796a59aee7aa3cc0ca12451aa690193ac374493ae1775ce17ec  E00065282_orig.jpg
d22eedf8781e15a2ce48bd2680f55d466b465306a5a6cd640b11e75868c9d183  E00065282_zenodo.jpg

I would have expected the original image to be bigger, but found the opposite:

$ ls -lha E00065282* 
[...] 111K Mar 17 11:27 E00065282_orig.jpg
[...] 1.4M Mar 17 11:27 E00065282_zenodo.jpg

I am assuming that you did some kind of processing to the original image. Did you keep the original as well as the processed image?

matdillen commented 2 years ago

If I recall correctly, I acquired those images as archive quality TIFFs and converted them myself to 50% quality JPEGs. So that would explain the difference. It's also possible that the IIIF presentation on their portal causes some changes.

Those TIFFs are also on Zenodo. I don't know what transformations RBGE did to generate their JPEGs.

jhpoelen commented 2 years ago

@matdillen thanks for clarifying. What is the provenance or origin of the TIFFs ? Did you have special access, or are these images available openly? Am curious to learn more about how the pieces of your extensive image corpus fit together.

matdillen commented 2 years ago

It varies. For RBGE, they allow users to acquire TIFFs through their portal. See the options at the bottom here: https://data.rbge.org.uk/search/herbarium/?specimen_num=70338&cfg=zoom.cfg&filename=E00065282.zip

I don't know for sure how it is set up, but it's possible they use TIFFs to generate the images presented through their IIIF endpoint. The 'Get TIFF' service in their portal then allows access to those TIFFs through a PHP script. I did not acquire the TIFFs this way, rather we transferred them ad hoc through a temporary folder in Google Drive. I haven't used the script in their portal, but I presume there are some limits set up for the 'get TIFF' script.

Most other collections that contributed to this dataset did it differently. I don't have detailed info on how they store TIFFs locally, but I can tell how I acquired them (always after personal contact with someone at the institution):

jhpoelen commented 2 years ago

@matdillen awesome to see how different institutions use their creativity and find ways to share their valuable digital assets. Thanks for elaborating - to me this shows the value of curating these image corpora and carefully documenting their origins.

jhpoelen commented 2 years ago

As I was working towards getting a demo together, I noticed that some of the URLs in your csv index look a little suspicious.

For instance:

$ preston track "https://zenodo.org/record/3697797/files/Data%20and%20links%20excl%20extensions.csv"
[...]
$ preston ls | preston grep "0003200.tif"
<urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> .
<urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:402f5d61-64c6-4af3-a4c1-a474c86cf3bc> <urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> .
<urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> <http://www.w3.org/ns/prov#used> <hash://sha256/e1ef80add87b0493d31720e4f11bfdbd3e353d2b0a9f8727c9814e825912a362> <urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> .
<urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> <http://purl.org/dc/terms/description> "An activity that finds the locations of text matching the regular expression '0003200.tif' inside any encountered content (e.g., hash://sha256/... identifiers)."@en <urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> .
<line:hash://sha256/e1ef80add87b0493d31720e4f11bfdbd3e353d2b0a9f8727c9814e825912a362!/L1700> <http://www.w3.org/ns/prov#value> "PreservedSpecimen,Ficus membranacea C. Wright,species,JM,http://herbarium.bgbm.org/object/B100003200,B 10 0003200,Ficus,membranacea,Jamaica,1894,1894-02-15,\"Jamaica, Blue Mountains\",\"Harris,W.\",,,,,B,1894-02-15,,,M. Vásquez Avila,,Middle and South America,,,,,,5221,,,,,,,,,,,,C. Wright,Plantae,,,Moraceae,,Herbarium Berolinense,,Plantae; Moraceae,,B,,Middle and South America,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,http://herbarium.bgbm.org/object/B100003200,,,,,,,,,,,,,,eng,,http://herbarium.bgbm.org/object/B100003200,https://zenodo.org/record/1479068/files/B 10 0003200.jpg,https://zenodo.org/record/1479068/files/B 10 0003200.tif,https://zenodo.org/record/1479068/files/B 10 0003200.json,https://zenodo.org/record/1479068/files/B 10 0003200_all.png,https://zenodo.org/record/1479068/files/B 10 0003200_sel.png,http://dx.doi.org/10.5281/zenodo.1479068,," <urn:uuid:41ecfda0-8086-47de-bc22-bf64cd5d3a9c> .

shows that line 1700 contains URLs with whitespace in them.

For instance:

https://zenodo.org/record/1479068/files/B 10 0003200.tif

Zenodo's webserver does not like such a URL when requested via curl:

$ curl -I "https://zenodo.org/record/1479068/files/B 10 0003200.json"
HTTP/1.0 400 Bad request
Cache-Control: no-cache
Connection: close
Content-Type: text/html

I would expect whitespace in URLs to be escaped using %20, and indeed the escaped URL works:

$ curl -I "https://zenodo.org/record/1479068/files/B%2010%200003200.tif"
HTTP/1.1 200 OK
Server: nginx
Content-Type: image/tiff
Content-Length: 48026662
Content-MD5: 1fd820b4b016ae799c6a36c45da2f68a
Content-Security-Policy: default-src 'none';
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Permitted-Cross-Domain-Policies: none
X-Frame-Options: sameorigin
X-XSS-Protection: 1; mode=block
Content-Disposition: attachment; filename="B 10 0003200.tif"
ETag: "md5:1fd820b4b016ae799c6a36c45da2f68a"
Last-Modified: Wed, 16 Mar 2022 14:35:25 GMT
Date: Fri, 18 Mar 2022 14:08:54 GMT
Accept-Ranges: none
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1647612595
Retry-After: 60
Strict-Transport-Security: max-age=0
Referrer-Policy: strict-origin-when-cross-origin
Set-Cookie: session=2eb12e8dbd8aab02_62349276.TBC7mkxax-Te-Pc22uWh1gODNTw; Expires=Mon, 18-Apr-2022 14:08:54 GMT; Secure; HttpOnly; Path=/
X-Session-ID: 2eb12e8dbd8aab02_62349276
X-Request-ID: e6c51a3512c0f51588031b6d2e1153eb

Any chance I can convince you to publish an updated version of the summary publication?

jhpoelen commented 2 years ago

@seltmann suggested that processing in Google's AutoML https://cloud.google.com/products/ai https://cloud.google.com/automl might be useful.

So, I was thinking, in addition to tesseract and similar command-line tools, you should be able to do

track images | automl | track results

matdillen commented 2 years ago

@jhpoelen I'll update the Zenodo record so the URLs have formatted spaces. Still, I thought there was a way curl would do this for you?

jhpoelen commented 2 years ago

@matdillen Thanks for making the update to your image corpus index file. For some reason, curl doesn't insert %20 when it encounters whitespace in a URL. I believe that browsers like Firefox and Chrome do reformat URLs when whitespace is encountered.
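
A minimal client-side workaround, in case anyone hits the old URLs, is to percent-encode the spaces before handing the URL to curl; a pure-shell sketch:

```shell
# Percent-encode spaces in a URL, then fetch it with curl.
url="https://zenodo.org/record/1479068/files/B 10 0003200.tif"
encoded=$(printf '%s' "$url" | sed 's/ /%20/g')
echo "$encoded"
# prints https://zenodo.org/record/1479068/files/B%2010%200003200.tif
# curl -I "$encoded"
```

Note this only handles spaces; a general-purpose encoder would need to cover the full RFC 3986 reserved character set.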

jhpoelen commented 2 years ago

@matdillen @PietrH Am in the process of building the "dillen 2018" image dataset... it takes a day or so to retrieve the TIFF images from Zenodo across the Atlantic via my 500Mb/s fiber internet connection. Once this is done, you should be able to clone the images from whatever image repository is convenient (e.g., local hard disk, Internet Archive, institutional image repository), or just use Zenodo as a "remote" image repository. This way, you'd have a lightweight publication including the JSON snippets and CSV summary, and the ability to inflate the images when needed via secure references by md5 hashes.

In order to make this work I had to implement #152 and #149 . Your herbarium image corpus has really helped to drive the development of these Preston features.

Can you feel the excitement?

jhpoelen commented 2 years ago

@matdillen @PietrH you can now find an example of how to process the Dillen 2018 image corpus via:

https://github.com/bio-guoda/preston-dillen-2018

In the readme, I've included a method to generate a thumbnail from a TIFF image in your corpus.

Please confirm that you can reproduce and curious to hear your thoughts.

jhpoelen commented 2 years ago

@matdillen asked the following questions in a separate thread -

  • How long did it take you to scrape the set off of Zenodo?

start of process:

$ preston ls --algo md5 --remote https://raw.githubusercontent.com/bio-guoda/preston-dillen-2018/main/data | grep -o -E "2022-03-[0-9]{2}T[0-9:.]*" | head -n1
2022-03-23T21:16:44.606

end of process:

$ preston ls --algo md5 --remote https://raw.githubusercontent.com/bio-guoda/preston-dillen-2018/main/data | grep -o -E "2022-03-[0-9]{2}T[0-9:.]*" | tail
2022-03-24T18:59:50.907
2022-03-24T19:00:38.611
2022-03-24T19:01:17.955
2022-03-24T19:01:58.863
2022-03-24T19:02:30.410
2022-03-24T19:03:05.947
2022-03-24T19:48:23.438
2022-03-24T19:48:25.162
2022-03-25T13:46:15.043
2022-03-25T13:49:01.494

Note that the last timestamps are spaced out due to manual patching of the intermittently available Zenodo assets (see https://github.com/bio-guoda/preston-dillen-2018/issues/1).

So, all in all, the period was about 2022-03-23T21:16:44/2022-03-24T19:48:25, or about 24 hours. The process ran sequential downloads from a 10+ year old laptop in the US Midwest using a consumer fiber internet connection over wifi.

  • Any other access problems than the weird 404s already discussed in the issue on github?

You can find missing content by looking for skolemized blank hasVersion statements. Here are two:

$ preston ls --algo md5 --remote https://raw.githubusercontent.com/bio-guoda/preston-dillen-2018/main/data | grep "well-known" | grep hasVersion | head -n2
<https://zenodo.org/record/1485797/files/K000242782.json> <http://purl.org/pav/hasVersion> <https://deeplinker.bio/.well-known/genid/701a8262-1549-32d1-a95b-fed8cc0b811d> <urn:uuid:71328f10-6f7d-4051-b883-f4127af1248a> .
<https://zenodo.org/record/1483808/files/E00189317.tif> <http://purl.org/pav/hasVersion> <https://deeplinker.bio/.well-known/genid/bbde6fe8-023d-3d88-b04f-d960d125e420> <urn:uuid:805bd303-f22d-43aa-9601-08a394a3bbd3> .

I re-ran the tracker for missing content until all eventually resolved.

The total number of "blanks" was 19:

$ preston ls --algo md5 --remote https://raw.githubusercontent.com/bio-guoda/preston-dillen-2018/main/data | grep "well-known" | grep hasVersion | wc -l
19

I haven't carefully looked at the root cause of all of them; I just reported the one in https://github.com/bio-guoda/preston-dillen-2018/issues/1 .
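
The re-run-until-resolved step can be sketched as a loop over that same blank count. This assumes the preston commands shown above; the `preston track` invocation inside the loop re-tracks the index CSV quoted earlier in this thread and is a placeholder for whatever re-track step fits your setup:

```shell
# Keep re-tracking until no skolemized blank hasVersion statements remain.
REMOTE="https://raw.githubusercontent.com/bio-guoda/preston-dillen-2018/main/data"
while :; do
  blanks=$(preston ls --algo md5 --remote "$REMOTE" \
    | grep "well-known" | grep -c hasVersion)
  [ "$blanks" -eq 0 ] && break
  echo "$blanks unresolved item(s), re-tracking..."
  # placeholder: re-track the index that references the missing content
  preston track "https://zenodo.org/record/3697797/files/Data%20and%20links%20excl%20extensions.csv"
done
```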

  • How scalable do you think this would be?

You can parallelize the tracking process across different machines, so I'd say the process scales on the client side. However, I imagine the Zenodo infrastructure has some limits. Note that once you make a location-agnostic version of your image corpus, you can also scale the data-provider end of things, bit-torrent style, or by using Content Distribution Network (CDN) strategies. The tricky part is being careful to make a true mirror of the original location-based dataset.
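
For instance, a simple way to parallelize the client-side fetching, assuming a file `urls.txt` with one percent-encoded content URL per line; four workers is an arbitrary choice:

```shell
# Fan the download list out over 4 parallel curl workers,
# one URL per curl invocation.
xargs -n 1 -P 4 curl -s -O -L < urls.txt
```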

  • Could be interesting to try and harvest the other Zenodo dataset compiled in the ICEDIG project, https://zenodo.org/communities/belgiumherbarium/ . This one is ca. 250-260k records, each with an image (JPEG, no TIFF) and sparse metadata (although you could enrich these with GBIF data). Quite a few things went wrong during the publication process of this dataset, though, so a significant number of records will be broken at different steps of the publication process. Due to leaks in the logging, I didn't have a clear picture at the time of which records were faulty.

Yes, tracking your https://zenodo.org/communities/belgiumherbarium/ image corpus sounds like fun. I can see how this process can serve many purposes: reviewing the mobility of large image datasets, demonstrating the ability to cite large datasets without having to keep the data in a single location, or reviewing the access dynamics of data repositories and the network that connects them to the data tracker.

  • I wonder how this approach holds up to taking images off of GBIF's image cache or from the distributed image URLs as indexed on GBIF.

I don't see any issue on the Preston side of things. Preston implements content addressed storage in combination with lightweight, stream-enabled, provenance logs that are securely linked. Preliminary results show that this design strategy fits nicely into the massively scalable infrastructures built on Apache Kafka, Apache Spark, (content-addressed) blob storage, and friends.

But hey, let's give it a try!

Right now, my focus is on @seltmann 's https://big-bee.net/ to build image corpora and my collaboration with University of Florida (@mielliott) on the Preston core. And your use-case / image corpus fits nicely for both projects.

Curious to hear your thoughts and ideas!

qgroom commented 2 years ago

I hope you will blog about all of this when you're done ;-)

jhpoelen commented 2 years ago

@qgroom great idea! What do you think would be a good blog for this?

jhpoelen commented 1 year ago

Hey @matdillen - how's life? I have a 200GB image dataset of yours sitting on one of my hard disks. And, I was wondering - what would you like to do with this copy of the images that you also posted on https://github.com/bio-guoda/preston-dillen-2018 ? Or would you be ok if I ditch it?

jhpoelen commented 1 year ago

Also see Poelen, Jorrit H., & Best, Jason. (2023, June 2). Signed Biodiversity Data Packages: A Method to Cite, Verify, Mobilize, and Future Proof, Large Image Corpora. hash://sha256/0154b9ddce4d2e280e627a08d1a2d42884201af6ac1ec19606e393deda57f4bb hash://md5/bae7f441cdd2648d2356b2330e4b71e8. Zenodo. https://doi.org/10.5281/zenodo.7998190 .