SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

Enhancement - New Image Task & Grid digitizer should auto-recognize QR Codes on catalog numbers #1796

Open tmcelrath opened 3 years ago

tmcelrath commented 3 years ago

As a user of barcodes, I would like TW to automatically read barcode-based catalog numbers off catalog labels with QR codes, and to automatically assign catalog numbers and namespaces to specimens.

The QR code reader would:

- read the QR code;
- if it is in a structured namespace:number format, prompt the user for a namespace mapping (or let the user set a default?);
- check for duplicates;
- then, upon collection object creation, automatically add the catalog number to the collection object.
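The flow above could be sketched like this. This is illustrative Ruby only, not TaxonWorks code: `NAMESPACE_MAP` and `parse_qr_payload` are hypothetical names, and the mapping stands in for whatever the user configures in the UI.

```ruby
# Hypothetical sketch: parse a QR payload such as "INHS Insect Collection 1008451"
# into a prefix and a trailing number, then look the prefix up in a
# user-configured prefix-to-namespace mapping before assigning it.
NAMESPACE_MAP = {
  "INHS Insect Collection" => "INHS-Insects" # user-selected namespace short name
}.freeze

def parse_qr_payload(payload)
  # Lazily capture everything before the trailing run of digits as the prefix.
  match = payload.match(/\A(?<prefix>.*?)\s*(?<number>\d+)\z/)
  return nil unless match

  namespace = NAMESPACE_MAP[match[:prefix]]
  return nil unless namespace # here the UI would prompt the user to map the prefix

  { namespace: namespace, identifier: match[:number] }
end

parse_qr_payload("INHS Insect Collection 1008451")
# => { namespace: "INHS-Insects", identifier: "1008451" }
```

The duplicate check and the actual assignment to the collection object are omitted; they would happen against the database at creation time.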

tmcelrath commented 3 years ago

(attached image: img001) We are now using barcodes like this. Use this as an example?

LocoDelAssembly commented 3 years ago

Just adding this to keep in mind: the QR code encodes the text as a single line (e.g. "INHS Insect Collection 1008451"). Perhaps a UI could list common prefixes and let the user map them to namespaces (with the possibility to correct the prefix; in this example, if no other filtering is considered, one common prefix would be "INHS Insect Collection 10084", for instance).
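The prefix-listing idea could work roughly like this. An illustrative sketch only; `candidate_prefixes` is a hypothetical helper, and in practice the UI would let the user shorten or correct each suggested prefix.

```ruby
# Illustrative sketch (not TaxonWorks code): derive candidate prefixes from a
# batch of decoded QR strings so a UI could list them for namespace mapping.
def candidate_prefixes(decoded_strings)
  decoded_strings
    .map { |s| s.sub(/\s*\d+\z/, "") }  # drop the trailing number
    .tally                              # count occurrences of each prefix
    .sort_by { |_, count| -count }      # most common prefixes first
end

candidate_prefixes([
  "INHS Insect Collection 1008451",
  "INHS Insect Collection 1008452",
  "CASENT1234567"
])
# => [["INHS Insect Collection", 2], ["CASENT", 1]]
```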

mjy commented 3 years ago

@LocoDelAssembly I'm not sure I understand the issue. The QR code should be identical to Identifier#cached if we are doing it right. Your catalogNumber code should handle the string? How it is written in human-readable form could vary.

LocoDelAssembly commented 3 years ago

The DwC code would strip out "INHS Insect Collection" only if the namespace (previously selected via the UI) has a short_name (or verbatim) and/or delimiter matching the prefix text in the QR code.

I was also thinking about multiple namespaces in the same picture. If only one is expected, then perhaps just let the user type in the prefix text to discard (with some previewing, if possible) and provide a namespace picker.

Also, consider cases where /\D*(\d+)$/ is not compatible with the type of catalog "numbers" collections use.
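To make the concern concrete, here is how /\D*(\d+)$/ behaves on a few catalog-number shapes. The "INHS" and "CASENT" strings are from this thread; "NHMUK 010812345a" and "USNM 12-3456" are invented examples of the letter-suffixed and hyphenated styles some collections use.

```ruby
# Where /\D*(\d+)$/ cleanly captures the number, and where it does not.
pattern = /\D*(\d+)$/

"INHS Insect Collection 1008451"[pattern, 1] # => "1008451" (works)
"CASENT1234567"[pattern, 1]                  # => "1234567" (works)
"NHMUK 010812345a"[pattern, 1]               # => nil       (letter suffix defeats the end anchor)
"USNM 12-3456"[pattern, 1]                   # => "3456"    (hyphenated number is silently truncated)
```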

mjy commented 3 years ago

In the back end we can isolate each square; this isn't a problem, so we can get the exact text we need.

tmcelrath commented 1 year ago

Any chance we can re-prioritize this? I had a summer intern accidentally add numbers non-sequentially to several thousand slides, and it would majorly help one of my hourlies, and save time, if TW automatically populated the catalog number.

tmcelrath commented 1 year ago

This barcode scanner project might have scripts you can use: https://github.com/hyperoslo/BarcodeScanner

tmcelrath commented 1 year ago

Here is the script that Inselect uses: https://github.com/NaturalHistoryMuseum/inselect/blob/master/inselect/scripts/read_barcodes.py

tmcelrath commented 1 year ago

From @dshorthouse: ZBar is a good Barcode/QR code reader, https://zbar.sourceforge.net/. And, a ruby gem wrapper, https://github.com/willglynn/ruby-zbar. I've toyed with this and have noticed that it performs best with a PGM file format so I use MiniMagick to produce it in-memory before passing to ZBar. See https://github.com/dshorthouse/flatplant-barcodes/blob/main/barcode.rb#L15.

ChrisGrinter commented 1 year ago

Yes, this would be a powerful workflow step. Inselect has failed to rename CAS images with Data Matrix labels; our labels read, with no space, as "CASENT1234567", "CASLOT1234567", or "CASTYPE1".
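For no-space labels like these, splitting on the trailing digits against a known, ordered prefix list is one option. A hypothetical sketch, assuming the institution's prefixes are fixed and checked longest-first so "CASENT" wins over a bare "CAS":

```ruby
# Hypothetical sketch: split labels such as "CASENT1234567" into
# [prefix, number] using a fixed list of known prefixes, longest first.
CAS_PREFIXES = %w[CASTYPE CASENT CASLOT CAS].freeze

def split_cas_label(label)
  prefix = CAS_PREFIXES.find { |p| label.start_with?(p) }
  return nil unless prefix # unknown prefix: leave it for the user to map

  [prefix, label.delete_prefix(prefix)]
end

split_cas_label("CASENT1234567") # => ["CASENT", "1234567"]
split_cas_label("CASTYPE1")      # => ["CASTYPE", "1"]
```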

mjy commented 1 year ago

I ran experiments with ZBar a while back; it would take some updates all the way down to the C level, IIRC, but we'll see about approaching it again.

dshorthouse commented 1 year ago

@mjy Yep, I found the same - it threw errors on occasion when the input was jpg (plus, recognition wasn't superb). However, if you're able to marry it with MiniMagick and convert to pgm in memory, it's reasonably good. You'll still likely require segmentation before passing in the many PGMs that would be produced from the example image in this issue.

mjy commented 1 year ago

@dshorthouse thanks. Our Grid digitizer segments each slide (actually it's slide-agnostic) semi-automatically, so that is resolved. It would be nice, however, to do further ROI isolation for just the label containing the barcode. Suggestions welcome as to how to do that.

dshorthouse commented 1 year ago

@mjy You'd likely have to use OpenCV for this to find edges (is that what you use for the slide segmentation?). I bet you could do a reasonably good job with a white / less-white threshold kinda like this: https://github.com/dshorthouse/conveyor-imaging-pipeline/blob/main/lib/sidecar.rb#L135. Note that the ruby OpenCV world is a mess. I've had best success with ropencv (& not 2-3 others I found), but had to fork it & fix it so it'd play nice with OpenCV3, https://github.com/dshorthouse/ropencv.

Addendum:

Actually, looks like my fork may have merely been including the missing barcode extension, assuming I could just use it and forgo ZBar altogether. Sadly, the bindings are missing in ropencv so that was a bust.
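The white / less-white threshold idea mentioned above can be illustrated in pure Ruby. This is only a logic sketch with a toy pixel grid, not the linked pipeline; a real implementation would use OpenCV (or MiniMagick) on actual image data.

```ruby
# Illustrative sketch of threshold-based ROI isolation: given a grayscale
# image as a 2D array (0 = black, 255 = white), find the bounding box of
# all pixels darker than the threshold, i.e. the "less-white" region.
def dark_bounding_box(pixels, threshold: 200)
  ys = []
  xs = []
  pixels.each_with_index do |row, y|
    row.each_with_index do |value, x|
      next if value >= threshold # skip white-ish background

      ys << y
      xs << x
    end
  end
  return nil if ys.empty? # nothing darker than the threshold

  { top: ys.min, bottom: ys.max, left: xs.min, right: xs.max }
end

image = [
  [255, 255, 255, 255],
  [255,  40,  60, 255],
  [255,  50, 255, 255],
  [255, 255, 255, 255]
]
dark_bounding_box(image) # => { top: 1, bottom: 2, left: 1, right: 2 }
```

The resulting box would then be cropped out and handed to the barcode reader, rather than passing it the full image.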

mjy commented 1 year ago

@dshorthouse

> is that what you use for the slide segmentation?

No, we draw a preset layout that the user can tweak manually between images for subtle differences, if need be; it works really well. The layout can be saved as a preference, too.

That UI is all a completely agnostic library btw, designed for anyone to use: https://github.com/SpeciesFileGroup/sled

dshorthouse commented 1 year ago

@mjy Right, so it's a question of scale & workflow. In my case, I had 400k images whose barcodes needed to be "read" because there was no appreciable transcription at that stage; barcode stickers were affixed to herbarium sheets just prior to image capture via a conveyor belt system. And, I had access to a cluster. The upfront investment in getting libraries like ZBar and OpenCV to work made economic sense because I could then hand-off the few residuals to humans and then it need not ever be run again. In your case, you'll have to evaluate if any/all of this is worth that investment because it'd introduce technical debt, only occasionally triggered by users.

So...

What about something like this? https://gruhn.github.io/vue-qrcode-reader/.

mjy commented 1 year ago

@dshorthouse we did a little looking around a couple months back, I think the Vue library came up, thanks for the reminder.

There are over 8k slide folder images; at ~12 slides each, that's over 100k. Not sure how many of those have barcodes, but many more are coming. And the drag-and-drop add-image interface (which lets you drop many at a time) has added over 15k images without any help from us in the last couple of months, so it's worth our time to get this figured out.

What does bug me is that we don't yet have this figured out in standard, web-accessible libraries, so that all software packages could be using the same bits in the digitization workflow. This is standard-practice stuff, but obviously not mission critical, or we would have it already!?

ChrisGrinter commented 1 year ago

@mjy this would be a feature that could help bring CAS on board for our BigBee TCN. All of our images have QR labels, and we could just start dragging these photos into TW for skeletal record uploads, plus our ~30k TPT images. But this, plus a long list of other "not mission critical" issues, is keeping us at bay, since current workflows in Symbiota are much more efficient. Dusting off data soon and hope to engage more!

mjy commented 1 year ago

> since current workflows in Symbiota are much more efficient.

We really need to do a deep dive and better understand where the key innovations are in the software (as opposed to the SOP for how the digitization physically happens). Does Symbiota have a report, like the one in TaxonWorks, that estimates time per specimen? IIRC it does. This would give us a basis to understand how far off we are, perhaps.

tmcelrath commented 1 year ago

From Ben Price: Just a quick note that the developer of Inselect also made Gouda (https://github.com/NaturalHistoryMuseum/gouda/releases/tag/v0.1.13) which can be called from the command line to read barcodes. Hopefully that can be incorporated into your workflow.

https://github.com/NaturalHistoryMuseum/gouda

ChrisGrinter commented 1 year ago

> We really need to do a deep dive and better understand where the key innovations are in the software (as opposed to the SOP as to how the digitization physically happens). Does Symbiota have a report like in TaxonWorks that estimates time/specimen, IIRC it does? This would give us a basis to understand how far off we are, perhaps.

In the Add Skeletal Record feature you get a clock that calculates your specimens/minute, etc. Having it on screen during digitization is a nice feature and gets people to go faster. The skeletal feature in Symbiota is nice but not ideal, which pivots this thought into a new feature request. The Simple New Specimen task in TW is a step in the right direction, but it could be made into a super powerful and flexible tool for multiple workflows (I should post this, or look into existing issues).