AAFC-BICoE / object-store-harvestor

MIT License

connecting conveyor images + identifiers + manually-captured data #51

Open heathercole opened 3 years ago

heathercole commented 3 years ago

Hi folks, @dshorthouse @cgendreau @ssbilkhu

I wanted to document this issue/use case, as I think it is relevant to a few different actions/issues in development. I think there are lots of possible solutions, but I am hoping WP3 can identify short- and long-term solutions/outcomes.

As we know, all the conveyor belt-generated images have the same name, with uniqueness coming from the containing folder. Each picture taken with the imaging system creates a uniquely named folder with the .cr2 and .jpg inside (all .cr2 and .jpg files have the same file names). Work is already under way on the requirements/workflows relating to the processing of these images, associated identifiers, and OCR-generated data.
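To illustrate why the folder name has to act as the key, here is a minimal Python sketch. It assumes a hypothetical layout with one sub-folder per capture directly under a root directory; the function name `index_captures` is an illustration only, not part of the imaging system or the harvester.

```python
from pathlib import Path


def index_captures(root):
    """Map each capture folder name to the image files it contains.

    Assumption: one sub-folder per capture directly under `root`,
    each holding one .jpg and one .cr2 (hypothetical layout).
    """
    index = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        # File names repeat across folders, so they are keyed by extension only.
        files = {p.suffix.lower(): p for p in folder.iterdir() if p.is_file()}
        index[folder.name] = {"jpg": files.get(".jpg"), "cr2": files.get(".cr2")}
    return index
```

Because the file names are identical across folders, the folder name is the only value in this index that distinguishes one capture from another.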

Apart from the issues listed above, WP1 has another issue. With COVID creating the requirement for remote work, we had to find a way to turn the above images into derivatives that could be shared with staff doing remote data entry (entering data from the specimen labels into a spreadsheet). To do this, lower-quality derivatives were made from the .jpg files using scripts (with help from Amadou's team); the derivatives were rotated and renamed based on (but not identical to) the name of each file's containing folder. The script is re-run every month or two so that images from the onsite imaging continue to be available for remote work.
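One way to keep the renamed derivatives traceable is to write a manifest at the same time as the rename. The sketch below is only an illustration: the renaming rule in `derivative_name` is hypothetical (the actual rule used by the script is not described here), and the column names are assumptions.

```python
import csv


def derivative_name(folder_name):
    # Hypothetical rule, for illustration only: the real script's
    # renaming rule is not described in this thread.
    return folder_name.strip().replace(" ", "_").lower() + ".jpg"


def write_manifest(folder_names, manifest_path):
    """Write one row per capture: original folder name and derivative file name."""
    with open(manifest_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["original_folder", "derivative_file"])
        for name in folder_names:
            writer.writerow([name, derivative_name(name)])
```

With a manifest like this, the derivative file name used during data entry can be mapped back to the original folder without having to reverse the renaming rule.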

Because the plan is to use OCR to connect the specimen barcode (the primary human-readable identifier) to the specimen image (and related data records), and because of the huge potential for human error when people do that manually, the 'identifier' used for the manual data capture is the image file name. I am hoping that we can review how we are going to connect the data capture to the barcode identifier and the image.

I am hoping that we can identify a plan to connect these pieces together. It isn't urgent, but I think a solution may both benefit from and guide decisions about the image and data processing development that is currently happening.

Besides connecting the disparate pieces, one idea may be to run the OCR script on these derivatives. That would connect the OCR-read barcode directly to the derivative file name, which in turn connects it directly to the data capture.

I will be importing the spreadsheet data into Specify as a holding place, so I am asking that someone advise on what 'identifier' should be used for that import. At the moment, I only have the folder-derived image file name. If something else makes more sense, I would need support to ensure all the pieces can be connected later.

If someone could identify a potential solution, or whether further discussion is needed, that would be great. Thanks.

dshorthouse commented 3 years ago

@cgendreau Can I get you to move this to the object-store-harvestor repo?

heathercole commented 3 years ago

I spoke with @dshorthouse, who identified that the 'sidecar' file will contain what he has labelled the 'original file directory' (or similar). That info is all that is needed to create a connection between the manually captured data and the specimen image, which is in turn connected to the barcode via the OCR script.

At some point, some action may be required to script a connection between the 'original file directory' and the derivative of it used as the identifier for the manual capture.
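A rough sketch of what such a scripted connection could look like, in Python. Everything here is an assumption for illustration: the field names `original_directory`, `image_file`, and `barcode` are hypothetical, the sidecar format is not specified in this thread, and `derive` stands in for whatever rule the renaming script actually applies.

```python
from pathlib import Path


def link_records(capture_rows, sidecar_records, derive):
    """Attach each manually captured row to the sidecar record it came from.

    capture_rows: dicts with an 'image_file' key (identifier used in data entry).
    sidecar_records: dicts with an 'original_directory' key (hypothetical name).
    derive: function mapping an original folder name to a derivative file name.
    """
    # Index sidecar records by the derivative file name they would produce.
    by_derivative = {}
    for rec in sidecar_records:
        folder_name = Path(rec["original_directory"]).name
        by_derivative[derive(folder_name)] = rec

    linked, unmatched = [], []
    for row in capture_rows:
        rec = by_derivative.get(row["image_file"])
        if rec is None:
            unmatched.append(row)  # keep for manual review rather than dropping
        else:
            linked.append({
                **row,
                "original_directory": rec["original_directory"],
                "barcode": rec.get("barcode"),  # None if OCR has not run yet
            })
    return linked, unmatched
```

Returning the unmatched rows separately makes it easy to spot data-entry rows whose file name does not correspond to any sidecar record.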

As long as the sidecar file maintains that data, the connection can be made without too much issue, so it doesn't seem like any further action is needed for this.