NHMDenmark / DaSSCo-Tranche-1-work

Cropped .pngs in Species-Web #40

Open k-zamzam opened 1 month ago

k-zamzam commented 1 month ago

Species-OCR generates two .png files for each processed folder: one is the original image of the folder's label and the other is a thresholded version of that image.

These .png files are uploaded to Species-Web and stored inside the Docker container called nhmaspecies-web, where Species-Web is running. The original .png file is saved to /app/image/label and the thresholded version to /app/image/label_threshold.
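For reference, here is a minimal sketch of pulling those files out of the running container with `docker cp` via Python's subprocess module; the container name and source paths are taken from the description above, but the export destination is a placeholder:

```python
import subprocess
from pathlib import Path

# Placeholder export location; adjust to wherever the .pngs should be saved.
EXPORT_ROOT = Path(r"N:\species-web-exports")

# Container name and source directories as described above.
CONTAINER = "nhmaspecies-web"
SOURCE_DIRS = ["/app/image/label", "/app/image/label_threshold"]

def export_label_images():
    """Copy the label .pngs out of the running Species-Web container."""
    for src in SOURCE_DIRS:
        dest = EXPORT_ROOT / Path(src).name  # e.g. N:\species-web-exports\label
        dest.mkdir(parents=True, exist_ok=True)
        # "container:src/." copies the directory contents rather than the directory itself.
        subprocess.run(["docker", "cp", f"{CONTAINER}:{src}/.", str(dest)], check=True)

if __name__ == "__main__":
    export_label_images()
```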

@PipBrewer you would like us to save these .png files. Where do we want to store them, and for what purpose?

PipBrewer commented 1 month ago

At the moment, images and data are kept on the Species server (just folder images?) and the local workstation. As the high-res specimen images and associated metadata will be ingested and saved elsewhere (initially to the N drive, but eventually to ARS and ERDA), and the specimen data derived from OCR and GBIF will be exported (exports saved on the N drive) and imported into Specify, we do not need to keep that specimen data on the Species servers or the local workstation once it has been fully processed. BUT folder images are not currently ingested and saved outside of Species-Web/local workstation. If we delete data to save space (and we need to on local workstations), we lose the folder images as well.

However, it is possible that errors happen or mistakes are made during validation, and we would need to go back and look at the folder images later. Or we might even want to improve and test OCR algorithms and compare them against our original algorithms using this data set. Hence, it would be worth keeping copies of the folder images.

It might be less expensive to keep the cropped .pngs rather than the original folder images. On the other hand, if an image is incorrectly determined to be a folder, we would need to retrieve the full original image and ingest it. The latter is an argument for keeping the original folder images somewhere.

Where do we store them if we do want to keep them? Should we keep them on the N drive? Alternatively, should we ingest them into the ARS storage? Whilst the latter is probably ideal, we are unlikely to have the resources to do that, as it will involve extra coding of the pipelines (especially as we would need to identify them from within folders on the local workstation and come up with a protocol to associate them correctly with the other images in the ARS). Hence, the simplest short-term fix would be to save the cropped .pngs on the N drive, exported directly from the Species-Web server. We would delete everything else in Species-Web and the local workstation after ingestion and after initial checks (e.g., for images incorrectly identified as folders).
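If that short-term fix is adopted, the export-then-delete step could look roughly like the sketch below. All paths are placeholders, and it assumes the cropped .pngs are already reachable from the workstation (e.g. exported from the container as above):

```python
import shutil
from pathlib import Path

# Placeholder paths; assumes the label .pngs have already been exported
# (or bind-mounted) from the Species-Web container onto the workstation.
LOCAL_LABELS = Path(r"E:\species-web\label")
N_DRIVE_LABELS = Path(r"N:\DaSSCo\folder-labels")

def archive_then_delete():
    """Copy cropped label .pngs to the N drive, then delete the local copy
    only when the archived copy can be verified by size."""
    N_DRIVE_LABELS.mkdir(parents=True, exist_ok=True)
    for png in LOCAL_LABELS.glob("*.png"):
        target = N_DRIVE_LABELS / png.name
        shutil.copy2(png, target)  # copy2 preserves timestamps
        if target.exists() and target.stat().st_size == png.stat().st_size:
            png.unlink()  # safe to free space on the workstation
        else:
            print(f"Copy of {png.name} could not be verified; keeping local file.")

if __name__ == "__main__":
    archive_then_delete()
```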

@kimstp @bhsi-snm @beckerah Thoughts?

beckerah commented 1 month ago

It sounds like saving the cropped .pngs on the N drive is the easiest option, but with such a high probability of incorrectly identifying a specimen image as a folder, it will absolutely be necessary to have access to the full "folder" images as well. Can we potentially save all the cropped .pngs, as well as the .tiffs for images that are interpreted as folders? If we're concerned about space, we could even go back and delete images from the N drive that are older than a certain amount of time (maybe 3-6 months?).
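If a retention window like that is adopted, a scheduled cleanup could be as simple as the sketch below; the archive path and the 180-day cutoff are assumptions, not agreed values:

```python
import time
from pathlib import Path

# Placeholder archive location and retention window (3-6 months suggested above).
ARCHIVE_DIR = Path(r"N:\DaSSCo\folder-labels")
MAX_AGE_DAYS = 180

def purge_old_images():
    """Delete archived .pngs/.tiffs older than the retention window."""
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
    for image in ARCHIVE_DIR.rglob("*"):
        if image.is_file() and image.suffix.lower() in {".png", ".tif", ".tiff"}:
            if image.stat().st_mtime < cutoff:
                print(f"Deleting {image} (older than {MAX_AGE_DAYS} days)")
                image.unlink()

if __name__ == "__main__":
    purge_old_images()
```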

PipBrewer commented 1 month ago

@beckerah Rebekka and Matilde are in AU at the end of this month; perhaps you could have a Zoom call with one of them, and together you could look through the E drive on the AU workstation and think about how easy it would be to isolate the folder images and ingest them manually? Not sure if Aksel's program saves the subfolders it generates on the E drive or on a server. This is probably a no-go (too complicated and manual), but probably worth checking.

beckerah commented 1 month ago

Rebekka and I have scheduled a Zoom call for next Friday (Oct 25), while she's there, to take a look at their computer with Charlotte. This will be an exploratory chat where we look at their E drive and figure out how everything is currently being handled and stored, so we can problem-solve.