Option to skip generating certain derivatives?

bondjimbond commented 6 years ago

In cases of paged content objects, where default behaviour is to generate OCR, it would be great to have the option to skip the OCR step.

In my case, a batch ingest of newspaper issues failed to create JPEG and TN derivatives for the Newspaper Page objects. OCR went through just fine.

Using this module, I want to generate the image derivatives without regenerating the OCR, as OCR is resource-intensive, takes a long time, and is unnecessary (since OCR is already there).

Would it be possible to add an option to the islandora_datastream_crud_generate_derivatives command that skips generating certain DSIDs? e.g. --skip_dsid=OCR

mjordan commented 6 years ago

Yes, it might be, using an implementation of hook_islandora_derivative_alter(). Let me look a bit deeper.

mjordan commented 6 years ago

@bondjimbond can you test the issue-46 branch? Documentation is in the "Triggering derivative generation" section of the readme, but TLDR is

drush islandora_datastream_crud_generate_derivatives --user=admin --source_dsid=OBJ --pid_file=/tmp/issue-46-pids.txt --skip_dsids=TN

In this case, the TN should not be regenerated but the rest of the derivative datastreams should be.
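For repeated test runs, the command above can be assembled by a small wrapper so only the PID file and skip list change. This is a sketch only: `build_cmd` is a hypothetical helper, not part of islandora_datastream_crud, and it uses only the options shown in the command above.

```shell
#!/bin/sh
# build_cmd PID_FILE SKIP_LIST
# Hypothetical helper: prints the drush command instead of running it,
# so it can be reviewed first.
build_cmd() {
  pid_file=$1   # path passed to --pid_file
  skip=$2       # comma-separated DSIDs passed to --skip_dsids
  printf 'drush islandora_datastream_crud_generate_derivatives --user=admin --source_dsid=OBJ --pid_file=%s --skip_dsids=%s\n' "$pid_file" "$skip"
}

# Dry run: print the command for the PID file used in this thread.
build_cmd /tmp/issue-46-pids.txt TN
```

Piping the printed line to `sh` would execute it; printing first keeps a long-running batch from starting with the wrong skip list.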

I haven't tested this during other operations like batch ingest to make sure that its limiting of derivative generation doesn't bleed over, but I'll do that over the holidays.

mjordan commented 6 years ago

@bondjimbond hold off, I found a problem. Addressing.....

mjordan commented 6 years ago

OK, back on track. Test when able.

bondjimbond commented 6 years ago

First test: skip the TN datastream.

Method:

1. Delete the JP2, JPG, and TN datastreams.
2. Run islandora_datastream_crud_generate_derivatives with --skip_dsids=TN.

Result:

Please be patient, generating derivatives from the OBJ datastream for islandora:71 [ok]
WD islandora_ocr: Tesseract failed to create an HOCR datastream. [error]
Error: 1
Command: /usr/bin/tesseract /tmp/islandora_71_OBJ.tif /tmp/islandora_71_OBJ.tif -l eng hocr 2>&1
Output: Tesseract Open Source OCR Engine v3.03 with Leptonica
TIFFstream: Sorry, can not handle image.
Error in pixReadFromTiffStream: failed to read tiffdata
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/islandora_71_OBJ.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II cannot be read!
Error during processing.
WD islandora_ocr: Tesseract failed to create OCR datastreams. [error]
Error: 1
Command: /usr/bin/tesseract /tmp/islandora_71_OBJ.tif /tmp/islandora_71_OBJ.tif -l eng 2>&1
Output: Tesseract Open Source OCR Engine v3.03 with Leptonica
TIFFstream: Sorry, can not handle image.
Error in pixReadFromTiffStream: failed to read tiffdata
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/islandora_71_OBJ.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II cannot be read!
Error during processing.
Created JP2 derivative. [ok]
Created TECHMD derivative for (islandora:71). [ok]
Failed to generate HOCR from OBJ for islandora:71. [error]
Generated PDF on islandora:71. [ok]
Failed to generate OCR from OBJ for islandora:71. [error]
Created JPG derivative. [ok]

bondjimbond commented 6 years ago

So -- successfully generated the JP2 and JPG datastreams, successfully skipped the TN datastream. But seems to have some trouble with the OCR; I'm not sure if that's because OCR already exists, or not. Will try again after deleting OCR.

bondjimbond commented 6 years ago

Test 2: Deleted JP2, JPG, TN, and OCR. Tried again with --skip_dsids=TN

Result: Same error - failed to generate OCR.

bondjimbond commented 6 years ago

Test 3: Checked out the 7.x branch and ran again. Same OCR problem, so I guess it's nothing to do with the new code.

bondjimbond commented 6 years ago

Test 4: Back to issue-46. --skip_dsids=OCR

Result: Success; skipped OCR. Still tried to create HOCR, got an error.

Test 5: --skip_dsids=OCR,HOCR

Result: Success; skipped attempts to generate OCR and HOCR.
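Tests 4 and 5 suggest that each entry in --skip_dsids is matched against a whole DSID, so OCR does not also cover HOCR. The sketch below illustrates that exact-match behaviour in shell. It is an assumption drawn from the test results, not the module's actual PHP, and `is_skipped` is a hypothetical name.

```shell
#!/bin/sh
# is_skipped DSID LIST
# Succeeds only if DSID appears as a whole entry in the comma-separated LIST.
is_skipped() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# With only OCR in the list, HOCR is still generated (as in Test 4).
for dsid in OCR HOCR JP2; do
  if is_skipped "$dsid" "OCR"; then
    echo "$dsid: skipped"
  else
    echo "$dsid: generated"
  fi
done
```

Wrapping both the list and the DSID in commas is what prevents a substring match such as OCR inside HOCR.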

bondjimbond commented 6 years ago

Final result: Looks good, but I don't know why I was getting OCR errors in the first place. (This is on a pretty fresh Vagrant machine.)

Probably mergeable?

bondjimbond commented 6 years ago

side note: Good timing for this to be working! I started this process on my batch of newspaper pages over two weeks ago when I opened this issue, and today it's still only a fraction of the way through (on object ~27000 of ~81000). I killed the process and am now running this branch without OCR.
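As a rough back-of-the-envelope check on those numbers, the remaining time at the same pace can be estimated as below. It assumes exactly 14 elapsed days and a constant rate, neither of which the thread states precisely ("over two weeks", approximate counts), so treat the result as a lower bound.

```shell
#!/bin/sh
# Assumptions: 14 elapsed days, constant throughput, counts taken as exact.
done_count=27000
total=81000
days_elapsed=14

per_day=$(( done_count / days_elapsed ))                  # integer division
remaining=$(( total - done_count ))
days_left=$(( remaining * days_elapsed / done_count ))

echo "rate: about $per_day objects/day"
echo "remaining: $remaining objects, about $days_left more days"
```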

mjordan commented 6 years ago

Just for kicks, what happens when you try to regenerate all derivatives for a page using the Islandora GUI? Posts like this one suggest that there is a source file missing.

mjordan commented 6 years ago

Newspapers are brutally slow. If you can generate JP2s prior to ingest, things speed up, but OCR/HOCR are the real bottlenecks.

bondjimbond commented 6 years ago

Hmm. Regenerate derivatives via UI for the Newspaper Page gives an Error:

The specified file /tmp/islandora_71_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.

Regenerate derivatives on the Newspaper Issue also results in Error:

The specified file /tmp/islandora_71_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename. The specified file /tmp/islandora_72_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename. The specified file /tmp/islandora_73_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.
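These UI errors look consistent with the Tesseract failure in the earlier drush output: the logged command passes /tmp/islandora_71_OBJ.tif as both input and output base, so the HOCR result would be /tmp/islandora_71_OBJ.tif.hocr, and the copy fails if Tesseract never writes it. A minimal sketch for checking this by hand follows. `hocr_status` is a hypothetical helper, and the temporary TIFF may already have been cleaned up by the time it is run.

```shell
#!/bin/sh
# hocr_status SOURCE
# Reports whether SOURCE exists and whether a non-empty SOURCE.hocr sits
# next to it (the output path implied by the logged Tesseract command).
hocr_status() {
  src=$1
  out="$src.hocr"
  if [ ! -f "$src" ]; then
    echo "missing source: $src"
  elif [ ! -s "$out" ]; then
    echo "missing output: $out"
  else
    echo "ok: $out"
  fi
}

hocr_status /tmp/islandora_71_OBJ.tif
```

A "missing source" result points at the TIFF itself, while "missing output" points at Tesseract failing on a file that is present.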

mjordan commented 6 years ago

@bondjimbond thanks to the detailed testing, I'll merge this.

SFULibrary / islandora_datastream_crud
