bondjimbond closed 6 years ago
Yes, it might be, using an implementation of hook_islandora_derivative_alter(). Let me look a bit deeper.
@bondjimbond can you test the issue-46 branch? Documentation is in the "Triggering derivative generation" section of the readme, but the TL;DR is:
drush islandora_datastream_crud_generate_derivatives --user=admin --source_dsid=OBJ --pid_file=/tmp/issue-46-pids.txt --skip_dsids=TN
In this case, the TN should not be regenerated but the rest of the derivative datastreams should be.
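For anyone repeating this test, here is a minimal sketch of preparing the PID file and composing the command above. It assumes the PID file lists one PID per line; build_generate_cmd is a hypothetical helper for illustration, not part of the module, and the script only prints the command so it can be reviewed before being run by hand.

```shell
#!/bin/sh
# Sketch only: compose the drush command from this thread for a given
# PID file and a comma-separated list of DSIDs to skip.
# build_generate_cmd is a hypothetical helper, not part of the module.
build_generate_cmd() {
  pid_file="$1"
  skip_dsids="$2"
  printf 'drush islandora_datastream_crud_generate_derivatives --user=admin --source_dsid=OBJ --pid_file=%s --skip_dsids=%s\n' "$pid_file" "$skip_dsids"
}

# Assumption: the PID file lists one PID per line.
pid_file=/tmp/issue-46-pids.txt
printf '%s\n' 'islandora:71' > "$pid_file"

# Print the command for review instead of running it.
build_generate_cmd "$pid_file" 'TN'
```

Passing a comma-separated value such as OCR,HOCR as the second argument gives the form used in the later tests in this thread.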
I haven't tested this during other operations like batch ingest to make sure that its limiting of derivative generation doesn't bleed over, but I'll do that over the holidays.
@bondjimbond hold off, I found a problem. Addressing.....
OK, back on track. Test when able.
First test: skip the TN datastream.
Method:
Result:
Please be patient, generating derivatives from the OBJ datastream for islandora:71 [ok]
WD islandora_ocr: Tesseract failed to create an HOCR datastream. [error]
Error: 1
Command: /usr/bin/tesseract /tmp/islandora_71_OBJ.tif /tmp/islandora_71_OBJ.tif -l eng hocr 2>&1
Output: Tesseract Open Source OCR Engine v3.03 with Leptonica
TIFFstream: Sorry, can not handle image.
Error in pixReadFromTiffStream: failed to read tiffdata
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/islandora_71_OBJ.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II cannot be read!
Error during processing.
WD islandora_ocr: Tesseract failed to create OCR datastreams. [error]
Error: 1
Command: /usr/bin/tesseract /tmp/islandora_71_OBJ.tif /tmp/islandora_71_OBJ.tif -l eng 2>&1
Output: Tesseract Open Source OCR Engine v3.03 with Leptonica
TIFFstream: Sorry, can not handle image.
Error in pixReadFromTiffStream: failed to read tiffdata
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading /tmp/islandora_71_OBJ.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II cannot be read!
Error during processing.
Created JP2 derivative. [ok]
Created TECHMD derivative for (islandora:71). [ok]
Failed to generate HOCR from OBJ for islandora:71. [error]
Generated PDF on islandora:71. [ok]
Failed to generate OCR from OBJ for islandora:71. [error]
Created JPG derivative. [ok]
So -- successfully generated the JP2 and JPG datastreams, successfully skipped the TN datastream. But seems to have some trouble with the OCR; I'm not sure if that's because OCR already exists, or not. Will try again after deleting OCR.
Test 2: Deleted JP2, JPG, TN, and OCR. Tried again with --skip_dsids=TN
Result: Same error - failed to generate OCR.
Test 3: Checked out the 7.x branch and ran again. Same OCR problem, so I guess it's nothing to do with the new code.
Test 4: Back to issue-46. --skip_dsids=OCR
Result: Success; skipped OCR. Still tried to create HOCR, got an error.
Test 5: --skip_dsids=OCR,HOCR
Result: Success; skipped attempts to generate OCR and HOCR.
Final result: Looks good, but I don't know why I was getting OCR errors in the first place. (This is on a pretty fresh Vagrant machine.)
Probably mergeable?
side note: Good timing for this to be working! I started this process on my batch of newspaper pages over two weeks ago when I opened this issue, and today it's still only a fraction of the way through (on object ~27000 of ~81000). I killed the process and am now running this branch without OCR.
Just for kicks, what happens when you try to regenerate all derivatives for a page using the Islandora GUI? Posts like this one suggest that a source file is missing.
Newspapers are brutally slow. If you can generate JP2s prior to ingest, things speed up, but OCR/HOCR are the real bottlenecks.
Hmm. Regenerate derivatives via UI for the Newspaper Page gives an Error:
The specified file /tmp/islandora_71_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.
Regenerate derivatives on the Newspaper Issue also results in Error:
The specified file /tmp/islandora_71_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.
The specified file /tmp/islandora_72_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.
The specified file /tmp/islandora_73_OBJ.tif.hocr could not be copied, because no file by that name exists. Please check that you supplied the correct filename.
@bondjimbond thanks to the detailed testing, I'll merge this.
In cases of paged content objects, where default behaviour is to generate OCR, it would be great to have the option to skip the OCR step.
In my case, a batch ingest of newspaper issues failed to create JPEG and TN derivatives for the Newspaper Page objects. OCR went through just fine.
Using this module, I want to generate the image derivatives without regenerating the OCR, as OCR is resource-intensive, takes a long time, and is unnecessary (since OCR is already there).
Would it be possible to add an option to the islandora_datastream_crud_generate_derivatives command that skips generating certain DSIDs? e.g. --skip_dsid=OCR