Closed CGillen closed 7 months ago
It's still kind of bad, but it seems like it's improved for some reason. I'm guessing it's some of the work @wickr did on #2885. I don't think we can really get any better without removing full text hit highlighting, and even then I'm not sure how much better it would be.
I think it seems faster, too
@CGillen @shieldsb
From my test this morning --
I searched for pauling; results were displayed after 82 seconds.

Given these results, I would suggest reopening this ticket for now, pending further conversation.
@petersec @CGillen The link in Chris' comment above is on production. Is there a similar test for staging?
@shieldsb @CGillen Presently, I don't think so. The General Catalogs collection on staging doesn't have any public objects in it.
@petersec @CGillen We are going to deploy to production on Monday, so let's leave this in QA until then and retest it after.
@shieldsb @CGillen I ran the same test today as was run on July 24:
it took 56 seconds to open the collection (using the link at top) in OD2 production
once opened, I conducted a search for pauling; after five minutes, I received a 504 Gateway Time-out message. No search results were returned.
My solution to this will require a SUBSTANTIAL change to the system. We'd have to do a reindex & generate new derivatives for all content with OCR/extracted content. More than that, we'd also lose the ability to hit-highlight the OCR/extracted content within search results. Full text searching would be preserved.
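To illustrate the trade-off in the first option: making the full-text field indexed but not stored in Solr keeps it searchable while dropping the stored content that hit-highlighting requires. A minimal sketch using Solr's Schema API; the field name and type here are assumptions, not the actual schema:

```python
import json

# Sketch: a Solr Schema API "replace-field" payload that keeps the full-text
# field searchable (indexed) but not stored. Without stored content, Solr
# cannot hit-highlight this field, which matches the trade-off described
# above. The field name and type are hypothetical.
def build_replace_field(name):
    return json.dumps({
        "replace-field": {
            "name": name,
            "type": "text_general",
            "indexed": True,   # full-text search still works
            "stored": False,   # highlighting (and document bloat) goes away
        }
    })

payload = build_replace_field("all_text_timv")
```

This payload would be POSTed to the collection's /schema endpoint, after which a full reindex (and, per the above, regenerated derivatives) would be required.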
Alternatively, we could skip OCR/text extraction altogether for these large PDFs and preserve it for the smaller ones. This would be a MUCH smaller change and could be done automatically based on a set length limit. It might mean losing full-text search for the PDFs we skip; more testing would be necessary. Hit-highlighting would be preserved for smaller documents.
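The second option could be as simple as a guard in the derivative pipeline that only queues OCR/text extraction when a PDF is under a configurable page limit. A sketch, where the limit value and function names are assumptions for illustration:

```python
# Sketch of the "skip OCR for large PDFs" option: only run OCR/text
# extraction when a PDF is under a configurable page limit.
# OCR_PAGE_LIMIT is an arbitrary illustrative value, not a tested threshold.
OCR_PAGE_LIMIT = 300

def should_run_ocr(page_count, limit=OCR_PAGE_LIMIT):
    """Return True when the PDF is small enough to OCR/extract text from."""
    return page_count <= limit
```

A 700-page General Catalog would be skipped under this limit, while typical short documents would still get OCR and hit-highlighting.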
Finally, we might look at updating Tesseract and re-running OCR in hopes that it will clean up and shrink the results a little; we could also add some automatic cleanup, and maybe even ask for some manual cleanup. This would preserve both hit-highlighting and full-text search, but might not actually accomplish all that much, especially compared to the first two options.
@jsimic @shieldsb This is going to need a discussion w/ POSM
Note for myself:
this branch (feature/ocrRewrite) is the beginning PoC for rewriting hOCR text as a derivative. It could use a small tweak to make sure hocr_text_tsimv
is still used until the complete reindex/derivative generation finishes, but it's pretty much ready for hOCR content; extracted text would still need to be implemented
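The transition tweak described above amounts to a fallback: prefer the new hOCR-text derivative when it exists, and read the legacy hocr_text_tsimv Solr field otherwise. A sketch, with hypothetical helper names and document shape:

```python
# Sketch of the transition fallback: use the new hOCR-text derivative when
# it has been generated, otherwise fall back to the legacy hocr_text_tsimv
# field until the full reindex/derivative generation completes.
# The function name and document shape are assumptions.
def full_text_for(solr_doc, derivative_text=None):
    if derivative_text:  # new derivative already generated for this fileset
        return derivative_text
    # Legacy indexed field; Solr multi-valued fields come back as lists.
    legacy = solr_doc.get("hocr_text_tsimv")
    if isinstance(legacy, list):
        return "\n".join(legacy)
    return legacy
```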
I ran the timing test again this morning, and the results are definitely better:
I searched for pauling; results were displayed after 87 seconds. This time is similar to what was recorded on July 24. Still not ideal, but certainly better than the August 9 test (5-minute Gateway timeout).

QA: This does not improve speed yet. That will come in the next PR. When QA'd, move back to in-progress.
Check several PDF works:
QA pass. Site-wide search and UV search are working as expected on Staging.
@shieldsb - can you move this ticket back to In Progress, per @CGillen's note?
The sidekiq job finished on staging. The comment on April 24th can be used as QA for this. This looks like it should be good to go if those requirements pass.
Side note: searching in Solr took 100ms rather than 5 seconds per item, so it's a markedly faster search.
It took 1.1 minutes to search for an in-text match in a 700-page PDF. The time was the same for 1-word and 3-word searches.
QA pass. Searches were consistent with @straleyb's comments above across large PDFs.
Descriptive summary
Search results for works with large (h)OCR/extracted text load unacceptably slowly. This is best seen when searching the OSU General Catalog collection. These works are dense articles with lots of text per page and hundreds of pages.
https://oregondigital.org/catalog?f%5Bnon_user_collections_ssim%5D%5B%5D=general-catalogs&locale=en&search_field=all_fields
These searches can take a minute or more to load. This is because FileSet documents take 1-5 seconds to load for each work returned.
This could probably be fixed by preventing the all-text field from being returned unless we actually need it. Alternatively, I'd be OK with just getting the search results to speed up. We could look at why Solr docs are pulled for FileSets on searches and see if there's a way to work around it.
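"Preventing the all-text field from being returned" could be done with Solr's fl (field list) parameter, so the heavy OCR/extracted-text field is never shipped back with ordinary search results. A sketch of building such query parameters; the field names are hypothetical:

```python
# Sketch: request an explicit field list (fl) from Solr so the heavy
# OCR/extracted-text field is excluded from search results unless it is
# actually needed (e.g. for hit-highlighting). Field names are hypothetical.
def search_params(query, include_full_text=False):
    fields = ["id", "title_tesim", "score"]
    if include_full_text:
        fields.append("all_text_timv")  # only fetch the heavy field on demand
    return {"q": query, "fl": ",".join(fields)}
```

Because Solr only serializes the fields listed in fl, a result page of works with hundreds of pages of OCR would no longer pay the 1-5 second per-document transfer cost.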
Expected behavior
Solr documents for FileSets with large text content consistently load in under 1 second.
Related work
https://issues.apache.org/jira/browse/SOLR-3191 is an issue for excluding a field from Solr search results, but it appears abandoned.
Accessibility Concerns
N/A