Closed CGillen closed 7 months ago
It's still kind of bad, but it seems like it's improved for some reason. I'm guessing it's some of the work @wickr did on #2885. I don't think we can really get any better without removing full text hit highlighting, and even then I'm not sure how much better it would be.
I think it seems faster, too
@CGillen @shieldsb
From my test this morning --
I searched for pauling; results were displayed after 82 seconds.

Given these results, I would suggest reopening this ticket for now, pending further conversation.
@petersec @CGillen The link in Chris' comment above is on production. Is there a similar test for staging?
@shieldsb @CGillen Presently, I don't think so. The General Catalogs collection on staging doesn't have any public objects in it.
@petersec @CGillen We are going to deploy to production on Monday, so let's leave this in QA until then and retest it after.
@shieldsb @CGillen I ran the same test today as was run on July 24:
it took 56 seconds to open the collection (using the link at top) in OD2 production
once opened, I conducted a search for pauling; after five minutes, I received a 504 Gateway Time-out message. No search results were returned.
My solution to this will require a SUBSTANTIAL change to the system. We'd have to do a reindex & generate new derivatives for all content with OCR/extracted content. More than that, we'd also lose the ability to hit-highlight the OCR/extracted content within search results. Full text searching would be preserved.
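To illustrate the trade-off in the first option: making the full-text field indexed but not stored in Solr keeps it searchable while dropping the stored content that hit-highlighting requires. A minimal sketch using Solr's Schema API; the field name and type here are assumptions, not the actual schema:

```python
import json

# Sketch: a Solr Schema API "replace-field" payload that keeps the full-text
# field searchable (indexed) but not stored. Without stored content, Solr
# cannot hit-highlight this field, which matches the trade-off described
# above. The field name and type are hypothetical.
def build_replace_field(name):
    return json.dumps({
        "replace-field": {
            "name": name,
            "type": "text_general",
            "indexed": True,   # full-text search still works
            "stored": False,   # highlighting (and document bloat) goes away
        }
    })

payload = build_replace_field("all_text_timv")
```

This payload would be POSTed to the collection's /schema endpoint, after which a full reindex (and, per the above, regenerated derivatives) would be required.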
Alternatively, we could skip OCR/text extraction altogether for these large PDFs and preserve it for the smaller ones. This would be a MUCH smaller change and could be done automatically based on a set length limit. It might mean losing full-text search for the PDFs we skip; more testing would be necessary. Hit-highlighting would be preserved for smaller documents.
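The second option could be as simple as a guard in the derivative pipeline that only queues OCR/text extraction when a PDF is under a configurable page limit. A sketch, where the limit value and function names are assumptions for illustration:

```python
# Sketch of the "skip OCR for large PDFs" option: only run OCR/text
# extraction when a PDF is under a configurable page limit.
# OCR_PAGE_LIMIT is an arbitrary illustrative value, not a tested threshold.
OCR_PAGE_LIMIT = 300

def should_run_ocr(page_count, limit=OCR_PAGE_LIMIT):
    """Return True when the PDF is small enough to OCR/extract text from."""
    return page_count <= limit
```

A 700-page General Catalog would be skipped under this limit, while typical short documents would still get OCR and hit-highlighting.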
Finally, we might look at updating Tesseract and re-running OCR in hopes that it will clean up and shrink the results a little; we could also add some automatic cleanup, and maybe even ask for some manual cleanup. This would preserve both hit-highlighting and full-text search, but might not actually accomplish all that much, especially compared to the first two options.
@jsimic @shieldsb This is going to need a discussion w/ POSM
Note for myself:
this branch (feature/ocrRewrite) is the beginning PoC for rewriting hOCR text as a derivative. It could use a small tweak to make sure hocr_text_tsimv
is still used until the complete reindex/derivative generation finishes, but it's pretty much ready for hOCR content; extracted text would still need to be implemented
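The transition tweak described above amounts to a fallback: prefer the new hOCR-text derivative when it exists, and read the legacy hocr_text_tsimv Solr field otherwise. A sketch, with hypothetical helper names and document shape:

```python
# Sketch of the transition fallback: use the new hOCR-text derivative when
# it has been generated, otherwise fall back to the legacy hocr_text_tsimv
# field until the full reindex/derivative generation completes.
# The function name and document shape are assumptions.
def full_text_for(solr_doc, derivative_text=None):
    if derivative_text:  # new derivative already generated for this fileset
        return derivative_text
    # Legacy indexed field; Solr multi-valued fields come back as lists.
    legacy = solr_doc.get("hocr_text_tsimv")
    if isinstance(legacy, list):
        return "\n".join(legacy)
    return legacy
```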
I ran the timing test again this morning, and the results are definitely better:
I searched for pauling; results were displayed after 87 seconds. This time is similar to what was recorded on July 24. Still not ideal, but certainly better than the August 9 test (5-minute Gateway timeout).

QA: This does not improve speed yet. That will come in the next PR. When QA'd, move back to in-progress.
Check several PDF works:
QA pass. Site-wide search and UV search are working as expected on Staging.
@shieldsb - can you move this ticket back to In Progress, per @CGillen's note?
The sidekiq job finished on staging. The comment on April 24th can be used as QA for this. This looks like it should be good to go if those requirements pass.
Side note: searching in Solr took 100ms rather than 5 seconds per item, so it's a markedly faster search.
It took 1.1 minutes to search for an in-text match in a 700-page PDF. The time was the same for 1-word and 3-word searches.
QA pass. Searches were consistent with @straleyb's comments above across large PDFs.
Descriptive summary
Search results for works with large (h)OCR/extracted text load unacceptably slowly. This is best seen when searching the OSU General Catalog collection. These works are dense articles with lots of text per page and hundreds of pages.
https://oregondigital.org/catalog?f%5Bnon_user_collections_ssim%5D%5B%5D=general-catalogs&locale=en&search_field=all_fields
These searches can take a minute or more to load. This is because FileSet documents take 1-5 seconds to load for each work returned.
This could probably be fixed by preventing the all-text field from being returned unless we actually need it. Alternatively, I'd be OK with just getting the search results to speed up. We could look at why Solr docs are pulled for FileSets on searches and see if there's a way to work around it.
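"Preventing the all-text field from being returned" could be done with Solr's fl (field list) parameter, so the heavy OCR/extracted-text field is never shipped back with ordinary search results. A sketch of building such query parameters; the field names are hypothetical:

```python
# Sketch: request an explicit field list (fl) from Solr so the heavy
# OCR/extracted-text field is excluded from search results unless it is
# actually needed (e.g. for hit-highlighting). Field names are hypothetical.
def search_params(query, include_full_text=False):
    fields = ["id", "title_tesim", "score"]
    if include_full_text:
        fields.append("all_text_timv")  # only fetch the heavy field on demand
    return {"q": query, "fl": ",".join(fields)}
```

Because Solr only serializes the fields listed in fl, a result page of works with hundreds of pages of OCR would no longer pay the 1-5 second per-document transfer cost.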
Expected behavior
Solr documents for FileSets with large text content consistently load in under 1 second.
Related work
https://issues.apache.org/jira/browse/SOLR-3191 is an issue for excluding a field from Solr search results, but it appears abandoned.
Accessibility Concerns
N/A