Test OpenSearch with EZID's data scale

sfisher commented 7 months ago

The performance of OpenSearch may be different with the larger number of records in EZID (33 million?) vs a smaller test set for development.

I believe the stage server has more records for testing at scale. This ticket needs more fleshing out, but . . .

Is search responsive?
How long does it take to regenerate search data using the scripts?
Does it work as expected at scale?

adambuttrick commented 3 months ago

Fleshing out more complete criteria:

Test Type

Performance and Functional

Describe the functionality to be tested

We need to verify that OpenSearch is responsive and functions correctly when handling EZID's full record set.

Describe the test scenario

Perform a series of search queries of varying complexity on the full EZID dataset:

Test keyword parameters

keywords={single word}

Tested "California": OpenSearch on stage returns in a couple of seconds, while production (database search) gets a gateway timeout.

Tested "Cat": stage returns in about a second, production in about 6 seconds. About 7000 results (stage has fewer records).

keywords={space separated multiple words}

Tested "cat parasite": OpenSearch returned 6 results and the database returned 9. However, 2 of those results were created after the stage database had its records transferred. The one remaining result that existed on stage but was not returned by OpenSearch contained words like "category" and "cat.1", but not "cat" by itself. This appears to be a difference between how OpenSearch tokenizes words and how the database search works, so we think this is acceptable.
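
A minimal sketch of this kind of mismatch follows. It assumes a plain whitespace tokenizer, which is only an illustration and not necessarily the analyzer configured for EZID's index, and it uses a substring comparison as a stand-in for whatever matching the database search does. The sample text is made up.

```python
def substring_match(term: str, text: str) -> bool:
    """Substring-style match: 'cat' is found inside 'category'."""
    return term.lower() in text.lower()


def token_match(term: str, text: str) -> bool:
    """Token-style match: the term must equal a whole whitespace-separated token."""
    return term.lower() in text.lower().split()


# Hypothetical record text containing "category" and "cat.1" but not "cat" alone.
sample = "Parasite category notes, sample cat.1"

print(substring_match("cat", sample))   # True
print(token_match("cat", sample))       # False
print(token_match("parasite", sample))  # True
```

Under these assumptions the record matches "parasite" but not "cat", which is consistent with it dropping out of an AND search for both words.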

keywords={test AND and OR queries}

AND queries are the default and work fine.

TODO: It seems the OR search doesn't currently work; this is something to fix.

keywords={common keywords/stop words}

OpenSearch allows searches on, and has no problem with, the words the database search currently treats as stop words, such as "and", "ark", "doi" and others. It returns a lot of results, but returns them almost instantly.

Having this many results may not be particularly useful, but it doesn't cause performance degradation or other problems, so we see no compelling reason to re-implement the stop words used for the database search.

Test identifier parameter:

identifier={specific_identifier}
identifier={partial identifier}

Search by identifier and prefix identifier worked fine in every case we tried.

When searching by shoulder, OS returned more than 10,000 results and gave a higher count.

TODO: OS is configured to stop at 10,000 results, so when a search matches more than that, paging beyond the 10,000th result doesn't work. We could also cap the paging at this limit, if we wanted.

We didn't think it was generally necessary to return more results than 10,000 from an individual query.
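
One way the paging cap mentioned in the TODO could look is sketched below. This is not EZID's code: the function names are hypothetical, and the cap is assumed to be the 10,000-result limit described above.

```python
import math

MAX_RESULT_WINDOW = 10_000  # assumed cap, matching the 10,000 limit described above


def last_reachable_page(total_hits: int, page_size: int,
                        max_window: int = MAX_RESULT_WINDOW) -> int:
    """Return the last 1-based page number that stays within the result window."""
    reachable_hits = min(total_hits, max_window)
    return max(1, math.ceil(reachable_hits / page_size))


def clamp_page(requested: int, total_hits: int, page_size: int) -> int:
    """Clamp a requested page number into the range [1, last reachable page]."""
    return min(max(requested, 1), last_reachable_page(total_hits, page_size))


print(clamp_page(5000, 15194052, 10))  # 1000
```

With a page size of 10, a shoulder search matching millions of records would then stop at page 1000 instead of offering pages that cannot be served.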

Test title parameter

title={exact_title}

Tested with a long title and both systems returned only one result (sorry, forgot to record the title we put in).

title={partial_title}

Tested with "Carbohydrates in an acidic multivalent assembly" and also only got 1 result in both systems.

Very similar results when searching for "Carbohydrate" in title

Test creator parameter:

creator={exact_creator_name}
creator={partial_creator_name}

Tested "Tracy Seneca" and "Seneca, Tracy" and just "Seneca" and all returned the same number of results (for the full name).

Test publisher parameter

publisher={exact_publisher_name}
publisher={partial_publisher_name}

We tested "Nature" and "Springer Nature" and got similar results, although there were differences between the production and stage databases. We spot-checked some records that were missing and they didn't exist in the stage database, so this looks good.

Test pubyear_from and pubyear_to parameters

pubyear_from={year}
pubyear_to={year}
pubyear_from={start_year}&pubyear_to={end_year}

pubyear=2020 (and Open Context publisher to limit to reasonable results size) -- Gave 29 results on both systems.

We went through the result set for "resourcePublisher:(Open Context) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2021" and checked the items missing on stage. They didn't exist in the stage database either, so the results are functionally equivalent. Some sorting of foreign characters was different between the two systems.
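
For repeatable test runs, query strings in this displayed syntax can be assembled with a small helper. The sketch below is illustrative only: `build_query` is a hypothetical name, and it covers just the publisher and year fields used in this test.

```python
def build_query(publisher=None, year_from=None, year_to=None):
    """Join the supplied filters with AND, using the field names shown in the search UI."""
    clauses = []
    if publisher:
        clauses.append(f"resourcePublisher:({publisher})")
    if year_from is not None:
        clauses.append(f"resourcePublicationYear:>={year_from}")
    if year_to is not None:
        clauses.append(f"resourcePublicationYear:<={year_to}")
    return " AND ".join(clauses)


print(build_query("Open Context", 2020, 2021))
```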

Test object_type parameter

object_type={specific_type}

We tested resourceType (object type) with this query: resourceType:(ConferencePaper) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2024

The current database search returns additional records where resource_type is set to "ConferencePaper", rather than only records where resource_type_general is set to it.

TODO: We can add a way to do an OR for that in either field. The current system uses ResourceType.General only.
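
A rough sketch of what such an OR could look like as an OpenSearch query body, using a `bool` query with `should` clauses. The index field names here are assumptions borrowed from the database column names mentioned above; the real index mapping may name them differently.

```python
def either_field_query(value, fields=("resource_type", "resource_type_general")):
    """Build a query body that matches `value` in any of the given fields.

    The default field names are hypothetical and may not match the real index.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: value}} for field in fields],
                "minimum_should_match": 1,  # at least one field must match
            }
        }
    }


print(either_field_query("ConferencePaper"))
```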

Test id_type parameter:

id_type={ark}
id_type={doi}
id_type={uuid} -- we do not have UUIDs

DOIs had a matching number of results for 2014.

ARKs were more complicated between the two systems.

Our query: identifierType:(ark) AND resourceTitle:(Library) AND resourcePublicationYear:2014

Two titles were missing from stage, "Qualitative" and "Working from a temporary Salon", and production was missing the "Greening the mothership" record.

I'm able to find the "salon" one by searching for "Library's" instead of "library". OpenSearch seems to treat the possessive form of the word as a different token, unlike the database search.

Greening the mothership -- stage has in search index but maybe a problem with the search table?

Qualitative -- https://ezid.cdlib.org/id/ark:/13030/qt2jj6p63d

We believe some of this was caused by some newline or problem characters in the "metadata" field in the database and some items may need to have it massaged or fixed for records to show correctly (also on the item record page where it shows metadata).

Jing may be able to give more insight to these cases, but it doesn't seem as though it's a large systemic problem with search.

Test complex queries combining multiple parameters

We ran some of the complex queries along the way just to get result sets that were manageable. We didn't record query times in all cases, but found that OpenSearch returned in about a second or less, even for queries with a huge number of results. The database search often took 3-5 seconds for queries with only a small number of results. Larger sets (over about 5,000 or 10,000 results) often took more than a minute, timed out or returned gateway errors, and caused large amounts of disk swapping and reduced memory headroom on the RDS server.

 ```
 title={title}&creator={creator}&pubyear_from={year}&pubyear_to={year}&object_type={type}&id_type={type}
 ```

For each query:
- [ ] Record response times
- [ ] Compare response times to comparable searches in existing DB-based search, excluding known cases where query would degrade the service
- [ ] Verify accuracy and completeness of results relative to the same DB-based search
- [ ] Verify proper handling of pagination and large result sets
- [ ] Test concurrency with multiple simultaneous searches
Conduct tests under load:
- [ ] Simulate peak usage conditions (to the extent possible)
- [ ] Monitor system resources
Test edge cases:
- [x] Queries returning very large result sets
- [x] Queries with no results
- [ ] Malformed or invalid queries
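
The response-time and concurrency items in the checklist above could be scripted along these lines. This is a sketch under assumptions: `fake_search` is a hypothetical stand-in for a real HTTP call to the search endpoint, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(search, query):
    """Run search(query) and return (query, elapsed_seconds, result)."""
    start = time.perf_counter()
    result = search(query)
    return query, time.perf_counter() - start, result


def run_concurrently(search, queries, workers=4):
    """Issue all queries through a thread pool and collect timings in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: timed(search, q), queries))


def fake_search(query):
    """Stand-in for a real request to the search endpoint (hypothetical)."""
    time.sleep(0.01)
    return len(query)


for query, seconds, _ in run_concurrently(fake_search, ["cat", "California"]):
    print(f"{query}: {seconds:.3f}s")
```

Swapping `fake_search` for a function that calls stage or production would give comparable timings for the same query list on both systems.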

Expected outcome

- [ ] All search queries return results with expected/acceptable response times
- [ ] No significant degradation compared to DB-based search
- [ ] All search functionality works correctly with the full dataset
- [ ] Performance remains stable under load and with complex queries
- [ ] Search results are accurate and complete for all query types
- [ ] No unexpected behaviors or errors

jsjiang commented 3 months ago

@adambuttrick Hi Adam, the search parameters listed in the test scenarios are for the batch download API. The parameters for the search request include:

keywords
identifier
title
creator
publisher
pubyear_from
pubyear_to
object_type
id_type

Here is a sample query:

search?filtered=t&keywords=&identifier=doi%3A10.7270%2FQ2&title=&creator=&publisher=&pubyear_from=&pubyear_to=&object_type=&id_type=

There is no change to the batch download API.
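
For scripted testing, the sample query above can be reproduced with the standard library. The helper name `build_search_url` is hypothetical; the parameter names and the `filtered=t` flag are taken from the sample query.

```python
from urllib.parse import urlencode

# Search parameters in the order they appear in the sample query.
SEARCH_FIELDS = ["keywords", "identifier", "title", "creator", "publisher",
                 "pubyear_from", "pubyear_to", "object_type", "id_type"]


def build_search_url(**values):
    """Build a relative search URL; parameters not supplied are sent as empty strings."""
    params = {"filtered": "t"}
    for field in SEARCH_FIELDS:
        params[field] = values.get(field, "")
    return "search?" + urlencode(params)


print(build_search_url(identifier="doi:10.7270/Q2"))
```

`urlencode` percent-encodes the colon and slash in the identifier, so the output matches the sample query string.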

adambuttrick commented 3 months ago

@jsjiang Thanks for catching this! I've revised. Let me know if all looks good.

jsjiang commented 3 months ago

Just added "Test keywords parameter" to the list.

adambuttrick commented 3 months ago

Requirement on https://github.com/CDLUC3/ezid/issues/653

jsjiang commented 3 months ago

Jing's test results:

Aug 14, ezid-stg on commit f79af50b8a7c996d8e541a5f8ddf05ba85d7c845

Public search interface (without login)

Keywords search:
- keywords:(California Digital Library): returned 558 Search Results in 56 pages each with 10 entries;
- page size change: changed to 50, 100 and worked OK
- page navigation in page range: worked OK
- page navigation out of page range:
  - <1: changed to 1 after a warning
  - last page: changed to last page after warning
- keywords:(California AND Digital AND Library): same results as above.
- keywords:(California OR Digital OR Library): got Page unavailable error
- keywords:(Awesome OR Cat): returned 6998 Search Results
search identifier
- identifier:(ark:/88122/sqwj0083): returned one record
- identifier:(ark:/88122): returned 15194052 Search Results; UI only shows the first 10K records.
- page navigation out of page range:
  - <1: changed to 1 after a warning
  - last page: got Page unavailable error
resourceTitle:(CAMEL Tobacco): returned 6629 records
resourceCreator:(Reynolds): returned 1692244 Search Results
resourcePublisher:(nature): returned 55 Search Results
resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2022: returned 388433 Search Results
resourceType:(Journal): returned 22450 Search Results
identifierType:(doi): returned 319044 Search Results
identifierType:(ark): 23907595 Search Results
identifierType:(ark) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2023: 331737 Search Results
identifierType:(ark) AND resourceType:(Collection) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2023: 5547 Search Results
identifierType:(ark) AND resourceCreator:(Evers) AND resourceType:(Collection) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=202: 100 results
identifierType:(ark) AND resourceTitle:(body) AND resourceCreator:(Evers) AND resourceType:(Collection) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2023: 11 results
keywords:(California) AND identifierType:(ark) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2023: 1690 search results
Customize view: worked OK

jsjiang commented 3 months ago

Jing's test results:

Aug 15, ezid-stg on commit f79af50b8a7c996d8e541a5f8ddf05ba85d7c845 Public search interface (with login as apitest):

Manage IDs page: returned 98 Identifiers
Download all: created downloadable file S0VxH9CbVcEOZAuq.zip. This file contains 166 identifiers. Note: The number of identifiers in the report does not match what the Manage IDs page shows (166 vs 98)
- select * from ezidapp_searchidentifier where owner_id = 2; returned 166 records
- search OpenSearch index ezid-app-search-stg by owner=2 returned 98 records
View identifier: got "No such identifier: doi:10.31223/FK3TEST_1." error. Note: test identifiers are deleted regularly from the Search Identifier table through the Delete action of the SearchIndexerQueue
This is most likely a data issue. This identifier is in the search identifier table but not in the identifier and refidentifier tables.

jsjiang commented 3 months ago

Here is the full list of identifiers that showed in the download all report but not on the UI for apitest account:

ark:/99999/fk4184tn6s
ark:/99999/fk4281vj6w
ark:/99999/fk43b7kv04
ark:/99999/fk43j50m34
ark:/99999/fk4515jf90
ark:/99999/fk4612kc1f
ark:/99999/fk4795qd8z
ark:/99999/fk48s6883p
ark:/99999/fk49s39549
ark:/99999/fk4c26f73h
ark:/99999/fk4dj7029f
ark:/99999/fk4fj40z93
ark:/99999/fk4fn2qb94
ark:/99999/fk4j97pw5n
ark:/99999/fk4k94qs26
ark:/99999/fk4kd3f535
ark:/99999/fk4n31cs6z
ark:/99999/fk4p28dp8v
ark:/99999/fk4pz6r64j
ark:/99999/fk4q544z8z
ark:/99999/fk4rv23m15
ark:/99999/fk4sr0f39x
ark:/99999/fk4tx4vs69
ark:/99999/fk4wh4412b
ark:/99999/fk4zp5kk9h
doi:10.15697/FK2107G
doi:10.15697/FK2207T
doi:10.15697/FK24P9X
doi:10.15697/FK25Q0X
doi:10.15697/FK28H2H
doi:10.15697/FK29H2V
doi:10.15697/FK2D65Q
doi:10.15697/FK2F652
doi:10.15697/FK2J07N
doi:10.15697/FK2K06K
doi:10.15697/FK2NQ0R
doi:10.15697/FK2PQ03
doi:10.15697/FK2RH2B
doi:10.15697/FK2SH33
doi:10.15697/FK2W948
doi:10.15697/FK2X65W
doi:10.5072/FK2086D79C
doi:10.5072/FK2154QR91
doi:10.5072/FK21G0TX60
doi:10.5072/FK2320355K
doi:10.5072/FK23R10W7G1
doi:10.5072/FK23X8DP3M
doi:10.5072/FK24X5FK26
doi:10.5072/FK2571JR1J
doi:10.5072/FK26Q23K6Z
doi:10.5072/FK27P94G7B
doi:10.5072/FK29028J58
doi:10.5072/FK2BG2TD13
doi:10.5072/FK2G73J649
doi:10.5072/FK2H70K33W
doi:10.5072/FK2KH0R22V
doi:10.5072/FK2M04809J
doi:10.5072/FK2N018W96
doi:10.5072/FK2N87D27W
doi:10.5072/FK2Q52RG43
doi:10.5072/FK2QR4ZT3C
doi:10.5072/FK2RR20Q1V
doi:10.5072/FK2S183W1G
doi:10.5072/FK2V69KF5J
doi:10.5072/FK2VH5PM6K
doi:10.5072/FK2WH2QH6P
doi:10.5072/FK2WS8TP4X
doi:10.5072/FK2Z89CJ11
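
The 68 identifiers listed above account for the gap between the report and the UI (166 vs 98). A list like this can be produced with a set difference; the sketch below uses a hypothetical helper name and a tiny sample, with two identifiers from the list above and one made-up identifier standing in for a record present in both places.

```python
def missing_from_index(table_ids, index_ids):
    """Return identifiers present in the table export but absent from the index, sorted."""
    return sorted(set(table_ids) - set(index_ids))


# Small sample: the last entry is made up for illustration.
table_ids = ["ark:/99999/fk4184tn6s", "doi:10.15697/FK2107G", "ark:/99999/fk4example"]
index_ids = ["ark:/99999/fk4example"]

print(missing_from_index(table_ids, index_ids))
# ['ark:/99999/fk4184tn6s', 'doi:10.15697/FK2107G']
```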

adambuttrick commented 2 months ago

Could we also login to the stage as admin and test with some of the data in the user account views (i.e. that do not contain test identifiers) to verify?

jsjiang commented 2 months ago

Testing UI with login merritt:

Manage IDs page: 3,116,205
Download all: https://ezid-stg.cdlib.org/s3_download/B7XxQFxqXe6ElG45.zip
- It took a long time for EZID to generate this file
- the report contains 3116206 lines, including a header line
RDS: select count from ezidapp_identifier and ezidapp_searchidentifier for merritt (owner_id=124) both returned 3,116,205

 wc -l B7XxQFxqXe6ElG45.csv
 3116206 B7XxQFxqXe6ElG45.csv
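
The same check can be done in Python by counting rows after the header. This is a sketch with a hypothetical helper name and made-up column names. Unlike `wc -l`, the `csv` module treats a quoted field containing a newline as part of one row, so the two counts could differ if the report has such fields.

```python
import csv
import io


def count_records(csv_file):
    """Count data rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header, if there is one
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for the downloaded report (made-up columns).
sample = io.StringIO("_id,_owner\nark:/99999/fk4184tn6s,apitest\ndoi:10.15697/FK2107G,apitest\n")
print(count_records(sample))  # 2
```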

jsjiang commented 2 months ago

Checked record counts on 8/16:

Stage:

Record count in the ezidapp_identifier table for owner=2: 92.
Record count in the ezidapp_searchidentifier table for owner=2: 158.

Performed the same queries on PRD. There, the record counts in the ezidapp_identifier and ezidapp_searchidentifier tables match for both owner=2 (apitest) and owner=124 (merritt):

# 4084152
select count(id) from ezidapp_identifier where owner_id = 124 ;
# 4084152
select count(id) from ezidapp_searchidentifier where owner_id = 124;
# 1860
select count(id) from ezidapp_identifier where owner_id = 2 ;
# 1860
select count(id) from ezidapp_searchidentifier where owner_id = 2;

So the differences on stage are caused by inconsistent data in the stage environment.
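
A check like this could be automated per owner. The sketch below runs the same count queries against an in-memory SQLite database as a stand-in for the real RDS database; the table and column names come from the queries above, while the helper name and the rows are made up.

```python
import sqlite3

TABLES = ("ezidapp_identifier", "ezidapp_searchidentifier")


def count_mismatches(conn, owner_ids):
    """Return {owner_id: (identifier_count, searchidentifier_count)} where the counts differ."""
    mismatches = {}
    for owner_id in owner_ids:
        counts = tuple(
            conn.execute(
                f"select count(id) from {table} where owner_id = ?", (owner_id,)
            ).fetchone()[0]
            for table in TABLES
        )
        if counts[0] != counts[1]:
            mismatches[owner_id] = counts
    return mismatches


# In-memory stand-in for the real database, with made-up rows.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"create table {table} (id integer primary key, owner_id integer)")
conn.executemany("insert into ezidapp_identifier (owner_id) values (?)", [(2,), (124,)])
conn.executemany("insert into ezidapp_searchidentifier (owner_id) values (?)",
                 [(2,), (2,), (124,)])

print(count_mismatches(conn, [2, 124]))  # {2: (1, 2)}
```

An empty result would correspond to the PRD situation above, where both tables agree for each owner.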

jsjiang commented 2 months ago

Jing retested the following items on 8/23 on ezid-stg with commit ed6641b76b12c413e2520f650dfe1d7a8df8b885:

keywords:(California OR Digital OR Library): returned 1222514 results without error - this is expected
Out of page range navigation: shows a warning, then redirects to the first or last page - this is expected.

adambuttrick commented 2 months ago

Tests passed as part of release https://github.com/CDLUC3/ezid/releases/tag/v3.2.19
