CDLUC3 / ezid


Test OpenSearch with EZID's data scale #595

Closed sfisher closed 2 months ago

sfisher commented 7 months ago

OpenSearch performance may differ with EZID's full record set (33 million?) compared to the smaller test set used for development.

I believe the stage server has more records for testing scale. This ticket needs more fleshing out, but . . .

adambuttrick commented 3 months ago

Fleshing out more complete criteria:

Test Type

Performance and Functional

Describe the functionality to be tested

We need to verify that OpenSearch is responsive and functions correctly when handling EZID's full record set.

Describe the test scenario

  1. Perform a series of search queries of varying complexity on the full EZID dataset:

Test keyword parameters

keywords={single word}

Tested "California": OpenSearch on stage returns in a couple of seconds; the production database search returns a Gateway Timeout.

Tested "Cat" -- stage returns in about a second, production in about 6 seconds, with roughly 7,000 results (stage has fewer records).

keywords={space separated multiple words}

Tested "cat parasite": OpenSearch returned 6 results and the database 9. However, 2 of the database results were created after the point when records were transferred to the stage database. The one result that existed but didn't return in OpenSearch contained words like "category" and "cat.1", but not "cat" by itself. This appears to be a difference in how OpenSearch tokenizes words compared to how the database search works, so we think this is acceptable.
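The discrepancy above comes down to token matching versus substring matching. A rough sketch (this simplified tokenizer is an approximation for illustration, not OpenSearch's exact standard analyzer):

```python
import re

def standard_tokens(text):
    """Rough approximation of a word tokenizer: lowercase,
    then split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def token_match(text, term):
    # Token-based search: the term must appear as a whole token.
    return term.lower() in standard_tokens(text)

def substring_match(text, term):
    # Database LIKE '%term%' style: any substring hit counts.
    return term.lower() in text.lower()

doc = "New approaches to category management"
print(token_match(doc, "cat"))      # False: "category" is a single token
print(substring_match(doc, "cat"))  # True: "cat" is a substring of "category"
```

A substring search finds "cat" inside "category", while a token-based engine only matches whole tokens, which is why the record returned from the database but not from OpenSearch.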

keywords={test AND and OR queries}

AND queries are the default and work fine.

TODO: The OR search does not appear to work currently; this needs to be fixed.
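One way to support both modes is an OpenSearch bool query that switches between must and should clauses. A minimal sketch (the field name "keywords" is a placeholder, not EZID's actual index mapping):

```python
def build_keyword_query(terms, operator="AND"):
    """Build an OpenSearch query body with AND or OR keyword semantics."""
    clauses = [{"match": {"keywords": t}} for t in terms]
    if operator == "OR":
        # "should" plus minimum_should_match=1 gives OR semantics.
        return {"query": {"bool": {"should": clauses,
                                   "minimum_should_match": 1}}}
    # Default AND: every term must match.
    return {"query": {"bool": {"must": clauses}}}

or_query = build_keyword_query(["cat", "parasite"], operator="OR")
```

The same clause list is reused for both modes; only the bool occurrence type changes.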

keywords={common keywords/stop words}

OpenSearch handles searches for the database's current stop words ("and", "ark", "doi", and others) without a problem. Such a query just returns a lot of results, but it returns almost instantly.

Having this many results may not be particularly useful, but it doesn't cause performance degradation or other problems, so we see no compelling reason to re-implement the stop words used for the database search.

Test identifier parameter:

identifier={specific_identifier}
identifier={partial identifier}

Search by full identifier and by identifier prefix worked fine in every case we tried.

When searching by shoulder, OpenSearch returned more than 10,000 results and reported a higher total count.

TODO: When a search matches more than 10,000 results, OpenSearch is configured to stop at that number, so paging beyond it doesn't work. We could add a matching limit to the pagination if we wanted.

We didn't think it was generally necessary for an individual query to return more than 10,000 results.
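If we add a pagination cap, the arithmetic is simple: from + size must stay within the window. A sketch, assuming OpenSearch's default index.max_result_window of 10,000 and 1-based page numbers:

```python
MAX_RESULT_WINDOW = 10_000  # OpenSearch's default index.max_result_window

def clamp_page(page, page_size):
    """Cap a 1-based page number so (page-1)*page_size + page_size
    never exceeds the result window."""
    last_allowed = max(1, MAX_RESULT_WINDOW // page_size)
    return min(page, last_allowed)

print(clamp_page(3, 100))    # 3: within the window
print(clamp_page(500, 100))  # 100: clamped to the last reachable page
```

For genuinely deeper paging, OpenSearch's search_after mechanism avoids the window limit entirely, but given the conclusion above that probably isn't needed.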

Test title parameter

title={exact_title}

Tested with a long title, and both systems returned only one result (sorry, we forgot to record the title we used).

title={partial_title}

Tested with "Carbohydrates in an acidic multivalent assembly" and also only got 1 result in both systems.

We got very similar results when searching for "Carbohydrate" in the title.

Test creator parameter:

creator={exact_creator_name}
creator={partial_creator_name}

Tested "Tracy Seneca" and "Seneca, Tracy" and just "Seneca" and all returned the same number of results (for the full name).

Test publisher parameter

publisher={exact_publisher_name}
publisher={partial_publisher_name}

We tested "Nature" and "Springer Nature" and got similar results, although there were differences between the production and stage databases. We spot-checked some missing records and they didn't exist in the stage database, so this looks good.

Test pubyear_from and pubyear_to parameters

pubyear_from={year}
pubyear_to={year}
pubyear_from={start_year}&pubyear_to={end_year}

pubyear=2020 (with publisher Open Context to keep the result size reasonable) gave 29 results on both systems.

We went through the result set for "resourcePublisher:(Open Context) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2021" and checked the items missing from stage; they didn't exist in the stage database either, so the results are functionally equivalent. Sorting of some foreign characters differed between the two systems.

Test object_type parameter

object_type={specific_type}

We tested resourceType (object type) with this query: resourceType:(ConferencePaper) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2024

The current database search returns additional records where resource_type is set to 'ConferencePaper', rather than only those where resource_type_general is set to it.

TODO: We could add a way to OR across either field. The current system uses ResourceType.General only.
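That OR could be expressed as a bool/should clause over both fields. A hypothetical sketch (the field names resourceTypeGeneral and resourceType mirror the queries above, but the exact index mapping is an assumption):

```python
def object_type_filter(object_type):
    """Match the requested object type in EITHER the general or the
    specific resource-type field (field names are assumptions)."""
    return {"bool": {"should": [
        {"match": {"resourceTypeGeneral": object_type}},
        {"match": {"resourceType": object_type}},
    ], "minimum_should_match": 1}}

clause = object_type_filter("ConferencePaper")
```

This clause can be dropped into the filter section of a larger bool query alongside the year-range filters.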

Test id_type parameter:

id_type={ark}
id_type={doi}
id_type={uuid} -- we do not have UUIDs

DOI counts matched between the two systems for 2014.

ARK results were more complicated to compare between the two systems.

Our query: identifierType:(ark) AND resourceTitle:(Library) AND resourcePublicationYear:2014

Two titles were missing from stage ("Qualitative" and "Working from a temporary Salon"), and production was missing the "Greening the mothership" record.

I'm able to find the "salon" record by searching for "Library's" instead of "library". OpenSearch seems to treat the possessive form of the word as a different token than what the database search matches.

Greening the mothership -- stage has it in the search index, but there may be a problem with the search table?

Qualitative -- https://ezid.cdlib.org/id/ark:/13030/qt2jj6p63d

We believe some of this was caused by newline or other problem characters in the "metadata" field in the database; some items may need that field massaged or fixed for the records to show correctly (including on the item record page where the metadata is displayed).

Jing may be able to give more insight into these cases, but it doesn't seem to be a large systemic problem with search.

Test complex queries combining multiple parameters

We ran some of the complex queries along the way just to keep the result sets manageable. We didn't record query times in all cases, but OpenSearch returned in about a second or less, even for queries with a huge number of results. The database search often took 3-5 seconds for queries with only a small number of results, and for larger sets (over about 5,000 or 10,000 results) it often took more than a minute, timed out, returned gateway errors, and caused heavy disk swapping and reduced memory headroom on the RDS server.

 ```
 title={title}&creator={creator}&pubyear_from={year}&pubyear_to={year}&object_type={type}&id_type={type}
 ```
  1. For each query:

    • [ ] Record response times
    • [ ] Compare response times to comparable searches in existing DB-based search, excluding known cases where query would degrade the service
    • [ ] Verify accuracy and completeness of results relative to the same DB-based search
    • [ ] Verify proper handling of pagination and large result sets
    • [ ] Test concurrency with multiple simultaneous searches
  2. Conduct tests under load:

    • [ ] Simulate peak usage conditions (to the extent possible)
    • [ ] Monitor system resources
  3. Test edge cases:

    • [x] Queries returning very large result sets
    • [x] Queries with no results
    • [ ] Malformed or invalid queries
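The concurrency item in the checklist above can be driven by a small timing harness. A sketch (fake_search is a stand-in for a real HTTP call against stage; everything here is illustrative, not EZID code):

```python
import concurrent.futures
import time

def timed(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result)."""
    start = time.monotonic()
    result = fn(*args)
    return time.monotonic() - start, result

def run_concurrent(fn, args_list, workers=8):
    """Fire the same search function with many argument tuples at once,
    roughly simulating simultaneous users; returns the elapsed times."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed, fn, *a) for a in args_list]
        return [f.result()[0] for f in futures]

# Stand-in for a real search request; replace with an HTTP call.
def fake_search(q):
    time.sleep(0.01)
    return {"query": q, "hits": 0}

elapsed = run_concurrent(fake_search, [("cat",), ("parasite",)], workers=2)
```

Swapping fake_search for a urllib or requests call against the stage search endpoint would give per-query response times under simultaneous load.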

Expected outcome

jsjiang commented 3 months ago

@adambuttrick Hi Adam, the search parameters listed in the test scenarios are for the batch download API. The parameters for the search request are different. Here is a sample query:

search?filtered=t&keywords=&identifier=doi%3A10.7270%2FQ2&title=&creator=&publisher=&pubyear_from=&pubyear_to=&object_type=&id_type=
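For reference, the percent-encoded identifier in that sample query decodes with the Python stdlib as follows (query string copied from above):

```python
from urllib.parse import parse_qs

raw = ("filtered=t&keywords=&identifier=doi%3A10.7270%2FQ2&title=&creator="
       "&publisher=&pubyear_from=&pubyear_to=&object_type=&id_type=")
# keep_blank_values=True preserves the empty parameters in the sample.
params = parse_qs(raw, keep_blank_values=True)
print(params["identifier"])  # ['doi:10.7270/Q2']
print(params["filtered"])    # ['t']
```

So the only non-empty parameters in the sample are filtered=t and the DOI identifier.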

There is no change to the batch download API.

adambuttrick commented 3 months ago

@jsjiang Thanks for catching this! I've revised. Let me know if all looks good.

jsjiang commented 3 months ago

Just added "Test keywords parameter" to the list.

adambuttrick commented 3 months ago

Requirement on https://github.com/CDLUC3/ezid/issues/653

jsjiang commented 3 months ago

Jing's test results:

Public search interface (without login)

jsjiang commented 3 months ago

Jing's test results:

Aug 15, ezid-stg on commit f79af50b8a7c996d8e541a5f8ddf05ba85d7c845 Public search interface (with login as apitest):

jsjiang commented 3 months ago

Here is the full list of identifiers that showed in the download all report but not on the UI for apitest account:

  1. ark:/99999/fk4184tn6s
  2. ark:/99999/fk4281vj6w
  3. ark:/99999/fk43b7kv04
  4. ark:/99999/fk43j50m34
  5. ark:/99999/fk4515jf90
  6. ark:/99999/fk4612kc1f
  7. ark:/99999/fk4795qd8z
  8. ark:/99999/fk48s6883p
  9. ark:/99999/fk49s39549
  10. ark:/99999/fk4c26f73h
  11. ark:/99999/fk4dj7029f
  12. ark:/99999/fk4fj40z93
  13. ark:/99999/fk4fn2qb94
  14. ark:/99999/fk4j97pw5n
  15. ark:/99999/fk4k94qs26
  16. ark:/99999/fk4kd3f535
  17. ark:/99999/fk4n31cs6z
  18. ark:/99999/fk4p28dp8v
  19. ark:/99999/fk4pz6r64j
  20. ark:/99999/fk4q544z8z
  21. ark:/99999/fk4rv23m15
  22. ark:/99999/fk4sr0f39x
  23. ark:/99999/fk4tx4vs69
  24. ark:/99999/fk4wh4412b
  25. ark:/99999/fk4zp5kk9h
  26. doi:10.15697/FK2107G
  27. doi:10.15697/FK2207T 
  28. doi:10.15697/FK24P9X
  29. doi:10.15697/FK25Q0X
  30. doi:10.15697/FK28H2H
  31. doi:10.15697/FK29H2V
  32. doi:10.15697/FK2D65Q
  33. doi:10.15697/FK2F652
  34. doi:10.15697/FK2J07N
  35. doi:10.15697/FK2K06K
  36. doi:10.15697/FK2NQ0R
  37. doi:10.15697/FK2PQ03
  38. doi:10.15697/FK2RH2B
  39. doi:10.15697/FK2SH33
  40. doi:10.15697/FK2W948
  41. doi:10.15697/FK2X65W
  42. doi:10.5072/FK2086D79C
  43. doi:10.5072/FK2154QR91
  44. doi:10.5072/FK21G0TX60
  45. doi:10.5072/FK2320355K
  46. doi:10.5072/FK23R10W7G1
  47. doi:10.5072/FK23X8DP3M
  48. doi:10.5072/FK24X5FK26
  49. doi:10.5072/FK2571JR1J
  50. doi:10.5072/FK26Q23K6Z
  51. doi:10.5072/FK27P94G7B
  52. doi:10.5072/FK29028J58
  53. doi:10.5072/FK2BG2TD13
  54. doi:10.5072/FK2G73J649
  55. doi:10.5072/FK2H70K33W
  56. doi:10.5072/FK2KH0R22V
  57. doi:10.5072/FK2M04809J
  58. doi:10.5072/FK2N018W96
  59. doi:10.5072/FK2N87D27W
  60. doi:10.5072/FK2Q52RG43
  61. doi:10.5072/FK2QR4ZT3C
  62. doi:10.5072/FK2RR20Q1V
  63. doi:10.5072/FK2S183W1G
  64. doi:10.5072/FK2V69KF5J
  65. doi:10.5072/FK2VH5PM6K
  66. doi:10.5072/FK2WH2QH6P
  67. doi:10.5072/FK2WS8TP4X
  68. doi:10.5072/FK2Z89CJ11
adambuttrick commented 2 months ago

Could we also login to the stage as admin and test with some of the data in the user account views (i.e. that do not contain test identifiers) to verify?

jsjiang commented 2 months ago

Testing UI with login merritt:

```
wc -l B7XxQFxqXe6ElG45.csv
3116206 B7XxQFxqXe6ElG45.csv
```

jsjiang commented 2 months ago

Checked record counts on 8/16:

Stage:

Performed the same queries on PRD; the record counts in the ezidapp_identifier and ezidapp_searchidentifier tables are the same for owner=2 (apitest) and owner=124 (merritt):

```sql
# 4084152
select count(id) from ezidapp_identifier where owner_id = 124;
# 4084152
select count(id) from ezidapp_searchidentifier where owner_id = 124;
# 1860
select count(id) from ezidapp_identifier where owner_id = 2;
# 1860
select count(id) from ezidapp_searchidentifier where owner_id = 2;
```

So the differences on stage are caused by inconsistent data in the stage environment.

jsjiang commented 2 months ago

Jing retested the following items on 8/23 on ezid-stg with commit ed6641b76b12c413e2520f650dfe1d7a8df8b885:

adambuttrick commented 2 months ago

Tests passed as part of release https://github.com/CDLUC3/ezid/releases/tag/v3.2.19