Closed by sfisher 2 months ago
Fleshing out more complete criteria:
Performance and Functional
We need to verify that OpenSearch is responsive and functions correctly when handling EZID's full record set.
keywords={single word}
Tested "California": OpenSearch on stage returns in a couple of seconds; production (database search) gets a gateway timeout.
Tested "Cat": stage returns in about a second, production in 6 seconds, with about 7000 results (stage has fewer records).
keywords={space separated multiple words}
Tested "cat parasite": OpenSearch returned 6 results and the database returned 9. However, 2 of those results were created after the time frame in which records were transferred to the stage database. The one result that existed but wasn't returned by OpenSearch contained words like "category" and "cat.1", but not "cat" by itself. This appears to be a difference between how OpenSearch tokenizes words and how the database search works, so we think this is acceptable.
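To illustrate the kind of difference we mean, here is a toy sketch comparing whole-token matching with substring matching. The whitespace tokenizer is a deliberately simplified stand-in and an assumption for illustration only; it is not OpenSearch's actual analyzer, and the substring check is not the database's actual search logic.

```python
def tokens(text):
    """Lowercase and split on whitespace (a simplified stand-in tokenizer)."""
    return text.lower().split()


def token_match(term, text):
    """True only if `term` equals one of the whole tokens in `text`."""
    return term.lower() in tokens(text)


def substring_match(term, text):
    """True if `term` appears anywhere in `text`, even inside a longer word."""
    return term.lower() in text.lower()


sample = "category of cat.1 parasite"  # made-up text, not a real record
print(token_match("cat", sample))      # whole-token comparison
print(substring_match("cat", sample))  # substring comparison
```

Under these assumptions, a record containing only "category" and "cat.1" matches a substring search for "cat" but not a whole-token search, which is consistent with the one missing result.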
keywords={test AND and OR queries}
AND queries are the default and work fine.
TODO: It seems like the OR search doesn't work currently; this is something to fix.
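For reference, one way the AND/OR distinction could be expressed in the OpenSearch query DSL is the `operator` option of a `match` query. The sketch below only builds the request body. The field name `keywords` and the helper `build_keyword_query` are hypothetical, not EZID's actual mapping or code.

```python
def build_keyword_query(text, operator="and", field="keywords"):
    """Build a match query body.

    `field` is a hypothetical field name; `operator` is the match query's
    boolean operator ("and" requires every term, "or" requires any term).
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


and_body = build_keyword_query("cat parasite")                 # all terms required
or_body = build_keyword_query("cat parasite", operator="or")   # any term matches
```

How the UI would signal an OR search (a literal "OR" in the keywords, a separate option, etc.) is a separate decision that this sketch does not cover.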
keywords={common keywords/stop words}
OpenSearch has no problem searching on the stop words currently used for the database search, such as "and", "ark", "doi" and others. It returns a lot of results, but returns them almost instantly.
Having this many results may not be particularly useful, but it doesn't cause performance degradation or other problems, so we believe there is no compelling reason to re-implement the stop words used for the database search.
identifier={specific_identifier}
identifier={partial identifier}
Search by identifier and prefix identifier worked fine in every case we tried.
When searching by shoulder, OpenSearch (OS) returned more than 10,000 results and gave a higher count.
TODO: If a search has more than 10,000 results, OS is configured to stop at that number, so paging beyond it doesn't work. We could also add a matching limit to the paging, if we wanted.
We didn't think it was generally necessary to return more results than 10,000 from an individual query.
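A minimal sketch of how the paging could be clamped, assuming the limit in question is OpenSearch's `index.max_result_window` setting (10,000 by default), which caps `from + size`. The function names are hypothetical and this is not the current EZID paging code.

```python
import math

MAX_RESULT_WINDOW = 10_000  # assumed limit: default index.max_result_window


def last_reachable_page(total_hits, page_size, max_window=MAX_RESULT_WINDOW):
    """Highest page number that can be fetched without exceeding the window."""
    reachable = min(total_hits, max_window)
    return max(1, math.ceil(reachable / page_size))


def page_window(page, page_size, total_hits, max_window=MAX_RESULT_WINDOW):
    """Return (from, size) for a page, clamped so from + size <= max_window."""
    reachable = min(total_hits, max_window)
    page = min(max(page, 1), last_reachable_page(total_hits, page_size, max_window))
    start = (page - 1) * page_size
    size = max(0, min(page_size, reachable - start))
    return start, size


# A request for page 5000 of a 25,000-hit search is clamped to the last reachable page.
print(page_window(5000, 10, 25_000))
```

With this approach the "last page" link would point at the last reachable page rather than the true last page, so the UI would probably want to say so.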
title={exact_title}
Tested with a long title and both systems returned only one result (sorry, forgot to record the title we put in).
title={partial_title}
Tested with "Carbohydrates in an acidic multivalent assembly" and also only got 1 result in both systems.
Very similar results when searching for "Carbohydrate" in title
creator={exact_creator_name}
creator={partial_creator_name}
Tested "Tracy Seneca" and "Seneca, Tracy" and just "Seneca" and all returned the same number of results (for the full name).
publisher={exact_publisher_name}
publisher={partial_publisher_name}
We tested "Nature" and "Springer Nature" and got similar results, although there were differences between the production and stage databases. We spot-checked some that were missing and they didn't exist in the stage database, so this looks good.
pubyear_from={year}
pubyear_to={year}
pubyear_from={start_year}&pubyear_to={end_year}
pubyear=2020 (and Open Context publisher to limit to reasonable results size) -- Gave 29 results on both systems.
We went through the result set for "resourcePublisher:(Open Context) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2021" and checked the items missing in stage. They didn't exist in the stage database either, so the results are functionally equivalent. Some sorting of foreign characters was different between the two systems.
object_type={specific_type}
We tested resourceType (object type) with this query: resourceType:(ConferencePaper) AND resourcePublicationYear:>=2020 AND resourcePublicationYear:<=2024
The current database search returns additional records where resource_type is set to "ConferencePaper", rather than only records where resource_type_general is set to it.
TODO: We can add a way to do an OR for that in either field. The current system uses ResourceType.General only.
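One possible shape for that OR, sketched as an OpenSearch `bool` query with `should` clauses. `resourceType` is the field used in the query above; `resourceTypeGeneral` is a hypothetical name for the second field, and the helper itself is illustrative rather than existing EZID code.

```python
def build_type_query(resource_type, fields=("resourceType", "resourceTypeGeneral")):
    """Match `resource_type` in any of `fields`.

    The second default field name is an assumption, not the real mapping.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: resource_type}} for field in fields],
                "minimum_should_match": 1,  # at least one field must match
            }
        }
    }


body = build_type_query("ConferencePaper")
```

The publication-year range from the test query would still need to be combined with this, for example as `filter` clauses in the same `bool` query.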
id_type={ark}
id_type={doi}
id_type={uuid} -- we do not have UUIDs
DOIs had matching counts for 2014.
ARKs were more complicated between the two systems.
Our query: identifierType:(ark) AND resourceTitle:(Library) AND resourcePublicationYear:2014
Two titles were missing from stage, "Qualitative" and "Working from a temporary Salon", and production was missing the "Greening the mothership" record.
I'm able to find the "salon" one by searching for "Library's" instead of "library". OpenSearch seems to treat the possessive form of the word as a different token, which differs from what the database search returns.
Greening the mothership -- stage has in search index but maybe a problem with the search table?
Qualitative -- https://ezid.cdlib.org/id/ark:/13030/qt2jj6p63d
We believe some of this was caused by newlines or other problem characters in the "metadata" field in the database. Some items may need that field massaged or fixed for the records to show correctly (including on the item record page, where the metadata is shown).
Jing may be able to give more insight to these cases, but it doesn't seem as though it's a large systemic problem with search.
We ran some of the complex queries along the way just to get result sets that were manageable. We didn't record query times in all cases, but found that OpenSearch returned in about a second or less, even for queries with a huge number of results. The database search often took 3-5 seconds for queries with only a small number of results, while larger sets (over about 5,000 or 10,000 results) often took more than a minute, timed out, or returned gateway errors, and caused heavy disk swapping and reduced memory headroom on the RDS server.
```
title={title}&creator={creator}&pubyear_from={year}&pubyear_to={year}&object_type={type}&id_type={type}
```
For each query:
Conduct tests under load:
Test edge cases:
@adambuttrick Hi Adam, the search parameters listed in the test scenarios are for the batch download API. The parameters for the search request include
Here is a sample query:
search?filtered=t&keywords=&identifier=doi%3A10.7270%2FQ2&title=&creator=&publisher=&pubyear_from=&pubyear_to=&object_type=&id_type=
There is no change to the batch download API.
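For anyone scripting these checks, here is a small sketch that rebuilds the sample search query above from named criteria. The parameter names come from that sample query; the helper `build_search_url` is hypothetical and uses only the Python standard library.

```python
from urllib.parse import urlencode

# Parameter names as they appear in the sample search query.
SEARCH_PARAMS = (
    "keywords", "identifier", "title", "creator", "publisher",
    "pubyear_from", "pubyear_to", "object_type", "id_type",
)


def build_search_url(**criteria):
    """Build the relative search URL; unset parameters are sent as empty."""
    unknown = set(criteria) - set(SEARCH_PARAMS)
    if unknown:
        raise ValueError(f"unknown search parameters: {sorted(unknown)}")
    params = [("filtered", "t")]
    params += [(name, criteria.get(name, "")) for name in SEARCH_PARAMS]
    return "search?" + urlencode(params)


print(build_search_url(identifier="doi:10.7270/Q2"))
```

Rejecting unknown names is a design choice in the sketch so that batch download parameters are not passed to the search request by mistake.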
@jsjiang Thanks for catching this! I've revised. Let me know if all looks good.
Just added "Test keywords parameter" to the list.
Requirement on https://github.com/CDLUC3/ezid/issues/653
Jing's test results:
f79af50b8a7c996d8e541a5f8ddf05ba85d7c845
Public search interface (without login)
last page: changed to last page after warning
last page: got Page unavailable error
Jing's test results:
Aug 15, ezid-stg on commit f79af50b8a7c996d8e541a5f8ddf05ba85d7c845 Public search interface (with login as apitest):
Manage IDs page: returned 98 identifiers
Download all: created downloadable file S0VxH9CbVcEOZAuq.zip. This file contains 166 identifiers.
Note: The number of identifiers in the report does not match what shows in the Manage IDs page (166 vs 98)
select * from ezidapp_searchidentifier where owner_id = 2; returned 166 records
ezid-app-search-stg by owner=2 returned 98 records
View identifier got a "No such identifier: doi:10.31223/FK3TEST_1." error.
Note: test identifiers are deleted regularly from the Search Identifier table through the Delete action of the SearchIndexerQueue.
This is most likely a data issue. This identifier is in the search identifier table but not in the identifier and refidentifier tables.
Here is the full list of identifiers that showed in the download all report but not on the UI for apitest account:
Could we also log in to stage as admin and test with some of the data in the user account views (i.e. ones that do not contain test identifiers) to verify?
Testing UI with login merritt:
ezidapp_identifier and ezidapp_searchidentifier for merritt (owner_id=124) all returned 3,116,205
wc -l B7XxQFxqXe6ElG45.csv
3116206 B7XxQFxqXe6ElG45.csv
Checked record counts on 8/16:
Stage:
ezidapp_identifier table for owner=2: 92
ezidapp_searchidentifier table for owner=2: 158
Performed the same queries on PRD; the record counts in the ezidapp_identifier and ezidapp_searchidentifier tables are the same for owner=2 (apitest) and owner=124 (merritt):
# 4084152
select count(id) from ezidapp_identifier where owner_id = 124 ;
# 4084152
select count(id) from ezidapp_searchidentifier where owner_id = 124;
# 1860
select count(id) from ezidapp_identifier where owner_id = 2 ;
# 1860
select count(id) from ezidapp_searchidentifier where owner_id = 2;
So the differences on stage are caused by inconsistent data in the stage environment.
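The count comparison above could be scripted roughly as follows. This is a sketch against an in-memory SQLite database with a simplified schema. The table names come from the queries above, while the `identifier` column, the helper names, and the sample identifiers are assumptions for illustration. The real tables live in the RDS database and would need that database's own driver.

```python
import sqlite3

TABLES = ("ezidapp_identifier", "ezidapp_searchidentifier")


def counts_by_owner(conn, owner_id):
    """Row count per table for one owner (table names come from the fixed tuple)."""
    return {
        table: conn.execute(
            f"select count(id) from {table} where owner_id = ?", (owner_id,)
        ).fetchone()[0]
        for table in TABLES
    }


def orphaned_search_rows(conn, owner_id):
    """Identifiers in the search table that have no row in the identifier table."""
    rows = conn.execute(
        "select s.identifier from ezidapp_searchidentifier s "
        "left join ezidapp_identifier i on i.identifier = s.identifier "
        "where s.owner_id = ? and i.id is null order by s.identifier",
        (owner_id,),
    )
    return [row[0] for row in rows]


def demo_connection():
    """In-memory database with made-up rows, for trying the helpers out."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(
            f"create table {table} "
            "(id integer primary key, identifier text, owner_id integer)"
        )
    conn.execute(
        "insert into ezidapp_identifier (identifier, owner_id) values (?, ?)",
        ("doi:10.5555/EXAMPLE_A", 2),
    )
    conn.executemany(
        "insert into ezidapp_searchidentifier (identifier, owner_id) values (?, ?)",
        [("doi:10.5555/EXAMPLE_A", 2), ("doi:10.5555/EXAMPLE_B", 2)],
    )
    return conn


conn = demo_connection()
print(counts_by_owner(conn, 2))
print(orphaned_search_rows(conn, 2))
```

Listing the orphaned identifiers, not just the counts, makes it easier to confirm that a mismatch is leftover test data rather than a search problem.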
Jing retested the following items on 8/23 on ezid-stg with commit ed6641b76b12c413e2520f650dfe1d7a8df8b885:
Tests passed as part of release https://github.com/CDLUC3/ezid/releases/tag/v3.2.19
The performance of OpenSearch may be different with the larger number of records in EZID (33 million?) vs a smaller test set for development.
I believe the stage server has more records for testing scale. This ticket needs more fleshing out, but . . .