WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
250 stars 202 forks source link

Add tests for edge cases in the `/related` endpoint #3188

Open krysal opened 1 year ago

krysal commented 1 year ago

Current Situation

In #3177 and #3181, we issued quick hotfixes to solve errors when trying to access related works from a media without a defined title or creator. These could have been caught with automated tests.

Suggested Improvement

We want to include some unit tests to prevent this situation from happening in the future, just in case.

Benefit

Prevent more errors when calling for the /related endpoint.

Additional context

Mentioned links in the beginning. Check those to see the reproduction steps before the issues were fixed.

obulat commented 1 year ago

The problem we had in those cases was that when the value for some field is None, then the Elasticsearch document _source does not have this property at all. The Hit object uses the _source dictionary to set its properies, and throws an error when we try to access a property that does not exist.

For instance, when the title is None in the database, the Elasticsearch document's _source dictionary does not have a title key, so hit.title throws an error.

So, I think there are 2 ways of solving this:

  1. Make sure that all the Elasticsearch documents have all the properties on them. If the value is None, then the value for the property should be set to either an empty string or the Elasticsearch null_value^1. This will probably increase the index size, but will prevent errors when we try to access a property from the Hit object directly (hit.title). It will still throw an error for properties that do not exist for the items (hit.non_existent_property). This will also probably make it easier to search for items with null values, if we ever want to do that.
  2. Make sure that we always use getattr instead of direct property access when trying to access a hit property. So, instead of hit.title, we should always use getattr(hit, "title"). We could add a linting rule that would prevent us from accessing a property directly to prevent errors in the future.

What do you think, @WordPress/openverse-maintainers , which approach would be preferable?

dhruvkb commented 1 year ago

I would prefer option 2, it makes it more explicit on the API side that we know a value could be absent on the Hit object and doesn't require increasing our index size. We can specify some values that we know for sure to be present on the Hit instances (like identifier, source, provider etc.) and for them allow dot-access.

sarayourfriend commented 1 year ago

No strong opinion here, but to give support for option 1, it creates a consistent API, and doesn't require us to "remember" that title could not exist. It's a bit easier to type as well, if we decided to encode None | str. I don't even know how you type a potentially nonexistent attribute on an object in Python.

We can specify some values that we know for sure to be present on the Hit instances (like identifier, source, provider etc.) and for them allow dot-access.

This sounds nice to me, but I don't know how we would enforce it. We don't have type checking, which would allow us to at least force a check for None. If there's a way to actually allow dot-access for some and not for others, that would be great, but non-obvious conventions and "ambient knowledge" of a niche API is very easy to forget and to introduce a new bug.

The most ideal answer is for all our code paths to get tested with real data for all possible edge cases. That way it just does not matter what approach we take in the code, it will have to handle it one way or another. Making it easy/"the default way" for tests to run against data with known edge cases would go a long way to solve this, regardless of the convention or change in API.

obulat commented 1 year ago

The most ideal answer is for all our code paths to get tested with real data for all possible edge cases. That way it just does not matter what approach we take in the code, it will have to handle it one way or another. Making it easy/"the default way" for tests to run against data with known edge cases would go a long way to solve this, regardless of the convention or change in API.

I would love to have these tests, but I'm afraid it's really difficult because we don't know what we don't know, and there might be edge cases that we are not aware of.

krysal commented 1 year ago

I like option 2, which is more or less what we're already doing, but regardless of said options, I also think that is not a substitute for testing, as Sara said. That is the main goal of this ticket.

I would love to have these tests, but I'm afraid it's really difficult because we don't know what we don't know, and there might be edge cases that we are not aware of. – @obulat

We know these fields are optional, it is our "business model" so it must be kept in mind for these type of code decisions. I'm not sure what other unknowns you're referring but it is part of the purpose of automated testing to reduce that uncertainty.

obulat commented 1 year ago

Sorry for hijacking this issue, @krysal! So, we can say that in this issue, we should add unit tests when the main media items does not some properties, and the related endpoint does not throw an error, but returns a result, right?

We know these fields are optional, it is our "business model" so it must be kept in mind for these type of code decisions. I'm not sure what other unknowns you're referring but it is part of the purpose of automated testing to reduce that uncertainty.

When writing the PR for related endpoint, I knew that some items don't have title, creator or tags. However, I did not know that the hit in Elasticsearch might not have a property at all. I thought it would return either None or an empty string for hit.title, for instance. This is what I mean by "we don't know what we don't know". There are some properties that are not what they are described to be (or expected). For instance, I think we have some licenses saved as upper-case, and some as lower-case. We don't always know (or don't have it documented) whether an empty value is a None, a blank string, or a lack of property. And it can also be different throughout the stack: in the catalog, in the API database, in Elasticsearch index, and in the frontend store. That's why I think we will always have some edge cases that will cause errors somewhere, until we finish data normalization throughout the stack, and all data conforms to the data we expect.