WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Older results are favored regardless of quality #715

Open AetherUnbound opened 2 years ago

AetherUnbound commented 2 years ago

Description

It appears that older results are favored, regardless of image quality.

Reproduction

  1. Visit https://search-staging.openverse.engineering/search/image?q=computer
  2. Observe that the first few results are from the 2000s
  3. Visit https://search-staging.openverse.engineering/search/image?q=computer&source=stocksnap
  4. Observe that the images are higher quality

Additional context

Resolution

zackkrida commented 2 years ago

Very, very interesting. I suppose this has to do with our 'default' sort order? Would we want to sort by new by default, once popularity has been considered?

raghuvar-arora-au2 commented 2 years ago

Hi, I would like to work on this.

raghuvar-arora-au2 commented 2 years ago

Are we crawling the "created date" of the media?

krysal commented 2 years ago

@raghuvar-arora-au2 You're already assigned to three issues here, please complete or release them before taking on more work. Thanks!

Also, this ticket needs some discussion on the approach to solve it as it will have a major impact on the search.

raghuvar-arora-au2 commented 2 years ago

Sure. I have already made PRs for two of them. I strongly believe this ticket should be a higher priority, as older records are not just of lower image quality but may also no longer be relevant. Consider the examples given by Madison. This severely hurts the usability of the application. I do understand that it may require massive changes across the application. Can you include me in the discussion whenever it happens, please? I would like to work on this problem.

sarayourfriend commented 2 years ago

@raghuvar-arora-au2 We'll have the discussions here, though if you have any specific ideas you'd like to discuss for this issue, another place to have that is the WordPress Make Slack chat that we hold on a weekly basis. Details about that can be found here: https://github.com/WordPress/openverse/blob/main/CONTRIBUTING.md#-keeping-in-touch

However, I think keeping most of the discussion on this issue is for the best for now. I (and I think most of the maintainers) strongly agree that it "severely hurts the usability of the application."

If you have any ideas for how to begin evaluating a solution for this, please share them and we can let you know if they're on the right track. For my part, I think we could potentially start by documenting exactly how results are ordered currently. By doing that it might illuminate a path forward for more intentionally ordering them in a way that makes more sense.

Another thing I was thinking of in this regard is whether we could somehow order images by a perceived "quality" metric. Maybe even something as basic as trying to detect the level of compression in an image and boost images with low compression. In this issue @thedevhaider shared some ideas for how to use frequencies to decide between two similar images. I wonder if we could generate a "score" based on something like that to share with ES for ordering? I'm concerned it would only work for photography though, which could unfairly disadvantage illustrations or other types of images.
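As a rough illustration of what a per-image "score" could look like, here is a minimal sketch. It does not implement the frequency approach from the linked issue or any compression detection; it swaps in the variance of a Laplacian filter response, a common sharpness proxy, purely as an example. The function name `laplacian_variance` and the nested-list grayscale input are assumptions made for this sketch, not anything that exists in Openverse.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Return the variance of a 4-neighbour Laplacian over interior pixels.

    `gray` is a hypothetical input format: a list of rows, each a list of
    grayscale intensities. Higher values suggest more fine detail; a flat
    or heavily blurred image scores close to zero.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")

    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Sum of the four direct neighbours minus four times the centre.
            response = (
                gray[y - 1][x]
                + gray[y + 1][x]
                + gray[y][x - 1]
                + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(response)
    return pvariance(responses)


flat = [[128] * 4 for _ in range(4)]
checkerboard = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
flat_score = laplacian_variance(flat)
detail_score = laplacian_variance(checkerboard)
```

A score like this would likely share the concern raised above: it rewards photographic detail and could penalise flat illustrations, so it would need validating on non-photographic works before being fed to ES.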

Lots to think about here though and am curious to hear what you think a good path forward would be!

zackkrida commented 2 years ago

I have another, silly question that I'd love folks' opinions on. If this is currently true:

"It appears that older results are favored, regardless of image quality."

How would we feel if the inverse was true?

"It appears that newer results are favored, regardless of image quality."

Would that be preferable for any reason? If that was an extremely small change to make, for example, but provided any level of benefit, it might be a good first step.

sarayourfriend commented 2 years ago

That's kind of what I'm getting at too, Zack. I don't think the right solution to this is to just call .reverse() on the ES index :stuck_out_tongue: If we're unhappy with the current ordering (and we all are) then we need to start thinking about what we want to feed into our scores for each document so ES knows what we want from it.

raghuvar-arora-au2 commented 2 years ago

Hi @sarayourfriend, I have a few temporary mitigation schemes in mind, although I'm not sure they are even feasible:

  1. Give a higher score to records from sources that are relatively modern. See the example given by @AetherUnbound in the Description.
  2. I have yet to look into the crawler, but if it has lately been crawling only recent uploads, we can boost the score by crawled date. We may have to reindex every month or every couple of weeks. I'm assuming that the crawler does not scrape the date the media was uploaded, and that it has finished crawling the older records and is only scraping new ones now.
  3. Only use modern sources by default in the application.

A more permanent solution would be an extension on 2). We may require the date the media was uploaded (I do not know if we are scraping that while crawling), and we can use this field to boost the score. Reindexing will be required periodically if we are scoring records during the ingestion.
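For illustration, a date-based boost along these lines could be expressed with Elasticsearch's `function_score` query and a `gauss` decay function on a date field. The sketch below only builds the request body as a Python dictionary. The field names `title` and `created_on`, the helper name `recency_boosted_query`, and the default scale are assumptions for the example; they are not confirmed to match the Openverse index.

```python
def recency_boosted_query(search_term, date_field="created_on",
                          scale="365d", decay=0.5):
    """Build a search body that multiplies text relevance by a recency factor.

    With these settings, a document whose date is `scale` away from now
    has its score multiplied by `decay`; newer documents are multiplied
    by a factor closer to 1.
    """
    return {
        "query": {
            "function_score": {
                # Hypothetical text match; the real query is more involved.
                "query": {"match": {"title": search_term}},
                "functions": [
                    {
                        "gauss": {
                            date_field: {
                                "origin": "now",
                                "scale": scale,
                                "decay": decay,
                            }
                        }
                    }
                ],
                # Multiply the text score by the decay value.
                "boost_mode": "multiply",
            }
        }
    }


body = recency_boosted_query("computer")
```

Because the decay is evaluated at query time relative to `now`, a boost of this shape would not by itself require periodic reindexing, though it does depend on a reliable date being available for each record.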

PS: I do not have any experience with Elasticsearch; I only learned about it last week.

raghuvar-arora-au2 commented 2 years ago

@sarayourfriend Do you want me to try and implement one of the solutions?

sarayourfriend commented 2 years ago

I'm not sure that we can jump into implementation on this issue yet @raghuvar-arora-au2. To @zackkrida's point, simply favoring more recent results isn't guaranteed to improve relevancy for all or even a majority of searches. If you check out the discussion in https://github.com/WordPress/openverse/issues/1573, I think they're intimately related. If you want to work on improving the search result relevancy and quality, then we'd probably want to go through our RFC process and cover at least the following:

  1. How are results currently ordered
  2. Define relevancy
  3. Define quality
  4. Based on the information we have today of the images, what are improvements we can make to the relevancy and quality
  5. What are further improvements we could make that would require more data (perhaps using data we can only get from the actual image files themselves, for example)

It would probably be good to scope this just to images for now unless there are any obvious things that could benefit the other media types. I anticipate one of the trickier parts of this will be that once we've defined relevancy and quality for our context, how do we prioritize them when scoring results? There's got to be some interplay there. I suspect (strongly) that highly relevant results will not always be the highest quality (as far as some judgement of the individual work goes). Maybe "quality" isn't even a good marker that we can use for ordering the results.
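To make the interplay concrete, here is a toy sketch of one possible way to blend the two signals: relevance is scaled by a bounded quality bonus. The formula, the `quality_weight` parameter, and the sample numbers are all hypothetical and chosen only to show how the weight changes the ordering; nothing here reflects how Openverse scores results.

```python
def combined_score(relevance, quality, quality_weight=0.2):
    """Scale a relevance score by a bonus proportional to quality in [0, 1]."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    return relevance * (1.0 + quality_weight * quality)


def rank(results, quality_weight=0.2):
    """Return results sorted by combined score, highest first."""
    return sorted(
        results,
        key=lambda r: combined_score(r["relevance"], r["quality"], quality_weight),
        reverse=True,
    )


# Made-up example data: "a" is slightly more relevant, "b" is higher quality.
results = [
    {"id": "a", "relevance": 10.0, "quality": 0.1},
    {"id": "b", "relevance": 9.0, "quality": 0.9},
]

order_strong = [r["id"] for r in rank(results, quality_weight=0.2)]
order_weak = [r["id"] for r in rank(results, quality_weight=0.05)]
```

With the larger weight the higher-quality result overtakes the more relevant one, and with the smaller weight it does not, which is the prioritization question an RFC would need to settle.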

raghuvar-arora-au2 commented 2 years ago

I have been doing a bit of reading and found that temporal signals do play a role in real-time search engines. I don't have anything concrete to put into an RFC yet, but I'll let you know once I do.

sarayourfriend commented 2 years ago

Awesome! As you find helpful resources please feel free to share them here :rocket:

"real-time search engines"

I suppose this might be a foundational question for us to ask: whether Openverse is a real-time search engine :thinking: