coverified / backend

Backend for the CoVerified Widget
BSD 3-Clause "New" or "Revised" License

Data inconsistency compared to current widget data #12

schliflo closed 4 years ago

schliflo commented 4 years ago
There's still something wrong here: looking at the preview of my PR, I noticed some inconsistencies compared to the current version.

[Screenshots: current version (left, Screenshot 2020-04-21 at 00 04 55) | preview using the API (right, Screenshot 2020-04-21 at 00 04 44)]

I checked the scheduled task that triggers the feed ingestion over here: https://console.cloud.google.com/cloudscheduler?project=coverified2020 but it seems to be running fine. @johanneshiry do you have an idea why this is happening?

johanneshiry commented 4 years ago

Are we using the same feed URLs? In my first implementation I used the ones from the Google spreadsheet but didn't compare them with the GitHub Action here. Would you mind posting the feeds to be used here pls?

schliflo commented 4 years ago

The currently used feeds are here: https://github.com/coverified/data/blob/master/.github/workflows/update.yml#L20 Seems like they pretty much match the feeds configured here: https://github.com/coverified/backend/blob/master/cfg/config.cfg 🤷‍♂️

johanneshiry commented 4 years ago

Thanks for the links - after thinking a little bit about it, it is obvious why there is a difference in the results. Everything is fine. It's not a bug, it's a feature.

--> as discussed initially, the backend filters the headlines for the following keywords:

keywords: ["corona", "covid", "sars-cov", "sars-cov", "sars-cov", "sars-cov", "epidemic"]

(see https://github.com/coverified/backend/blob/master/cfg/config.cfg)

As you can see, the headlines in your left picture do not include any of the keywords, hence they are not persisted in the database. If you want to change this behavior pls let me know.
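To make the headline-only filtering concrete, here is a minimal sketch in Python. It is not the backend's actual code: the function name `headline_matches`, the sample entries, and the case-insensitive substring matching are all assumptions for illustration; only the keywords come from the config quoted above (deduplicated).

```python
# Keywords as quoted from cfg/config.cfg above, deduplicated.
KEYWORDS = ["corona", "covid", "sars-cov", "epidemic"]


def headline_matches(headline: str, keywords=KEYWORDS) -> bool:
    """Return True if the headline contains any keyword.

    Assumption: matching is a case-insensitive substring check.
    """
    text = headline.lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical feed entries, invented for this example.
entries = [
    {"title": "Corona: new measures announced", "content": "Details follow."},
    {"title": "Press briefing on Tuesday", "content": "Update on COVID-19 testing."},
]

# Only entries whose headline matches would be persisted.
persisted = [entry for entry in entries if headline_matches(entry["title"])]
```

With a filter like this, the second entry is dropped even though its content mentions COVID-19, which is the kind of difference visible in the screenshots.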

schliflo commented 4 years ago

Ah, this explains the differences. The current behaviour is that it matches the keywords against both headline and content and includes the article in the result if either of them matches. In fact, each feed entry (JSON) currently gets flattened down to a single string, which is then simply matched against the keyword list. Simple but effective :) (see https://github.com/coverified/webcomponent/blob/851a74b1392ceeca2bcee437edf99fbde426c8de/src/util.js#L46)

So the backend implementation only looks at the headlines and is therefore more restrictive. If we could extend the filter here to also include the content, and maybe even the URL, then we'd get the same results as the current implementation.
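A rough Python sketch of the two approaches, just to illustrate the idea. It is not the webcomponent's `util.js` and not the backend's code: the function names, the field names `title`, `content` and `url`, and the sample entry are hypothetical, and case-insensitive substring matching is an assumption.

```python
import json

# Keywords from the backend config quoted earlier in this thread, deduplicated.
KEYWORDS = ["corona", "covid", "sars-cov", "epidemic"]


def flattened_matches(entry: dict, keywords=KEYWORDS) -> bool:
    """Flatten the whole entry to one string and match it against the keywords.

    Note: serialising the entry also includes its key names, so a key that
    happens to contain a keyword would match as well.
    """
    flat = json.dumps(entry, ensure_ascii=False).lower()
    return any(keyword in flat for keyword in keywords)


def fields_match(entry: dict, fields=("title", "content", "url"), keywords=KEYWORDS) -> bool:
    """Match only selected fields; missing fields count as empty strings."""
    text = " ".join(str(entry.get(field, "")) for field in fields).lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical entry: the keyword appears in the content, not in the headline.
entry = {
    "title": "Press briefing on Tuesday",
    "content": "Update on COVID-19 testing.",
    "url": "https://example.org/news/briefing",
}
```

For this entry, `flattened_matches(entry)` and `fields_match(entry)` both accept it, while `fields_match(entry, fields=("title",))` (headline only) rejects it. Selecting fields explicitly is a bit stricter than flattening everything, since it ignores key names and unrelated metadata.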