Closed schliflo closed 4 years ago
Are we using the same feed URLs? In my first implementation I used the ones from the Google spreadsheet, but I didn't compare them with the GitHub Action here. Would you mind posting the feeds to be used here, pls?
The currently used feeds are here: https://github.com/coverified/data/blob/master/.github/workflows/update.yml#L20 They seem to pretty much match the feeds configured here: https://github.com/coverified/backend/blob/master/cfg/config.cfg 🤷‍♂️
Thanks for the links - after thinking a little bit about it, it is obvious why there is a difference in the results. Everything is fine. It's not a bug, it's a feature.
--> as discussed initially, the backend filters the headlines for the following keywords:
keywords: ["corona", "covid", "sars-cov", "sars-cov", "sars-cov", "sars-cov", "epidemic"]
(see https://github.com/coverified/backend/blob/master/cfg/config.cfg)
As you can see, the headlines in your left picture do not include any of the keywords, hence they are not persisted in the database. If you want to change this behavior, pls let me know.
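To make the headline-only filter concrete, here is a minimal sketch of the idea. It is not the backend's actual code: the function name `headlineMatches`, the case-insensitive substring comparison, and the sample headlines are assumptions for illustration. The keyword list is taken from the config quoted above, with duplicates removed.

```javascript
// Keywords as quoted from the config above (duplicates removed).
const HEADLINE_KEYWORDS = ["corona", "covid", "sars-cov", "epidemic"];

// Hypothetical helper: returns true if the headline contains at least one
// keyword. Case-insensitive substring matching is an assumption here.
function headlineMatches(headline, keywords = HEADLINE_KEYWORDS) {
  const text = String(headline ?? "").toLowerCase();
  return keywords.some((kw) => text.includes(kw.toLowerCase()));
}

// Made-up example headlines:
const keptHeadline = headlineMatches("Corona: new measures announced");
const droppedHeadline = headlineMatches("New travel rules announced");
```

With a filter like this, the second headline is dropped even if the article body is about the pandemic, which would explain the missing entries.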
Ah, this explains the differences. The current behaviour of the webcomponent is that it matches the keywords against headline & content and includes the article in the result if either of the two matches. In fact, each feed entry (JSON) currently gets flattened down to a single string, which is then simply matched against the keyword list. Simple but effective :) (see https://github.com/coverified/webcomponent/blob/851a74b1392ceeca2bcee437edf99fbde426c8de/src/util.js#L46) So the backend implementation here only looks at the headlines and is therefore more restrictive. If we could extend the filter here to also include the content and maybe even the URL, then we'd get the same results as the current implementation.
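A rough sketch of the flatten-and-match approach described above could look like this. It is not a copy of the linked `util.js`: the names `flattenEntry` and `entryMatches`, the entry shape (`title`, `content`, `link`), and the case-insensitive comparison are assumptions for illustration.

```javascript
// Keywords as quoted from the backend config (duplicates removed).
const FEED_KEYWORDS = ["corona", "covid", "sars-cov", "epidemic"];

// Hypothetical helper: recursively joins all values of a feed entry
// (nested objects and arrays included) into one space-separated string.
function flattenEntry(value) {
  if (value === null || value === undefined) return "";
  if (typeof value === "object") {
    return Object.values(value).map(flattenEntry).join(" ");
  }
  return String(value);
}

// Returns true if any keyword occurs anywhere in the flattened entry,
// i.e. in the headline, the content, or the URL.
function entryMatches(entry, keywords = FEED_KEYWORDS) {
  const text = flattenEntry(entry).toLowerCase();
  return keywords.some((kw) => text.includes(kw.toLowerCase()));
}

// Made-up entry: the headline has no keyword, but the content does.
const sampleEntry = {
  title: "New travel rules announced",
  content: "Due to COVID-19, entry requirements have changed.",
  link: "https://example.org/news/1",
};
const sampleIncluded = entryMatches(sampleEntry);
```

Because the whole entry is searched, an article like `sampleEntry` is included here, while a headline-only filter would skip it.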
I checked the scheduled task that triggers the feed ingestion over here: https://console.cloud.google.com/cloudscheduler?project=coverified2020 but it seems to be running fine. @johanneshiry do you have an idea why this is happening?