[backend] modify rss http getter to a simple fetch (#8736)

OpenCTI-Platform / opencti

Open Cyber Threat Intelligence Platform

https://opencti.io


[backend] modify rss http getter to a simple fetch (#8736) #9006

Open JeremyCloarec opened 2 weeks ago

JeremyCloarec commented 2 weeks ago

Proposed changes

Instead of using the custom httpClient, use a simple fetch to get RSS data.
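As a rough illustration of the idea, here is a minimal sketch of retrieving a feed with a plain fetch call. It is not the code changed in this PR: `fetchRssXml`, `FetchLike` and `MinimalResponse` are hypothetical names, and the fetch implementation is passed in as a parameter only so the sketch can be exercised without network access.

```typescript
// Minimal structural types so the sketch does not depend on DOM or Node typings.
interface MinimalResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

// Hypothetical helper (an assumption, not the function touched by this PR):
// download the raw RSS XML and fail loudly on non-2xx answers such as a 403.
async function fetchRssXml(uri: string, fetchImpl: FetchLike): Promise<string> {
  const response = await fetchImpl(uri);
  if (!response.ok) {
    throw new Error(`RSS fetch failed for ${uri}: HTTP ${response.status}`);
  }
  return response.text();
}

// Usage with the runtime's built-in fetch (available in Node 18+):
// const xml = await fetchRssXml('https://example.org/feed.xml', fetch);
```

Note that a bare fetch like this carries none of the configuration a shared HTTP client may apply, so anything that client handled would have to be re-added explicitly.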

Related issues

#8736

Checklist

[ ] I consider the submitted work as finished
[ ] I tested the code for its functionality
[ ] I wrote test cases for the relevant use cases (coverage and e2e)
[ ] I added/updated the relevant documentation (either on github or on notion)
[ ] Where necessary I refactored code to improve the overall quality

Further comments

All feeds linked in the related issue are now fetched without any 403 errors. However, the https://cybersecurity.att.com/site/blog-all-rss feed isn't ingested properly because its items don't have any pubDate metadata; they only have a dc:date. Not sure if we want to modify the RSS parser to use dc:date when an item has no pubDate?
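For discussion, the dc:date fallback could look roughly like the sketch below. It assumes the parser exposes each item as an object with optional `pubDate` and `dc:date` string fields; `RssItemLike` and `resolveItemDate` are hypothetical names, not the existing parser API.

```typescript
// Hypothetical item shape (an assumption about what the RSS parser yields).
interface RssItemLike {
  pubDate?: string;
  'dc:date'?: string;
}

// Prefer pubDate, then fall back to dc:date. Returns undefined when neither
// field holds a parseable date, so the caller decides how to treat the item.
function resolveItemDate(item: RssItemLike): Date | undefined {
  const candidates = [item.pubDate, item['dc:date']];
  for (const raw of candidates) {
    if (raw !== undefined && raw.trim() !== '') {
      const parsed = new Date(raw);
      if (!Number.isNaN(parsed.getTime())) {
        return parsed;
      }
    }
  }
  return undefined;
}
```

With this approach, feeds that already provide pubDate keep their current behaviour, and only items lacking a usable pubDate are affected.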

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 66.25%. Comparing base (96bbd5a) to head (bf2ef75). Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rm/opencti-graphql/src/manager/ingestionManager.ts | 33.33% | 2 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #9006      +/-   ##
==========================================
- Coverage   66.28%   66.25%   -0.04%
==========================================
  Files         597      597
  Lines       61098    61156      +58
  Branches     6287     6288       +1
==========================================
+ Hits        40501    40521      +20
- Misses      20597    20635      +38
```


JeremyCloarec commented 2 weeks ago

> I don't know what the advantages of using our custom httpClient are, but are we OK with bypassing it?

I'm not sure either, but it was the only way I was able to bypass the 403 errors. We talked about it with @romain-filigran, and the plan is to merge it into master and keep a close eye on whether the previously working RSS feeds break following this change. If they do, this will need to be reverted.

aHenryJard commented 3 days ago

I think that the OpenCTI httpClient manages at least the proxy configuration. Have you tested your PR behind a proxy?

JeremyCloarec commented 3 days ago

You're right, I didn't think about proxy settings, so this solution doesn't work. I wasn't able to find the root cause. I suspect Cloudflare bot protection, but I didn't find any proper way to understand what triggers the 403 rejections. What's weird is that even when sending the exact same request as the browser with curl, I get a 403 error, while the same request works properly in the browser. Would you be up for a pair-debugging session to dig into it?