geerlingguy / jeffgeerling-com

Drupal Codebase for JeffGeerling.com
https://www.jeffgeerling.com
GNU General Public License v2.0

RSS feed showing duplicates in some feed readers #145

Closed geerlingguy closed 2 years ago

geerlingguy commented 2 years ago

It looks like one more bit of fallout from the #141 DDoS attacks and mitigations is a broken-for-some-users RSS feed:

$ curl https://www.jeffgeerling.com/blog.xml 2>/dev/null | grep "guid isPermaLink" | head -1
    <guid isPermaLink="false">3189 at http://www.jeffgeerling.com</guid>

$ curl https://www.jeffgeerling.com/blog.xml\?abcde 2>/dev/null | grep "guid isPermaLink" | head -1
    <guid isPermaLink="false">3189 at https://www.jeffgeerling.com</guid>

Basically, there are cases where the URL returned in the guid uses http, and others where it uses https. I'm not exactly sure how this is happening through Cloudflare (it never happened before), but I'm guessing Cloudflare is somehow passing through http requests sometimes (even though I have "Full (strict)" enabled), and those responses are getting cached with the wrong guids.

Internally in Nginx, I have redirects from http to https though, so I'm also not sure how the http could ever get through to a rendered feed...
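
To see at a glance whether a given copy of the feed mixes schemes, the two curl checks above can be generalized. This is a minimal sketch, assuming every guid has the "NNNN at http(s)://host" shape shown in the output; count_guid_schemes is a hypothetical helper name, not something from this repo:

```shell
# Tally the URL schemes used inside <guid> elements of an RSS document on stdin.
# Hypothetical helper; assumes each guid contains exactly one http(s):// URL.
count_guid_schemes() {
  grep -oE '<guid[^>]*>[^<]*</guid>' \
    | grep -oE 'https?://' \
    | sort \
    | uniq -c \
    | awk '{ sub("://", "", $2); print $2 "=" $1 }'
}

# Usage (the query string is only a cache-buster, as in the second curl above):
#   curl -s "https://www.jeffgeerling.com/blog.xml?$(date +%s)" | count_guid_schemes
```

A healthy feed should print a single `https=N` line; an `http=N` line means that copy was rendered as plain http.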

geerlingguy commented 2 years ago

Searching around a bit, I also found the issue on Drupal.org: Option to force URLs to return HTTPS instead of HTTP.

And indeed, I have in my Drupal configuration (in a local.settings.php):

$settings['reverse_proxy'] = TRUE;
$settings['reverse_proxy_addresses'] = ['IP_OF_SERVER_HERE'];

But how does this work if I also want Drupal to detect all Cloudflare IP addresses as reverse_proxy_addresses? And is that affected by Nginx still fronting the requests?

Edit: Also, a note from the metatag issues: https://www.drupal.org/project/metatag/issues/2842049#comment-14260772

geerlingguy commented 2 years ago

lol, of course I've written my own blog post on the topic... Configuring CloudFlare with Drupal 8 to protect the Pi Dramble.

geerlingguy commented 2 years ago

I just updated my Drupal config to use:

// Reverse proxy - Cloudflare.
$settings['reverse_proxy'] = TRUE;
$settings['reverse_proxy_addresses'] = array($_SERVER['REMOTE_ADDR']);
$settings['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';

We'll see if this fixes anything.

geerlingguy commented 2 years ago

Hmm... though reverse_proxy_header is deprecated / removed in Drupal 9. Grr.

geerlingguy commented 2 years ago

Nick Craver mentions I could enable HSTS on my domain (see https://twitter.com/Nick_Craver/status/1501248041004716043 and the rest of that thread) and it might help resolve it too. Though it'd be nice to make sure all the settings are correct up and down the stack.

geerlingguy commented 2 years ago

I might go with this sample code from this comment:

if (isset($_SERVER['HTTP_CF_CONNECTING_IP'], $_SERVER['HTTP_X_FORWARDED_FOR'])) {
  // If the CloudFlare header is contained in the X-Forwarded-For header, then
  // all IP addresses to the right of that entry are reverse-proxies, which are
  // additional to the value in $_SERVER['REMOTE_ADDR'].
  // E.g. <client> --- <CDN> --- <Varnish> --- <drupal>.
  $client = $_SERVER['HTTP_CF_CONNECTING_IP'];
  // Split on commas and trim, since the space after each comma is optional.
  $ips = array_map('trim', explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']));
  if ($keys = array_keys($ips, $client)) {
    // Use the last occurrence of the client IP in the chain.
    $position = end($keys);
    $reverseProxies = array_slice($ips, $position + 1);
    $reverseProxies[] = $_SERVER['REMOTE_ADDR'];

    $settings['reverse_proxy'] = TRUE;
    $settings['reverse_proxy_addresses'] = $reverseProxies;
  }
}

geerlingguy commented 2 years ago

Commit above has the code I added to live local.settings.php.

ChrisLawther commented 2 years ago

Provider of the curl output here, attempting to be helpful.

> I'm guessing somehow Cloudflare is passing through http requests sometimes

> Internally in Nginx, I have redirects from http to https though, so I'm also not sure how the http could ever get through to a rendered feed...

Given that a reader has seen a feed with plain http URLs in it, the redirect could be considered a good thing as it means attempts to read the full article page will succeed. Although the goal here is rightly not to publish those URLs to the feed in the first place.

As for how/why the feed has plain http URLs in it in the first place, I'm afraid I have no knowledge of Drupal, but in an attempt to rubber-duck towards a resolution:

(I'm not looking for answers here - I'm trying to find the right question to help you realise the cause)

As a curious observer, it seems there must be multiple Drupal instances at play, with varying senses of "self". How does a Drupal instance know its identity? Does it get the scheme the same way? Are the nodes all definitely configured identically?

If that's what this is already trying to address:

> I just updated my Drupal config to use:
>
> // Reverse proxy - Cloudflare.
> $settings['reverse_proxy'] = TRUE;
> $settings['reverse_proxy_addresses'] = array($_SERVER['REMOTE_ADDR']);
> $settings['reverse_proxy_header'] = 'HTTP_CF_CONNECTING_IP';
>
> We'll see if this fixes anything.

... I can report that your latest article ("Rate limiting requests per IP address in Nginx") appears twice in my feed.

ChrisLawther commented 2 years ago

Oh, and I've just noticed the significance of the wording of the issue title:

> RSS feed showing duplicates if someone subscribed to http version

I have not subscribed to the http version. My feed reader is configured to fetch https://www.jeffgeerling.com/blog.xml - it's always the same request from me, but the links in the returned content vary.

geerlingguy commented 2 years ago

@ChrisLawther - Okay, in that case I've fixed the title. I'm going to give it a couple days and we'll see if the next post hits the same issue. I don't have the time to dive into some CF requests on the server itself right now but that'll be the next step, to see where Drupal's getting its http requests from.

ChrisLawther commented 2 years ago

I was wondering whether anything in the response headers might point towards a misconfigured Drupal instance. If I compare an http-returning response with an https-returning one, the only differences that aren't simply time related are:

report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=YiRbRdmXT3HOZiy%2FEAe6znimuIFEYJP7OIfa4vdCucCM6wPOnuKCHURPmtwvkWvImGqB02v9vpOU3LEAQwQSGP8wcVm5nWN8kaR54cEK1MlquGcFeSc1P6dm5jeIfOqcteW5TW7EUA%3D%3D"}],"group":"cf-nel","max_age":604800}
vs.
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=MMrO673IS6%2FHL2VJnchOCF4J2b1hsOyxpxAGn73Uv81w7qZ9wReORTUO%2BtNZUbbwhnOWG9uLgSXHzMMIq8GksbDDUYpmaPVsxVdXLHUbU%2BnSjYzOnmrB4QNCnjrzKkEqVrR3RHssmQ%3D%3D"}],"group":"cf-nel","max_age":604800}

And

cf-ray: 6e9dcdcd1b697747-LHR vs. cf-ray: 6e9dce404e6b72e8-LHR

... but they may both be CloudFlare internal details and nothing to do with the actual feed generation.
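
For anyone repeating this comparison, the filtering can be scripted. This is a rough sketch, assuming the headers of each response were saved to a file first (for example with `curl -s -D file -o /dev/null URL`); the list of "time related" headers to ignore is a guess for illustration, and diff_headers is a hypothetical name:

```shell
# Strip carriage returns, drop headers expected to change with time, and sort.
# The ignore list below is an assumption, not an exhaustive set.
normalize_headers() {
  tr -d '\r' < "$1" \
    | grep -viE '^(date|age|expires|last-modified):' \
    | sort
}

# Show the remaining differences between two saved header dumps.
# Exit status follows diff: 0 if identical after filtering, 1 if they differ.
# The body runs in a subshell so the temp-file variables stay local.
diff_headers() (
  a="$(mktemp)"
  b="$(mktemp)"
  normalize_headers "$1" > "$a"
  normalize_headers "$2" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  exit "$status"
)

# Usage: diff_headers http-response.headers https-response.headers
```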

PH4NTOMiki commented 2 years ago

> cf-ray: 6e9dcdcd1b697747-LHR vs. cf-ray: 6e9dce404e6b72e8-LHR

The CF-Ray header is like a request ID. The LHR suffix means you are connecting through the London data center (Cloudflare uses the nearest airport code), so in your case London Heathrow.

The report-to header is for bot/spam protection and network logging (this is the part that applies to CF).

PH4NTOMiki commented 2 years ago

From https://developers.cloudflare.com/fundamentals/get-started/http-request-headers/: "The CF-ray header is a hashed value that encodes information about the data center and the visitor’s request."

And the report-to reference: https://support.cloudflare.com/hc/en-us/articles/360050691831-Understanding-Network-Error-Logging

geerlingguy commented 2 years ago

Someone else reported the issue today, too:

[Screenshot, 2022-03-10]

This is from Miniflux.

PH4NTOMiki commented 2 years ago

> Nick Craver mentions I could enable HSTS on my domain (see https://twitter.com/Nick_Craver/status/1501248041004716043 and the rest of that thread) and it might help resolve it too.

@geerlingguy I think most RSS consumers don't respect HSTS.

PH4NTOMiki commented 2 years ago

I saw in your RSS feed that the link tag points to the http variant. Drupal sets it here: https://github.com/drupal/drupal/blob/515d10367bbe5cc158153a90e7960f92c2862745/core/modules/views/templates/views-view-rss.html.twig#L24 and that link variable is populated by Url::fromRoute('<front>')->setAbsolute()->toString() here: https://github.com/drupal/drupal/blob/515d10367bbe5cc158153a90e7960f92c2862745/core/modules/views/views.theme.inc#L888

Can you try a search-and-replace in your DB with something like phpMyAdmin, replacing http://www.jeffgeerling.com/ with https://www.jeffgeerling.com/?

geerlingguy commented 2 years ago

@PH4NTOMiki - The problem is that Drupal generates URLs on the fly, and the protocol is determined by how Drupal sees the request come in. I'm pretty sure some requests from Cloudflare are being treated as non-https, for one reason or another. When I wasn't using Cloudflare I never had that issue, because I only had one proxy in front of Drupal (Nginx), and I could easily detect when the proxy was being used. Some requests from Cloudflare seem to bypass the proxy logic I added a few comments earlier, and I'm guessing that's when Drupal generates a non-https feed, which then also gets cached by Cloudflare.

PH4NTOMiki commented 2 years ago

Do you have "Always Use HTTPS" enabled in the Cloudflare dashboard? https://developers.cloudflare.com/ssl/edge-certificates/additional-options/always-use-https/#encrypt-all-visitor-traffic

geerlingguy commented 2 years ago

@PH4NTOMiki - Yes.

PH4NTOMiki commented 2 years ago

Do you have fastcgi_param HTTPS on; in nginx config?

geerlingguy commented 2 years ago

@PH4NTOMiki - I didn't, though I just forced it to on in /etc/nginx/fastcgi_params, restarted Nginx, and cleared caches on Cloudflare...

 21:25:42 ~ 
$ curl https://www.jeffgeerling.com/blog.xml 2>/dev/null | grep "guid isPermaLink" | head -1
    <guid isPermaLink="false">3191 at https://www.jeffgeerling.com</guid>

 21:25:43 ~ 
$ curl https://www.jeffgeerling.com/blog.xml\?asa 2>/dev/null | grep "guid isPermaLink" | head -1
    <guid isPermaLink="false">3191 at https://www.jeffgeerling.com</guid>

We'll see if that fix holds!
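
As I understand it, PHP learns the request scheme from the HTTPS FastCGI parameter, so forcing it to on makes every request look like https to Drupal regardless of how it reached Nginx. Here is a small sketch for confirming the directive is actually present in a config file; has_fastcgi_https_on is a hypothetical helper, and it only matches the literal `on` form, not a variable such as `$https`:

```shell
# Succeeds if the given Nginx config file contains an uncommented
# "fastcgi_param HTTPS on;" directive (whitespace between tokens may vary).
has_fastcgi_https_on() {
  grep -qE '^[[:space:]]*fastcgi_param[[:space:]]+HTTPS[[:space:]]+on[[:space:]]*;' "$1"
}

# Usage:
#   has_fastcgi_https_on /etc/nginx/fastcgi_params && echo "HTTPS is forced on"
```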

PH4NTOMiki commented 2 years ago

Fingers crossed, I'll make a script to test some URLs, hopefully they all come up as https. Will report back

PH4NTOMiki commented 2 years ago

> Fingers crossed, I'll make a script to test some URLs, hopefully they all come up as https. Will report back

I tested multiple routes and every one came back as https.
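
The script itself wasn't posted, so the following is only a guess at what such a check could look like, not the one actually used. It scans a response body for plain-http links back to the site; the paths in the usage comment are illustrative assumptions, and find_http_links is a hypothetical name:

```shell
# Print every plain-http link to www.jeffgeerling.com found in a body on stdin.
# Exit status is grep's: 0 if at least one was found, 1 if the body is clean.
find_http_links() {
  grep -oE 'http://www\.jeffgeerling\.com[^"<[:space:]]*'
}

# Usage (paths are examples; the query string is a cache-buster):
#   for path in /blog.xml /blog /; do
#     if curl -s "https://www.jeffgeerling.com${path}?$(date +%s)" | find_http_links; then
#       echo "plain http link found on ${path}"
#     fi
#   done
```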

geerlingguy commented 2 years ago

Sounds like I owe @PH4NTOMiki a beer!

I'll close this for now—if anyone sees the duplicates again, please let me know!