DanDiplo / Umbraco.MediaDownload

Adds a download option to the Umbraco Media library

Consider adding additional detail to the listing on Umbraco marketplace #3

Closed AndyButland closed 1 year ago

AndyButland commented 1 year ago

I noted you'd listed this package on the Umbraco marketplace, so firstly, thanks for taking the time to do that. I noticed though that we only have the details collected from NuGet for the package... which is fine, but to help it be found more easily it's possible to provide additional information, as discussed here.

For example, you can add a category that the package will be shown under, and also provide additional descriptions or "read me" content to better describe what the package does and what benefits developers would get from installing it.

To use this you need to create a file in the root of the project URL you are referencing in the NuGet package - so in this case, at https://www.diplo.co.uk/blog/web-development/umbraco-marketplace-diplo.mediadownloader.json - and populate it with the additional information in JSON format, as per the documentation linked above.

DanDiplo commented 1 year ago

Hi Andy. I did actually create a JSON file for it - it's https://github.com/DanDiplo/Umbraco.MediaDownload/blob/master/umbraco-marketplace.json - you can see it was added 4 months ago. I think the issue is that in the NuGet package I referenced my blog post rather than the GitHub project. So that would explain why it was never found!

Anyway, I've done as you suggested and the file can now be accessed from: https://www.diplo.co.uk/blog/web-development/umbraco-marketplace-diplo.mediadownloader.json

Thanks for the heads up!

AndyButland commented 1 year ago

Thanks - looks good, though I'm seeing odd behaviour when we try to load that file from Azure: it gives a 403 Forbidden response. However, if I run the sync process locally, it's fine. So not sure if there's anything you can do about that... but just letting you know in case you are wondering why there's still no update.

DanDiplo commented 1 year ago

The site does use Cloudflare, so I guess it's possible it could be something to do with that. Looking in the WAF logs I can see it's blocking a request coming from a Microsoft network to that file:

image

Are you able to add a valid user-agent string to the request? I'm guessing the WAF will try to challenge anything that looks like a bot, and a missing user-agent is a common trigger for that.

AndyButland commented 1 year ago

Could well be that, thanks for the info. I'll give it a try.

AndyButland commented 1 year ago

I've made an update this morning @DanDiplo to add a user agent of "UmbracoMarketplaceBot". Unfortunately I'm still seeing it blocked.

If you have time, would you mind seeing if you can find this request (was at 7:40 UK time) please? Assuming you confirm you do see this user agent in the request details, it looks like you could configure Cloudflare to allow it. Thanks.
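For anyone wanting to reproduce this kind of check, here is a minimal Python sketch of a GET request that carries an explicit user agent. It is not the marketplace's actual sync code (which isn't shown in this thread); the function names `build_request` and `fetch_status` are made up for illustration, and only the URL and the user-agent value come from the discussion above.

```python
import urllib.error
import urllib.request

# URL and user-agent value are the ones quoted in this thread.
MARKETPLACE_JSON_URL = (
    "https://www.diplo.co.uk/blog/web-development/"
    "umbraco-marketplace-diplo.mediadownloader.json"
)
USER_AGENT = "UmbracoMarketplaceBot"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a GET request with an explicit User-Agent."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
        method="GET",
    )


def fetch_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code (hypothetical helper)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A 403 from a WAF surfaces here as an HTTPError carrying the code.
        return err.code
```

Calling `fetch_status(build_request(MARKETPLACE_JSON_URL))` from different networks is one way to compare responses, though the result depends entirely on how the target site's firewall treats the caller.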

DanDiplo commented 1 year ago

Hi Andy. Sorry for the trouble - I can indeed see the request coming in:

{
  "action": "managed_challenge",
  "clientASNDescription": "MICROSOFT-CORP-MSN-AS-BLOCK",
  "clientAsn": "8075",
  "clientCountryName": "NL",
  "clientIP": "51.124.135.75",
  "clientRequestHTTPHost": "www.diplo.co.uk",
  "clientRequestHTTPMethodName": "GET",
  "clientRequestHTTPProtocol": "HTTP/1.1",
  "clientRequestPath": "/blog/web-development/umbraco-marketplace-diplo.mediadownloader.json",
  "clientRequestQuery": "",
  "datetime": "2023-03-07T07:12:41Z",
  "rayName": "7a4108123be41ee7",
  "ruleId": "bot_fight_mode",
  "rulesetId": "",
  "source": "botFight",
  "userAgent": "UmbracoMarketplaceBot",
  "matchIndex": 0,
  "metadata": [],
  "sampleInterval": 1
}

So the UA is set as you say and is being "seen". But Cloudflare is still instigating robot wars!

My guess is that Cloudflare doesn't recognise the UA and so assumes it's a bot. The fact that it's requesting a JSON file might also trigger some heuristic. Hard to know, as CF don't explain why the rule is firing!
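As a rough illustration of reading an event like the one above, the Python sketch below pulls out the fields that say which product acted and what it did. The field names are taken from the log entry in this comment only; whether every Cloudflare event has the same shape is an assumption, and `summarise_event` is a hypothetical helper name.

```python
import json

# A subset of the event quoted in this comment.
EVENT_JSON = """
{
  "action": "managed_challenge",
  "clientAsn": "8075",
  "clientIP": "51.124.135.75",
  "ruleId": "bot_fight_mode",
  "source": "botFight",
  "userAgent": "UmbracoMarketplaceBot"
}
"""


def summarise_event(event: dict) -> str:
    """Return 'source/ruleId -> action (UA: userAgent)' for one event."""
    return (
        f'{event["source"]}/{event["ruleId"]} -> {event["action"]} '
        f'(UA: {event["userAgent"]})'
    )


def is_bot_fight_mode(event: dict) -> bool:
    """True when the event was raised by the 'botFight' source seen above."""
    return event.get("source") == "botFight"


event = json.loads(EVENT_JSON)
print(summarise_event(event))
```

Here the summary makes it plain that the challenge came from Bot Fight Mode rather than from a custom firewall rule, even though the expected user agent was present.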

Anyway, I've added an exclusion for the UA of UmbracoMarketplaceBot so hopefully that fixes it for my site. Though it's possible you might have other sites give you the same trouble. Sorry for the hassle! Let me know if this works...

AndyButland commented 1 year ago

No problem, thank you for helping to resolve it. As you say, it could affect other sites too, so we might need to document the user-agent that needs to be unblocked.

Unfortunately though, I'm still getting the forbidden response. I tried again at 9:36 UK time. Don't suppose you see anything different on your side, do you, now you've added the firewall rule?

DanDiplo commented 1 year ago

Hi again. I've definitely added the rule:

image

However, having read the Cloudflare docs carefully and checked their forum, it appears that you can't bypass Bot Fight Mode, even when a rule says "Allow":

"The scope of the Allow action is limited to firewall rules; matching requests are not exempt from action by other Cloudflare security products such as Bot Fight Mode, IP Access Rules, and WAF Managed Rules."

https://developers.cloudflare.com/firewall/cf-firewall-rules/actions/

All you can do is disable Bot Fight Mode, which I'm not particularly keen on doing (and it won't help you with other sites).

I imagine getting around it isn't easy. You probably can't masquerade as a known good bot (e.g. Google), as they probably have a DB of IPs to match against UAs.

I actually wrote a scraper a few years ago and it was very tricky getting around some firewall rules. What I found was that it was best to emulate a browser UA string, but also ensure you add other headers, such as "accept". Also ensure the headers are sent in the correct order, because if you pretend to be a browser they can check the headers are sent the same way as a browser.

Looking at the code, I was doing this, if it helps (it's old so it's still .NET Framework, but you get the idea).

    // Assumes this.Client is an HttpClient; requires: using System.Net.Http.Headers;
    // Send the full set of headers a desktop browser would, not just its User-Agent.
    this.Client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/apng,*/*;q=0.8");
    this.Client.DefaultRequestHeaders.AcceptEncoding.Add(new StringWithQualityHeaderValue("gzip"));
    this.Client.DefaultRequestHeaders.AcceptEncoding.Add(new StringWithQualityHeaderValue("deflate"));
    this.Client.DefaultRequestHeaders.Add("Accept-Language", "en-GB,en;q=0.9,en-US;q=0.8");
    this.Client.DefaultRequestHeaders.Add("Connection", "keep-alive");
    this.Client.DefaultRequestHeaders.Add("Cache-Control", "no-cache");
    this.Client.DefaultRequestHeaders.Add("Pragma", "no-cache");
    // Browser-style User-Agent string, as used by the original scraper.
    this.Client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/27.0.1453 Safari/537.36");

No idea if that helps, but as I say I know from experience bot detection is sophisticated now!

AndyButland commented 1 year ago

Thanks, yes, I figure we'll just be chasing our tail at best if we try to use impersonation. And I've been reading about "bot fight mode" from Cloudflare too - it seems you are right and you can only override it by IP rules, not user agent ones. There seem to be quite a few community support requests about it!

I can give you the list of outbound IPs for the synchronisation service - but, it's Azure, so there are quite a few I'm afraid. They should be stable, but all could be used.

If you are minded to set them up as exceptions, here they are:

20.126.185.183
20.126.185.201
20.126.185.205
20.103.180.234
20.126.186.39
20.126.186.100
20.105.224.12
20.126.185.183
20.126.185.201
20.126.185.205
20.103.180.234
20.126.186.39
20.126.186.100
20.126.186.221
20.126.187.99
20.126.189.55
20.126.189.103
20.126.189.153
20.126.189.191
20.126.189.221
20.126.190.94
20.126.190.147
20.126.190.160
20.126.190.202
20.126.190.252
20.126.191.32
20.126.191.64
20.126.191.140
20.126.191.240
20.93.227.32
20.31.105.6
20.31.105.125
20.31.106.58
20.31.106.127
20.31.106.244
20.93.229.65
20.31.107.10
20.105.224.12

But if you don't want to bother, that's fine too of course... looks like it'll be a bit time-consuming.

Meantime I'll do a manual update of your data so at least we have what you've provided. That will stick until such time the bot does get a response from your file, when it will load from that and replace what I've manually added.

DanDiplo commented 1 year ago

Hi Andy. I've added the IPs (after removing the duplicates), but I'm not sure it will work, as I think the WAF rules run after Bot Fight Mode. There might be more options in the Pro account, but in mine I can't see a way of doing it.
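Removing the duplicates from a pasted IP list can be sketched in Python as below. This is only an illustration of the idea, not how it was actually done here; `unique_ips` is a made-up helper, and the sample is a handful of addresses from the list in the previous comment.

```python
import ipaddress


def unique_ips(lines):
    """Return the distinct IP addresses in first-seen order.

    Blank lines are skipped; a malformed address raises ValueError.
    """
    seen = {}
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # Parsing validates the address and normalises its text form.
        address = ipaddress.ip_address(text)
        seen.setdefault(address, None)  # dicts keep insertion order
    return [str(address) for address in seen]


# A few addresses from the list above, including repeats.
sample = [
    "20.126.185.183",
    "20.126.185.201",
    "20.105.224.12",
    "20.126.185.183",
    "20.126.185.201",
    "20.105.224.12",
]
print(unique_ips(sample))
```

Keeping first-seen order makes it easy to compare the cleaned list against the original by eye.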

Long term I'll just look at moving the file to GitHub like my other projects. If you can "force" this, then the file is: https://raw.githubusercontent.com/DanDiplo/Umbraco.MediaDownload/master/umbraco-marketplace.json

AndyButland commented 1 year ago

OK, thanks for trying it out anyway Dan. Could be we have to come back to this, particularly if we have issues with other setups, but for now, as discussed, I've done a manual update of your package with the info provided, which you can see here.

As mentioned, this will stick until such time the bot does get a response from your file, when it will load from that and replace what I've manually added.

I'll close this issue as I think that's all we can do for now. I've learnt a bit about Cloudflare at least!