NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
https://www.nuget.org/
Apache License 2.0

Implement and document a way for custom scripts to exclude their downloads from stats #6553

Open joelverhagen opened 6 years ago

joelverhagen commented 6 years ago

Today, certain downloads are not counted in the download count reports, namely those made by certain bots and crawlers.

The top unknown/crawler user agents in the past 7 days are:

| UserAgent | DownloadCount |
| --- | --- |
| NuGet Test Client/5.0.0 (Microsoft Windows 10.0.14393 ) | 12014 |
| Pingdom.com_bot_version1.4(http://www.pingdom.com/) | 9585 |
| facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | 1936 |
| NuGet Test Client/4.9.0 (Microsoft Windows 10.0.14393 ) | 702 |
| NuGet Test Client | 630 |
| NuGet Test Client/5.0.0 (Microsoft Windows 10.0.15063 ) | 626 |
| python-requests/2.10.0 | 615 |
| Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | 548 |
| python-requests/2.19.1 | 536 |
| Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | 303 |

Note that there is a bug in the roll-up that causes these downloads to be counted after 42 days anyway (https://github.com/NuGet/NuGetGallery/issues/6552), but that's beside the point.

The top user agents in the past 7 days that have an "empty string" category but are clearly not NuGet client implementations are:

| UserAgent | DownloadCount |
| --- | --- |
| (unknown) | 700495 |
| Veracode | 168417 |
| NuGetTestModeEnabled | 100279 |
| NuGetMirror/4.4.0 | 86027 |
| Knapcode.ExplorePackages.Bot/4.7.0 | 69128 |
| Go-http-client/2.0 | 63458 |
| okhttp/3.9.0 | 62224 |
| EdgeAccel/2.0 | 28524 |
| Mozilla/5.0 NuGet | 14852 |
| Python-urllib/2.7 | 9197 |

Clearly, some of these are scripts that shouldn't be included in the download counts. We should implement a way for user agents matching a certain pattern (perhaps containing the substring (bot)) to be excluded from the download counts. We should also document this approach in the API docs and recommend that users specify a user agent (an RFC-style "SHOULD" 😄?).
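To make the script-author side concrete, here is a rough sketch of how a custom script might opt out, assuming the (bot) marker suggested above were adopted. Nothing here is an agreed convention yet: the marker, the script name MyPackageScanner, and the package URL are all hypothetical.

```python
import urllib.request

# Hypothetical user agent: a descriptive product token plus the proposed
# "(bot)" marker. The marker is only a suggestion in this issue.
USER_AGENT = "MyPackageScanner/1.0 (bot)"


def build_download_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the script via its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    # Placeholder URL; a real script would point at an actual package download.
    request = build_download_request("https://example.org/example.1.0.0.nupkg")
    print(request.get_header("User-agent"))
```

The request is only built here, not sent, so the sketch runs without network access.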

We'll need to update the user agent parser to look for this pattern. Today, unexpected/custom user agents are given the client name (unknown) and the empty string client category.
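A minimal sketch of what the parser change could look like, written in Python for illustration only and not based on the actual stats parser. It assumes the marker is the literal substring (bot), matched case-insensitively, and that marked user agents get a "Crawler" category; the type, function, and category names are hypothetical. Known NuGet client detection is left out entirely.

```python
from typing import NamedTuple, Optional


class ParsedUserAgent(NamedTuple):
    client_name: str
    client_category: str
    counts_toward_downloads: bool


# Hypothetical marker; the issue only floats "(bot)" as one possibility.
BOT_MARKER = "(bot)"


def classify_user_agent(user_agent: Optional[str]) -> ParsedUserAgent:
    """Classify a custom user agent, excluding self-declared bots from counts."""
    if user_agent and BOT_MARKER in user_agent.lower():
        # Assumed behavior: self-declared bots are excluded from download counts.
        return ParsedUserAgent("(unknown)", "Crawler", False)
    # Today's behavior for unexpected/custom user agents: client name
    # "(unknown)" and the empty string client category, still counted.
    return ParsedUserAgent("(unknown)", "", True)
```

With this rule, a user agent like MyScript/1.0 (bot) would be excluded, while Knapcode.ExplorePackages.Bot/4.7.0 would still be counted, since it contains "Bot" but not the parenthesized marker. Whether a looser match is preferable is a design question for the implementation.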

xavierdecoster commented 6 years ago

@joelverhagen FYI, NuGetTestModeEnabled is the undocumented way that is built into the NuGet client :) References: