GSA / data.gov

Main repository for the data.gov service
https://data.gov

exclude bot traffic from tracking stats #4452

Closed FuhuXia closed 11 months ago

FuhuXia commented 1 year ago

The tracking update job was changed from nightly to weekly because of its long processing time. We suspect that bot crawling traffic is the cause. Bot traffic makes the popularity counts and tracking stats less meaningful, and because bots crawl each and every dataset, the tracking update job takes unnecessarily long to process. Since the user-agent ticket, we are able to differentiate bot traffic from regular user visits, so we should try to exclude bot traffic from tracking stats.

Sketch

FuhuXia commented 11 months ago

Inspected 19 days of CloudFront logs from 2023-09-01 to 2023-09-19:

Raw file size: 15 GB
Total requests: 19 million (19,435,508)
Requests made by all bots: 9 million (8,556,836)
Requests made by the biggest bot, Googlebot: 4 million (4,159,174)

Top 10 bots' requests:

Googlebot\/: 4159174
PetalBot: 1684301
YisouSpider: 275729
[wW]get: 237275
Bytespider: 230629
GPTBot: 229068
python-requests: 200007
axios: 197848
Y!J: 183735
bingbot: 179556
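
A per-bot count like the one above could be produced along these lines. This is only a sketch, not the script actually used for this analysis. It assumes CloudFront standard logs (tab-separated rows, a `#Fields:` header naming the columns, URL-encoded user agents); `BOT_PATTERNS` and `count_bot_requests` are hypothetical names, and the pattern list is just the ten names shown above.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Regex patterns copied from the top-10 list above (illustrative, not complete).
BOT_PATTERNS = {
    name: re.compile(name)
    for name in [r"Googlebot\/", "PetalBot", "YisouSpider", "[wW]get",
                 "Bytespider", "GPTBot", "python-requests", "axios",
                 "Y!J", "bingbot"]
}


def count_bot_requests(lines):
    """Count requests per bot pattern in CloudFront standard log lines."""
    counts = Counter()
    ua_index = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns; find the user-agent column.
            fields = line[len("#Fields:"):].split()
            ua_index = fields.index("cs(User-Agent)")
            continue
        if not line or line.startswith("#") or ua_index is None:
            continue
        columns = line.split("\t")
        if len(columns) <= ua_index:
            continue  # skip malformed rows
        user_agent = unquote(columns[ua_index])
        # Attribute the request to the first pattern that matches.
        for name, pattern in BOT_PATTERNS.items():
            if pattern.search(user_agent):
                counts[name] += 1
                break
    return counts
```

Reading the column position from the `#Fields:` header, rather than hard-coding it, keeps the sketch independent of the exact set of fields in the log files.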

Not all bots parse JavaScript and make requests to /_tracking. For those that do:

Bot traffic to /_tracking:

Googlebot\/: 804965
Y!J: 55168
Yeti: 10050
Bytespider: 9645
Applebot: 3236
HeadlessChrome: 522
Baiduspider: 260
YisouSpider: 250
Chrome-Lighthouse: 171
Cincraw: 149
Google-Read-Aloud: 125
yandex\.com\/bots: 104
bingbot: 48
heritrix: 38
SeekportBot: 30
PetalBot: 27
Google-Safety: 6
Google-InspectionTool: 6
Blackboard: 5
HubSpot: 4
Ahrefs(Bot|SiteAudit): 3
proximic: 3
Dataprovider.com: 2
facebookexternalhit: 2
Google-Structured-Data-Testing-Tool: 1
BingPreview\/: 1
archive.org_bot: 1
AdsBot-Google([^-]|$): 1
SkypeUriPreview: 1

The top five bots account for 99.8% of all bot requests to /_tracking, which means all we need to do is exclude 5 bots.
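
The 99.8% figure can be rechecked from the /_tracking counts listed above. The snippet below is just that arithmetic; the variable names are made up for illustration.

```python
# Bot requests to /_tracking, copied from the list above.
tracking_hits = {
    "Googlebot": 804965, "Y!J": 55168, "Yeti": 10050, "Bytespider": 9645,
    "Applebot": 3236, "HeadlessChrome": 522, "Baiduspider": 260,
    "YisouSpider": 250, "Chrome-Lighthouse": 171, "Cincraw": 149,
    "Google-Read-Aloud": 125, "yandex.com/bots": 104, "bingbot": 48,
    "heritrix": 38, "SeekportBot": 30, "PetalBot": 27, "Google-Safety": 6,
    "Google-InspectionTool": 6, "Blackboard": 5, "HubSpot": 4, "Ahrefs": 3,
    "proximic": 3, "Dataprovider.com": 2, "facebookexternalhit": 2,
    "Google-Structured-Data-Testing-Tool": 1, "BingPreview": 1,
    "archive.org_bot": 1, "AdsBot-Google": 1, "SkypeUriPreview": 1,
}

total = sum(tracking_hits.values())
top_five = sorted(tracking_hits.values(), reverse=True)[:5]
share = sum(top_five) / total

print(f"top five: {sum(top_five)} of {total} requests ({share:.1%})")
# prints: top five: 883064 of 884824 requests (99.8%)
```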

The difference between human traffic and bot traffic is:

Humans focus on popular datasets, while bots' interest is spread widely, as this data shows:

Of all datasets (roughly represented by "GET /dataset/.+") visited in this period, 91% were never visited by a human. Googlebot visited 70% of them.

For the most popular datasets, such as electric-vehicle-population-data and fdic-failed-bank-list, human visits make up 99% of the count and bot visits 1%.
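
One way coverage numbers of this kind could be computed is sketched below. It is not the analysis script itself: it assumes each request has already been reduced to a `(path, visitor)` pair, where `visitor` is either the string `"human"` or a bot name, and `dataset_coverage` is a hypothetical helper.

```python
import re

# Same rough definition of a dataset page as in the comment above.
DATASET_PATH = re.compile(r"^/dataset/.+")


def dataset_coverage(requests):
    """Return (share never visited by a human, share visited by Googlebot).

    `requests` is an iterable of (path, visitor) pairs; visitor is "human"
    or a bot name such as "Googlebot" (an assumed pre-classification).
    """
    visitors_by_dataset = {}
    for path, visitor in requests:
        if DATASET_PATH.match(path):
            visitors_by_dataset.setdefault(path, set()).add(visitor)

    total = len(visitors_by_dataset)
    if total == 0:
        return 0.0, 0.0
    never_human = sum(
        1 for visitors in visitors_by_dataset.values() if "human" not in visitors
    )
    by_googlebot = sum(
        1 for visitors in visitors_by_dataset.values() if "Googlebot" in visitors
    )
    return never_human / total, by_googlebot / total
```

Both shares are taken over the set of distinct dataset paths seen in the period, not over requests, which matches the "of all datasets visited" wording above.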

Conclusion:

If we keep traffic from the top 5 bots out of the tracking data, we will cut the workload of the tracking-update script by 80-90%, at the cost of about a 1% drop in the popularity counts of the most-visited datasets.
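
As an illustration of the idea only, a user-agent check for the five bots that dominate /_tracking might look like the sketch below. The exclusion could equally be done at the proxy or CDN layer; `EXCLUDED_BOTS` and `should_record_tracking` are hypothetical names, and counting requests with no user agent is an assumption, not a decision taken in this ticket.

```python
import re

# The five bots with the most /_tracking requests in the list above.
EXCLUDED_BOTS = re.compile(r"Googlebot\/|Y!J|Yeti|Bytespider|Applebot")


def should_record_tracking(user_agent):
    """Return False when the request comes from one of the excluded bots."""
    if not user_agent:
        # Assumption: requests without a user agent are still counted.
        return True
    return EXCLUDED_BOTS.search(user_agent) is None
```

For example, `should_record_tracking("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")` returns `False`, while a regular browser user agent returns `True`.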

FuhuXia commented 11 months ago

The PR above stops traffic from the top 6 bots from reaching the "/_tracking" endpoint. Preliminary analysis suggests this could cut the execution time of the tracking-update script by 80-90%. Note that this estimate rests on several assumptions. To assess the actual impact, we will run a few manual tracking-update tasks starting three days from now and check the real-world effect. Hopefully a 12-hour weekly job can become a 2-hour nightly job.

FuhuXia commented 11 months ago

PR deployed on 02 Oct 2023 20:50:02 GMT.

One manual tracking_update run executed:

2023-10-05T10:22:06.10-0400 [APP/TASK/58121c57/0] OUT 2023-10-05 14:22:06,106 INFO [ckanext.geodatagov] 105470 package indexes to be rebuilt starting from 2023-09-29 00:00:00

I will run another one tomorrow and do some calculations to get a good estimate of the nightly job's workload.

FuhuXia commented 11 months ago

Another manual tracking_update run:

2023-10-06 17:15:57,123 INFO  [ckanext.geodatagov] 8337 package indexes to be rebuilt starting from 2023-10-03 00:00:00

So the nightly job will be indexing about 8k datasets, which can be done in about 40 minutes.

FuhuXia commented 11 months ago

This PR changes tracking-update from a weekly job to a nightly job. Before bot traffic was excluded, a tracking update took 4-5 hours as a nightly job or 10-12 hours as a weekly job. With this issue resolved, we are going back to a nightly job, which finishes in under 1 hour.

FuhuXia commented 11 months ago

The nightly jobs for the past few days are looking good, finishing in 40-50 minutes.