bellingcat / wayback-google-analytics

A lightweight tool for scraping current and historic Google Analytics data
https://pypi.org/project/wayback-google-analytics/
MIT License

6 - Sometimes inaccurate timestamps for archived codes #9

jclark1913 closed 1 year ago

jclark1913 commented 1 year ago

Overview

On occasion, the tool returns incorrect first_seen/last_seen dates for archived codes. An example:

{
    "https://syria.tv": {
        "current_UA_code": [
            "UA-216952176-1"
        ],
        "current_GA_code": [],
        "current_GTM_code": [
            "GTM-MSBSQ22"
        ],
        "archived_UA_codes": {
            "UA-47211704-1": {
                "first_seen": "16/02/2015:03:46",
                "last_seen": "13/01/2016:06:26"
            },
            "UA-97335575-1": {
                "first_seen": "18/01/2022:16:36",
                "last_seen": "10/01/2021:16:21"
            },
            "UA-216952176-1": {
                "first_seen": "09/06/2023:08:46",
                "last_seen": "09/06/2023:08:46"
            }
        },
        "archived_GA_codes": {},
        "archived_GTM_codes": {
            "GTM-MSBSQ22": {
                "first_seen": "18/01/2022:16:36",
                "last_seen": "01/01/2023:03:50"
            }
        }
    }
}

In some places last_seen is earlier than first_seen (for UA-97335575-1, first_seen is 18/01/2022 but last_seen is 10/01/2021), and there isn't really a logical progression from code to code.

Cause and possible solution

I think the main culprit here is that the results dictionary is being updated asynchronously. Since we're calling asyncio.gather() on a collection of tasks, the snapshots are not processed in chronological order, timestamp by timestamp. Instead, a bunch of async calls run concurrently, and whichever finishes first sets "first_seen".
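The effect can be reproduced in isolation. This is a minimal sketch, not the tool's actual code: `fetch_snapshot`, the code `UA-EXAMPLE-1`, and the sleep delays are all hypothetical stand-ins for real Wayback requests whose latency has nothing to do with the age of the snapshot.

```python
import asyncio

# Hypothetical stand-in for processing one archived snapshot. The delay
# simulates network latency that is unrelated to the snapshot's date.
async def fetch_snapshot(timestamp, delay, results):
    await asyncio.sleep(delay)
    entry = results.setdefault("UA-EXAMPLE-1", {})
    # Naive logic: the first task to arrive sets first_seen,
    # and every later arrival overwrites last_seen.
    entry.setdefault("first_seen", timestamp)
    entry["last_seen"] = timestamp

async def naive_run():
    results = {}
    # Tasks are listed chronologically but finish in the order 2022, 2023, 2021.
    await asyncio.gather(
        fetch_snapshot("10/01/2021:16:21", 0.05, results),
        fetch_snapshot("18/01/2022:16:36", 0.01, results),
        fetch_snapshot("01/01/2023:03:50", 0.03, results),
    )
    return results

naive_results = asyncio.run(naive_run())
print(naive_results)
```

With these delays the 2022 snapshot finishes first and the 2021 snapshot finishes last, so first_seen ends up later than last_seen, as in the output above.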

This could be solved by adding some extra logic that assumes the dates will be processed out of order and updates first_seen/last_seen only when the date being processed is earlier or later than the stored value.
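A rough sketch of that extra logic follows. The helper name `update_seen` is hypothetical, and the format string is inferred from the dates in the example output rather than taken from the codebase. The dates are parsed before comparing because dd/mm/YYYY strings do not sort chronologically as plain text.

```python
from datetime import datetime

# Assumed from the example output, e.g. "18/01/2022:16:36".
TS_FORMAT = "%d/%m/%Y:%H:%M"

def update_seen(entry, timestamp):
    """Widen entry's first_seen/last_seen range so it includes timestamp."""
    seen = datetime.strptime(timestamp, TS_FORMAT)
    first = entry.get("first_seen")
    last = entry.get("last_seen")
    # Only move first_seen backwards and last_seen forwards in time.
    if first is None or seen < datetime.strptime(first, TS_FORMAT):
        entry["first_seen"] = timestamp
    if last is None or seen > datetime.strptime(last, TS_FORMAT):
        entry["last_seen"] = timestamp
    return entry

# Dates arrive out of order, as they would from concurrent tasks.
entry = {}
for ts in ["18/01/2022:16:36", "10/01/2021:16:21", "01/01/2023:03:50"]:
    update_seen(entry, ts)
print(entry)
```

Because each update only ever widens the range, the final result no longer depends on the order in which the tasks finish.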

msramalho commented 1 year ago

Indeed, this is expected for a list of async tasks: the first task to reach a certain point dictates the first/last values, and there's no order guarantee.

One option is to do the data gathering in async tasks, as it is now, but then collate the result from the list of task results. Another idea, if this is only an issue for the timestamps, is an extra comparison of each timestamp against the stored value using min/max or </>.
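The first option might look roughly like the sketch below. `fetch_codes`, `collate`, and the code `UA-EXAMPLE-1` are hypothetical names, and the 14-digit Wayback-style timestamps (YYYYMMDDhhmmss) are an assumption about what each task would have on hand. Each task only returns its findings; the shared dictionary is built afterwards in a single, ordered pass.

```python
import asyncio

# Hypothetical task: returns what it found instead of mutating shared state.
async def fetch_codes(timestamp, delay):
    await asyncio.sleep(delay)  # simulated network latency
    return timestamp, ["UA-EXAMPLE-1"]

def collate(task_results):
    """Build first_seen/last_seen from (timestamp, codes) pairs."""
    collated = {}
    # YYYYMMDDhhmmss strings sort chronologically, so sorting by the
    # timestamp restores the order that concurrency scrambled.
    for timestamp, codes in sorted(task_results, key=lambda r: r[0]):
        for code in codes:
            entry = collated.setdefault(code, {"first_seen": timestamp})
            entry["last_seen"] = timestamp
    return collated

async def main():
    task_results = await asyncio.gather(
        fetch_codes("20220118163600", 0.01),
        fetch_codes("20210110162100", 0.03),
        fetch_codes("20230101035000", 0.02),
    )
    return collate(task_results)

collated = asyncio.run(main())
print(collated)
```

asyncio.gather() returns results in the order the awaitables were passed in, not the order they finished, so the explicit sort is only needed when the input list itself is not already chronological.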