bellingcat / wayback-google-analytics

A lightweight tool for scraping current and historic Google Analytics data
https://pypi.org/project/wayback-google-analytics/
MIT License

start and end parameters often result in error #24

Closed: Tiptop4792 closed 9 months ago

Tiptop4792 commented 9 months ago

Hey! First of all: great tool - this can be super useful!

I've been dabbling around with it for a bit, but it still throws a lot of errors. I especially encounter a couple of issues with the start and end parameters, e.g.:

This one works:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2015 -f yearly -o xlsx

While this one throws an error:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -e 01/01/2015 -f yearly -o xlsx

Also, this one doesn't work:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2012 -e 01/01/2015 -f yearly -o xlsx

That's what I'm getting in the console:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2012 -e 01/01/2015 -f yearly -o xlsx
Retrieving current codes for: https://tagesschau.de
Finished gathering current codes for: https://tagesschau.de
Retrieving archived codes for: https://tagesschau.de
CDX url: http://web.archive.org/cdx/search/cdx?url=https://tagesschau.de&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101145508', '20130101134651', '20140101002220']
Retrieving codes from url: https://web.archive.org/web/20140101002220/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20140101002220/https://tagesschau.de
Retrieving codes from url: https://web.archive.org/web/20130101134651/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20130101134651/https://tagesschau.de
Retrieving codes from url: https://web.archive.org/web/20120101145508/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20120101145508/https://tagesschau.de
Finished retrieving archived codes for: https://tagesschau.de
Retrieving current codes for: https://nytimes.com
Finished gathering current codes for: https://nytimes.com
Retrieving archived codes for: https://nytimes.com
CDX url: http://web.archive.org/cdx/search/cdx?url=https://nytimes.com&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101010418', '20130101011853', '20140101005350']
Retrieving current codes for: https://spiegel.de
Finished gathering current codes for: https://spiegel.de
Retrieving archived codes for: https://spiegel.de
CDX url: http://web.archive.org/cdx/search/cdx?url=https://spiegel.de&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Retrieving codes from url: https://web.archive.org/web/20140101005350/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20140101005350/https://nytimes.com
Retrieving codes from url: https://web.archive.org/web/20130101011853/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20130101011853/https://nytimes.com
Timestamps from CDX api: ['20120101003932', '20130101025925', '20140101051638', '20120102042211', '20130107072516']
Retrieving codes from url: https://web.archive.org/web/20130101025925/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20130101025925/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20140101051638/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20140101051638/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120102042211/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20120102042211/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120101010418/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20120101010418/https://nytimes.com
Finished retrieving archived codes for: https://nytimes.com
Retrieving codes from url: https://web.archive.org/web/20130107072516/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20130107072516/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120101003932/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20120101003932/https://spiegel.de
Finished retrieving archived codes for: https://spiegel.de
Retrieving current codes for: https://washingtonpost.com
Finished gathering current codes for: https://washingtonpost.com
Retrieving archived codes for: https://washingtonpost.com
CDX url: http://web.archive.org/cdx/search/cdx?url=https://washingtonpost.com&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101003932', '20130101002535', '20140101000350', '20120229065623', '20130626071932']
Retrieving codes from url: https://web.archive.org/web/20130626071932/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20130626071932/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20130101002535/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20130101002535/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20140101000350/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20140101000350/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20120101003932/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20120101003932/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20120229065623/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20120229065623/https://washingtonpost.com
Finished retrieving archived codes for: https://washingtonpost.com
[{'https://tagesschau.de': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://washingtonpost.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://nytimes.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://spiegel.de': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}]
Traceback (most recent call last):
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/bin/wayback-google-analytics", line 8, in <module>
    sys.exit(main_entrypoint())
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/main.py", line 182, in main_entrypoint
    asyncio.run(main(args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/main.py", line 100, in main
    write_output(output_file, args.output, results)
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/output.py", line 67, in write_output
    codes_df = get_codes_df(results)
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/output.py", line 183, in get_codes_df
    codes_df.groupby("code")
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/frame.py", line 8872, in groupby
    return DataFrameGroupBy(
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1274, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/groupby/grouper.py", line 1009, in get_grouper
    raise KeyError(gpr)
KeyError: 'code'

Any idea what's going on? Many thanks!

msramalho commented 9 months ago

Thanks for flagging this.

The error is in this line: https://github.com/bellingcat/wayback-google-analytics/blob/525ed217e527a979ba5c7a70a208366e28a1b752/wayback_google_analytics/output.py#L183

By inspection, the KeyError can only occur if the code_list variable is empty ([]), because a DataFrame built from an empty list has no "code" column to group on. Full code for that is here

To replicate:

import pandas as pd
df = pd.DataFrame([])
df.groupby("code")
# --> KeyError "code"

This means the run did not retrieve any valid codes. I'll wait for @jclark1913's input, but the fix should be to skip the groupby operation when the code results are empty.
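
A minimal sketch of that kind of guard, for illustration only. It is not the project's actual get_codes_df: the function name summarize_codes, the "website" column, and the "website_count" output column are all hypothetical. The only point it shows is checking for an empty frame (or a missing "code" column) before calling groupby.

```python
import pandas as pd


def summarize_codes(code_list):
    """Count distinct websites per code, tolerating an empty code_list.

    Hypothetical helper: the name, the "website" column and the
    "website_count" column are assumptions made for this sketch.
    """
    codes_df = pd.DataFrame(code_list)

    # pd.DataFrame([]) has no columns at all, so groupby("code") would
    # raise KeyError: 'code'. Return an empty, well-formed frame instead.
    if codes_df.empty or "code" not in codes_df.columns:
        return pd.DataFrame(columns=["code", "website_count"])

    return (
        codes_df.groupby("code")["website"]
        .nunique()
        .reset_index(name="website_count")
    )


if __name__ == "__main__":
    # No codes found: no exception, just an empty result.
    print(summarize_codes([]))

    # Some codes found: grouped as usual.
    rows = [
        {"code": "UA-1", "website": "a.example"},
        {"code": "UA-1", "website": "b.example"},
        {"code": "GTM-2", "website": "a.example"},
    ]
    print(summarize_codes(rows))
```

Returning an empty frame with the expected columns, rather than None, means whatever writes the output afterwards can treat "no codes found" like any other result.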

jclark1913 commented 9 months ago

Great catch here, @Tiptop4792. I think @msramalho is spot on in his assessment. I'll step through the code to see whether other KeyErrors can occur when no codes are found, and make a PR that fixes this issue.

Tiptop4792 commented 9 months ago

Thank you @msramalho & @jclark1913 for the quick fix - it works! 👏