Data analytics automated extraction (ReadTheDocs) #897

Closed Manukapp closed 1 year ago

Manukapp commented 1 year ago

What is the goal?

Extract meaningful data from the ReadTheDocs analytics on traffic and search queries for the DataLad Handbook. To automate the extraction, I suggest using the Python library mechanize.

Solution Descriptions

Firstly, because all analytics are restricted to a 30-day window, a mechanism is needed to store the monthly results (e.g. Dec_2022) so that yearly progression can be tracked.
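A minimal sketch of such a storage mechanism, assuming each monthly export is available as CSV text. The function names `snapshot_name` and `store_snapshot`, the `kind` prefix, and the output directory are hypothetical and only illustrate the Dec_2022-style naming.

```python
from datetime import date
from pathlib import Path

# Fixed English abbreviations so file names do not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def snapshot_name(kind: str, day: date) -> str:
    """Build a file name such as 'traffic_Dec_2022.csv' (hypothetical scheme)."""
    return f"{kind}_{MONTHS[day.month - 1]}_{day.year}.csv"


def store_snapshot(csv_text: str, kind: str, day: date, outdir: Path) -> Path:
    """Write one 30-day export into outdir and return the path of the new file."""
    outdir.mkdir(parents=True, exist_ok=True)
    target = outdir / snapshot_name(kind, day)
    target.write_text(csv_text, encoding="utf-8")
    return target
```

Running this once per month, before the 30-day window rolls over, would leave one file per month and per analytics type that can later be concatenated for a yearly view.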

Importantly, as mentioned in #889, two types of analytics are available: Traffic, composed of 4 columns ("Date", "Version", "Path", "Views"), and Search, composed of 3 columns ("Date", "Query", "Total Results").

Secondly, here is a list of hypotheses about the data, with potential approaches.

Search Analytics per 30 days

Traffic Analytics per 30 days

Future directions

Correlations between traffic and search analytics? E.g. a correlation between the search query categories mentioned in the section above and landings on the glossary page.
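One way such a correlation could be explored is sketched below with pandas, assuming both exports use the column names listed earlier in this issue. The sample rows, the `/glossary.html` path, and the choice of "searches per day" as the search-side metric are invented for illustration only.

```python
import io

import pandas as pd

# Invented sample data that mimics the two export formats.
traffic_csv = """Date,Version,Path,Views
2022-12-01,latest,/glossary.html,12
2022-12-02,latest,/glossary.html,8
2022-12-03,latest,/glossary.html,20
2022-12-03,latest,/index.html,50
"""
search_csv = """Date,Query,Total Results
2022-12-01,annex,10
2022-12-01,clone,7
2022-12-02,annex,10
2022-12-03,sibling,4
2022-12-03,annex,10
2022-12-03,clone,7
"""

traffic = pd.read_csv(io.StringIO(traffic_csv), parse_dates=["Date"])
search = pd.read_csv(io.StringIO(search_csv), parse_dates=["Date"])

# Daily views of the (assumed) glossary page.
glossary_views = (
    traffic[traffic["Path"] == "/glossary.html"].groupby("Date")["Views"].sum()
)
# Number of search queries recorded per day.
searches_per_day = search.groupby("Date").size()

# Align both series on the date and keep only days present in both.
daily = pd.concat(
    [glossary_views.rename("glossary_views"), searches_per_day.rename("searches")],
    axis=1,
).dropna()

# Pearson correlation between the two daily series.
r = daily["glossary_views"].corr(daily["searches"])
print(daily)
print(f"Pearson r = {r:.2f}")
```

With only 30 days per export, any such coefficient would be a rough indication at best, which is another reason to accumulate the monthly snapshots first.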

adswa commented 1 year ago

Those are nice aims for the scraped analytics data! If you're looking for a technical solution to generate those metrics from the data, take a look at pandas. It's a powerful Python library that is often used for this kind of data science task.
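As a rough illustration of what that could look like, here is a small pandas sketch that sums views per page and per day. It assumes the traffic export is a CSV with the columns "Date", "Version", "Path", "Views" described above; the sample rows are made up.

```python
import io

import pandas as pd

# Made-up rows in the shape of the traffic export.
sample_csv = """Date,Version,Path,Views
2022-12-01,latest,/index.html,40
2022-12-01,latest,/glossary.html,12
2022-12-02,latest,/index.html,35
2022-12-02,stable,/glossary.html,8
"""

traffic = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Date"])

# Total views per page across all versions, most viewed first.
views_per_path = traffic.groupby("Path")["Views"].sum().sort_values(ascending=False)

# Total views per day across all pages.
views_per_day = traffic.groupby("Date")["Views"].sum()

print(views_per_path)
print(views_per_day)
```

For a real export, `io.StringIO(sample_csv)` would be replaced by the path of the downloaded file.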

Manukapp commented 1 year ago

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_redirect(mechanize.HTTPRedirectHandler)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("https://readthedocs.org/")
# force mechanize to treat the response as HTML so that forms can be parsed
br._factory.is_html = True
base_url = br.geturl()

print(base_url)

br.open(base_url + "accounts/login/")
print(br.title())

br.select_form(nr=0)
print(br.form)
br.form['login'] = "USERNAME"
br.form['password'] = "Password"
response1 = br.submit()
print(br.geturl())

Inside = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"
br.open(Inside)
print(br.geturl())

br.click(label="Download all data")

Manukapp commented 1 year ago

https://stackoverflow.com/questions/1806238/mechanize-python-click-a-button

mih commented 1 year ago

The following code downloads the traffic analytics data for me:

import mechanize

class ReadTheDocs:
    traffic_url = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"

    def __init__(self, username, password):
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_redirect(mechanize.HTTPRedirectHandler)
        br.addheaders = [
            ('User-agent',
             'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')
        ]
        br.open("https://readthedocs.org")
        # force mechanize to treat the response as HTML so that forms can be parsed
        br._factory.is_html = True
        base_url = br.geturl()
        # login
        br.open(base_url + "/accounts/login/")
        # find first form
        br.select_form(nr=0)
        br.form['login'] = username
        br.form['password'] = password
        br.submit()
        # keep browser running
        self.br = br

    def get_traffic_analytics(self):
        self.br.open(ReadTheDocs.traffic_url)
        # the download is done via a form button. find the form
        self.br.select_form(nr=0)
        # press the button
        r = self.br.submit()
        # download CSV data as text
        return r.get_data()
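
The returned data still has to be turned into rows before it can be analyzed. The following is a sketch only, assuming that `get_data()` yields UTF-8 encoded bytes and that the export starts with the header "Date", "Version", "Path", "Views" mentioned earlier in this issue. The helper name `parse_export` and the sample bytes are hypothetical.

```python
import csv
import io


def parse_export(raw: bytes) -> list[dict[str, str]]:
    """Decode a CSV export and return one dict per data row, keyed by column name."""
    text = raw.decode("utf-8")  # assumption: the export is UTF-8 encoded
    return list(csv.DictReader(io.StringIO(text)))


# Made-up bytes standing in for the result of get_traffic_analytics().
sample = b"Date,Version,Path,Views\r\n2022-12-01,latest,/index.html,40\r\n"
rows = parse_export(sample)
print(rows[0]["Path"], rows[0]["Views"])
```

All values come back as strings here, so numeric columns such as "Views" would need an explicit `int()` conversion, or the same text could be handed to pandas instead.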