Data analytics automated extraction (ReadTheDocs)

Manukapp commented 1 year ago

What is the goal?

Extract meaningful data from the ReadTheDocs analytics on traffic and search queries for the DataLad Handbook. To automate this, I suggest using the mechanize Python library.

Solution Descriptions

First, because all analytics are restricted to a 30-day window, we need a mechanism to store the monthly results (e.g. Dec_2022) so that yearly progression can be tracked.
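
A minimal sketch of such a storage mechanism, assuming each monthly export is available as CSV text. The function names, the `analytics` directory, and the `<kind>_<Mon>_<Year>.csv` naming are hypothetical and only follow the Dec_2022 example above.

```python
from datetime import date
from pathlib import Path

# Fixed English abbreviations so file names do not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def snapshot_path(kind, when, root="analytics"):
    """Return e.g. analytics/traffic_Dec_2022.csv for kind='traffic'."""
    return Path(root) / f"{kind}_{MONTHS[when.month - 1]}_{when.year}.csv"


def store_snapshot(csv_text, kind, when, root="analytics"):
    """Write one monthly export to disk; refuse to overwrite an earlier one."""
    target = snapshot_path(kind, when, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        raise FileExistsError(f"{target} is already stored")
    target.write_text(csv_text, encoding="utf-8")
    return target
```

Refusing to overwrite keeps an earlier month's file intact if the script is accidentally run twice.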

Importantly, as mentioned in #889, two types of analytics are available: Traffic, composed of 4 columns ("Date", "Version", "Path", "Views"), and Search, composed of 3 columns ("Date", "Query", "Total Results").
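
As a rough illustration, the two layouts could be read with the standard library like this. It assumes the exported CSV header row uses exactly the column names listed above; `read_export` is a hypothetical helper name.

```python
import csv
import io

TRAFFIC_COLUMNS = ["Date", "Version", "Path", "Views"]
SEARCH_COLUMNS = ["Date", "Query", "Total Results"]


def read_export(csv_text, expected_columns):
    """Parse CSV text into a list of dicts and check the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != expected_columns:
        raise ValueError(
            f"unexpected columns {reader.fieldnames}, wanted {expected_columns}"
        )
    return list(reader)
```

Checking the header up front means a traffic file passed where a search file is expected fails immediately instead of producing empty results later.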

Second, here is a list of questions about the data, with potential approaches.

Search Analytics per 30 days

What are the top 5-10 most popular search queries? Count the most frequent entries in the "Query" column.
Which are neuro-related? Match keywords in the "Query" column: "mri" "neuroimaging" "neuro" "bids" "nifti" "dicom" "human connectome project" "hcp" "openneuro" "osf"
Which are DataLad commands? Match keywords in the "Query" column: "datalad" "save" "get" "delete" "create" "download-url" etc.
Which are computational concepts? Match keywords in the "Query" column: "containers" "metadata" "extension" "yoda" "ssh" "docker" "import" "zip"
Which are cloud and storage related? Match keywords in the "Query" column: "git" "annex" "zenodo" "ria" "multiple source" "multiple location" "google" "origin" "jupyter" "logging" "archive"
Which are about code functionalities? Match keywords in the "Query" column: "python" "path" "import" "install" "text2git" "sys.path"
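
The search questions above could be sketched as follows, assuming rows are dicts with a "Query" key. The category names are hypothetical labels, and plain substring matching is an assumption that will over-match (e.g. "get" also matches "widget").

```python
from collections import Counter

# Keyword lists copied from the questions above; category names are made up.
CATEGORIES = {
    "neuro": ["mri", "neuroimaging", "neuro", "bids", "nifti", "dicom",
              "human connectome project", "hcp", "openneuro", "osf"],
    "datalad command": ["datalad", "save", "get", "delete", "create",
                        "download-url"],
    "computational": ["containers", "metadata", "extension", "yoda", "ssh",
                      "docker", "import", "zip"],
    "cloud and storage": ["git", "annex", "zenodo", "ria", "multiple source",
                          "multiple location", "google", "origin", "jupyter",
                          "logging", "archive"],
    "code": ["python", "path", "import", "install", "text2git", "sys.path"],
}


def top_queries(rows, n=10):
    """Most frequent queries, ignoring case and surrounding whitespace."""
    counts = Counter(row["Query"].strip().lower() for row in rows)
    return counts.most_common(n)


def categorize(query):
    """Return the sorted names of all categories whose keywords occur in query."""
    q = query.lower()
    return sorted(name for name, keywords in CATEGORIES.items()
                  if any(keyword in q for keyword in keywords))
```

A query can land in several categories, since some keywords (such as "import") appear in more than one list.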

Traffic Analytics per 30 days

What are the top 10 most popular pages? Sum "Views" over rows with the same "Path" and sort from highest to lowest.
Total Handbook view count? Sum all numbers in "Views".
Most popular basics chapter? Sum "Views" for each "Path" containing "/basics/" and sort from highest to lowest.
Most popular intro chapter? Sum "Views" for each "Path" containing "/intro/" and sort from highest to lowest.
Most popular beyondbasics chapter? Sum "Views" for each "Path" containing "/beyondbasics/" and sort from highest to lowest.
Most popular usecases? Sum "Views" for each "Path" containing "/usecases/" and sort from highest to lowest.
Glossary total count? Sum "Views" only for rows whose "Path" points to the glossary.
Ratio of views of previous handbook versions relative to latest? Count all non-"latest" (NL) versions and divide: NL/(latest-NL)
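
A possible sketch of the traffic questions, assuming rows are dicts with "Version", "Path" and "Views" keys as strings. The function names are hypothetical. The version ratio here is computed as non-latest views divided by latest views, which is only one reading of the last item; the formula written above differs, so swap in whichever definition is intended.

```python
from collections import Counter


def views_per_page(rows, contains=None):
    """Sum "Views" per "Path", optionally only for paths containing a substring."""
    totals = Counter()
    for row in rows:
        if contains is None or contains in row["Path"]:
            totals[row["Path"]] += int(row["Views"])
    return totals


def total_views(rows):
    """Total view count over all rows."""
    return sum(int(row["Views"]) for row in rows)


def version_ratio(rows):
    """Views of non-"latest" versions divided by views of "latest" (assumed definition)."""
    latest = sum(int(r["Views"]) for r in rows if r["Version"] == "latest")
    other = sum(int(r["Views"]) for r in rows if r["Version"] != "latest")
    return other / latest if latest else float("nan")
```

For example, `views_per_page(rows).most_common(10)` lists the top 10 pages, and `sum(views_per_page(rows, "glossary").values())` gives a glossary total.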

Future directions

Correlations between traffic and search analytics? E.g. a correlation between the search query categories listed in the section above and landings on the glossary page.

welcome[bot] commented 1 year ago

Welcome Banner (Image: CC-BY license, The Turing Way Community, & Scriberia. Zenodo. http://doi.org/10.5281/zenodo.3332808) Hi there, and welcome to the DataLad Handbook! :orange_book: :wave: Thank you for filing an issue. We're excited to have your input and welcome your idea! :blush: If you haven't done so already, please make sure you check out our Code of Conduct.

adswa commented 1 year ago

Those are nice aims for the scraped analytics data! If you're looking for a technical solution to generate those metrics from the data, take a look at pandas. It's a powerful Python library often used for this kind of data science task.
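
To give a flavor, here is a small pandas sketch that sums views per month, which is the kind of aggregation needed to track progression across stored monthly exports. The sample rows and the date format are made up for illustration; the real export may format "Date" differently.

```python
import io

import pandas as pd

# Hypothetical sample in the traffic layout (Date, Version, Path, Views).
sample = io.StringIO(
    "Date,Version,Path,Views\n"
    "2022-12-01,latest,/basics/a.html,10\n"
    "2022-12-02,latest,/basics/a.html,5\n"
    "2023-01-03,latest,/intro/b.html,7\n"
)

df = pd.read_csv(sample, parse_dates=["Date"])

# Group rows by calendar month (as "YYYY-MM" strings) and sum the views.
monthly = df.groupby(df["Date"].dt.strftime("%Y-%m"))["Views"].sum()
print(monthly)
```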

Manukapp commented 1 year ago

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_redirect(mechanize.HTTPRedirectHandler)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("https://readthedocs.org/")
# force mechanize to treat the response as HTML
br._factory.is_html = True
base_url = br.geturl()

print(base_url)

br.open(base_url + "accounts/login/")
print(br.title())

br.select_form(nr=0)
print(br.form)
br.form['login'] = "USERNAME"
br.form['password'] = "PASSWORD"
response1 = br.submit()
print(br.geturl())

Inside = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"
br.open(Inside)
print(br.geturl())

br.click(label="Download all data")

Manukapp commented 1 year ago

https://stackoverflow.com/questions/1806238/mechanize-python-click-a-button

mih commented 1 year ago

The following code downloads the traffic analytics data for me:

import mechanize

class ReadTheDocs:
    traffic_url = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"

    def __init__(self, username, password):
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_redirect(mechanize.HTTPRedirectHandler)
        br.addheaders = [
            ('User-agent',
             'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')
        ]
        br.open("https://readthedocs.org")
        # force mechanize to treat the response as HTML
        br._factory.is_html = True
        base_url = br.geturl()
        # login
        br.open(base_url + "/accounts/login/")
        # find first form
        br.select_form(nr=0)
        br.form['login'] = username
        br.form['password'] = password
        br.submit()
        # keep browser running
        self.br = br

    def get_traffic_analytics(self):
        self.br.open(ReadTheDocs.traffic_url)
        # the download is done via a form button. find the form
        self.br.select_form(nr=0)
        # press the button
        r = self.br.submit()
        # download CSV data as text
        return r.get_data()
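
Since `get_data()` returns raw bytes, a small helper along these lines could turn the download into rows. This is a sketch: `parse_traffic_bytes` is a hypothetical name, and UTF-8 encoding (with an optional byte-order mark) is an assumption about the export.

```python
import csv
import io


def parse_traffic_bytes(raw):
    """Decode the downloaded bytes and return one dict per CSV row."""
    # "utf-8-sig" also strips a leading byte-order mark if one is present.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))


# Intended use with the class above (needs real credentials, so not run here):
#   rtd = ReadTheDocs(username, password)
#   rows = parse_traffic_bytes(rtd.get_traffic_analytics())
```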

datalad-handbook / book

Data analytics automated extraction (ReadTheDocs) #897
