Those are nice aims for the scraped analytics data! If you're looking for a technical solution to generate those metrics from the data, take a look at pandas. It's a powerful Python library often used for this kind of data science task.
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_redirect(mechanize.HTTPRedirectHandler)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("https://readthedocs.org/")
# force mechanize to treat the response as HTML
br._factory.is_html = True
base_url = br.geturl()
print(base_url)
br.open(base_url + "accounts/login/")
print(br.title())
br.select_form(nr=0)
print(br.form)
br.form['login'] = "USERNAME"
br.form['password'] = "PASSWORD"
response1 = br.submit()
print(br.geturl())
Inside = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"
br.open(Inside)
print(br.geturl())
br.click(label="Download all data")
The following code downloads the traffic analytics data for me:
import mechanize


class ReadTheDocs:
    traffic_url = "https://readthedocs.org/dashboard/datalad-handbook/traffic-analytics/"

    def __init__(self, username, password):
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_redirect(mechanize.HTTPRedirectHandler)
        br.addheaders = [
            ('User-agent',
             'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')
        ]
        br.open("https://readthedocs.org")
        # force mechanize to treat the response as HTML
        br._factory.is_html = True
        base_url = br.geturl()
        # login
        br.open(base_url + "/accounts/login/")
        # find first form
        br.select_form(nr=0)
        br.form['login'] = username
        br.form['password'] = password
        br.submit()
        # keep the logged-in browser session for later requests
        self.br = br

    def get_traffic_analytics(self):
        self.br.open(ReadTheDocs.traffic_url)
        # the download is done via a form button. find the form
        self.br.select_form(nr=0)
        # press the button
        r = self.br.submit()
        # return the downloaded CSV data
        return r.get_data()
What is the goal?
Extract meaningful data from the ReadTheDocs analytics on traffic and search queries for the DataLad Handbook. To automate the download, the mechanize Python library is suggested.
Solution Descriptions
First, because all analytics are restricted to a 30-day window, a mechanism is needed to store the monthly results (e.g. Dec_2022) so that yearly progression can be tracked.
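One way to keep monthly results could look like the sketch below. It only writes each downloaded export to a file named after the month; the function names, the `kind` prefix, and the file naming scheme are assumptions for illustration, not something ReadTheDocs or the code in this thread prescribes.

```python
from datetime import date
from pathlib import Path

# Fixed English abbreviations so the labels do not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def snapshot_label(day: date) -> str:
    """Return a label such as 'Dec_2022' for the month containing `day`."""
    return f"{MONTHS[day.month - 1]}_{day.year}"


def store_snapshot(csv_bytes: bytes, kind: str, day: date, out_dir: Path) -> Path:
    """Write one downloaded export to <out_dir>/<kind>_<Mon>_<YYYY>.csv.

    `kind` is a hypothetical prefix such as "traffic" or "search".
    An existing file for the same month is overwritten.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{kind}_{snapshot_label(day)}.csv"
    target.write_bytes(csv_bytes)
    return target
```

Running this once a month, e.g. `store_snapshot(data, "traffic", date.today(), Path("analytics"))`, would leave one file per month and analytics type.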
Importantly, as mentioned in #889, two types of analytics are available: Traffic, composed of 4 columns ("Date", "Version", "Path", "Views"), and Search, composed of 3 columns ("Date", "Query", "Total Results").
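A minimal sketch for reading either export with the standard library is shown here. It assumes the download is UTF-8 encoded bytes and that the CSV header row uses exactly the column names listed above; `read_export` and the two constants are hypothetical names.

```python
import csv
import io

# Column names as listed above; that the CSV header matches them exactly is an assumption.
TRAFFIC_COLUMNS = ["Date", "Version", "Path", "Views"]
SEARCH_COLUMNS = ["Date", "Query", "Total Results"]


def read_export(csv_bytes, expected_columns):
    """Parse a downloaded CSV export into a list of dicts, one per row."""
    text = csv_bytes.decode("utf-8")  # assumption: the export is UTF-8
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != expected_columns:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)
```

All values come back as strings, so "Views" and "Total Results" need an `int()` conversion before any arithmetic.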
Second, here is a list of hypotheses about the data, with potential approaches.
Search Analytics per 30 days
What are the top 5-10 most popular search queries? Counting the most frequent entries in the "Query" column.
Which are neuro-related? In the "Query" column. Keywords: "mri" "neuroimaging" "neuro" "bids" "nifti" "dicom" "human connectome project" "hcp" "openneuro" "osf"
Which are DataLad commands? In the "Query" column. Keywords: "datalad" "save" "get" "delete" "create" "download-url" etc.
Which are computational concepts? In the "Query" column. Keywords: "containers" "metadata" "extension" "yoda" "ssh" "docker" "import" "zip"
Which are cloud & storage related? In the "Query" column. Keywords: "git" "annex" "zenodo" "ria" "multiple source" "multiple location" "google" "origin" "jupyter" "logging" "archive"
Which are about code functionalities? In the "Query" column. Keywords: "python" "path" "import" "install" "text2git" "sys.path"
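The search questions above could be approached roughly as follows. This is a sketch that assumes each row is a dict with a "Query" key (as `csv.DictReader` would produce) and that plain substring matching is good enough; the category names and function names are made up for illustration, and the command list only contains the commands named above.

```python
from collections import Counter

# Keyword lists copied from the categories above.
CATEGORIES = {
    "neuro": ["mri", "neuroimaging", "neuro", "bids", "nifti", "dicom",
              "human connectome project", "hcp", "openneuro", "osf"],
    "datalad command": ["datalad", "save", "get", "delete", "create",
                        "download-url"],  # incomplete, the source ends with "etc"
    "computational concept": ["containers", "metadata", "extension", "yoda",
                              "ssh", "docker", "import", "zip"],
    "cloud & storage": ["git", "annex", "zenodo", "ria", "multiple source",
                        "multiple location", "google", "origin", "jupyter",
                        "logging", "archive"],
    "code functionality": ["python", "path", "import", "install", "text2git",
                           "sys.path"],
}


def top_queries(rows, n=10):
    """Return the n most frequent queries, ignoring case and outer whitespace."""
    counts = Counter(row["Query"].strip().lower() for row in rows)
    return counts.most_common(n)


def categorize(query):
    """Return the names of all categories with a keyword contained in the query."""
    q = query.lower()
    return [name for name, keywords in CATEGORIES.items()
            if any(keyword in q for keyword in keywords)]
```

Substring matching is crude: "get" also matches "target" and "git" matches "digital", so short keywords may need whole-word matching in practice. A query can also land in several categories, since "import" appears in two lists.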
Traffic Analytics per 30 days
What are the top 10 most popular pages? Summing "Views" for identical entries in "Path" and listing from greatest to least.
Total Handbook view count? Summing all numbers in "Views".
Most popular basics chapter? Summing "Views" for "/basics/" in "Path" and listing from greatest to least.
Most popular intro chapter? Summing "Views" for "/intro/" in "Path" and listing from greatest to least.
Most popular beyondbasics chapter? Summing "Views" for "/beyondbasics/" in "Path" and listing from greatest to least.
Most popular usecases? Summing "Views" for "/usecases/" in "Path" and listing from greatest to least.
Glossary total count? Summing "Views" for glossary entries in "Path" only.
Ratio of previous Handbook versions relative to latest? Counting all non-"latest" (NL) versions and dividing: NL/(latest-NL)
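A rough sketch of the traffic aggregations, using only the standard library, might look like this. It assumes each row is a dict with "Version", "Path" and "Views" keys and that "Views" arrives as a string; the function names are hypothetical, sections are matched by substring, and the version comparison sums views rather than counting rows, which is one possible reading of the last item.

```python
from collections import Counter


def views_per_path(rows):
    """Sum "Views" over all rows that share the same "Path"."""
    totals = Counter()
    for row in rows:
        totals[row["Path"]] += int(row["Views"])
    return totals


def top_pages(rows, n=10, section=None):
    """Return the n most viewed paths, optionally limited to e.g. "/basics/"."""
    totals = views_per_path(rows)
    if section is not None:
        totals = Counter({path: views for path, views in totals.items()
                          if section in path})
    return totals.most_common(n)


def total_views(rows, containing=""):
    """Sum "Views" over rows whose "Path" contains the given substring."""
    return sum(int(row["Views"]) for row in rows if containing in row["Path"])


def views_by_version(rows):
    """Return (views of "latest", views of all other versions)."""
    latest = sum(int(r["Views"]) for r in rows if r["Version"] == "latest")
    other = sum(int(r["Views"]) for r in rows if r["Version"] != "latest")
    return latest, other
```

With these helpers, `top_pages(rows)` covers the top 10 pages, `top_pages(rows, section="/usecases/")` the per-section rankings, and `total_views(rows, "glossary")` the glossary count. The two numbers from `views_by_version` can then be combined into whichever ratio is preferred. With pandas, as suggested earlier in the thread, the first aggregation would be a `groupby("Path")["Views"].sum()`.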
Future directions
Correlations between traffic & search analytics? E.g. a correlation between the search query categories mentioned in the section above and landings on the glossary page.
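As a starting point for such a comparison, the sketch below aligns two daily series and computes a Pearson correlation by hand. It assumes both exports are lists of dicts that share the same "Date" strings, that a single keyword stands in for a query category, and that glossary pages can be found by the substring "glossary" in "Path"; all function and parameter names are hypothetical.

```python
from collections import Counter
from math import sqrt


def daily_series(search_rows, traffic_rows, keyword, path_part="glossary"):
    """Return (searches per day containing keyword, views per day of matching paths)."""
    searches = Counter(r["Date"] for r in search_rows
                       if keyword in r["Query"].lower())
    views = Counter()
    for r in traffic_rows:
        if path_part in r["Path"]:
            views[r["Date"]] += int(r["Views"])
    # Days missing from one series count as zero (Counter returns 0).
    days = sorted(set(searches) | set(views))
    return [searches[d] for d in days], [views[d] for d in days]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

With at most 30 days per export, any coefficient from a single month should be read with caution; stored monthly snapshots would give a longer series.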