bertrandmartel / tableau-scraping

Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
MIT License
126 stars 20 forks

Scraping of data rendered at the server-side #31

Closed janPesl closed 2 years ago

janPesl commented 2 years ago

Hello,

thanks a lot for the scraper! Good job!

I am trying to select all options in the dropdown filter following your docs.

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
ws = ts.getWorksheet("Prosjektoversikt")

filters = ws.getFilters()
print(filters)

Here I can find the relevant filter and the values: {'column': 'MDY(Søknadsfrist dato)', 'ordinal': 0, 'values': [20200212, 20200331, 20200422, 20200525, 20200527, 20200902, 20200916, 20201111, 20201118, 20210210, 20210217, 20210317, 20210512, 20210521, 20210915], 'globalFieldName': '[federated.1cjq0fj17kfn4b19erg02187mlqp].[md:Søknadsfrist dato:ok]'}

However, when I try to use these values for filtering, I keep getting the error: value not in list.

wb = ws.setFilter('MDY(Søknadsfrist dato)', 20200212)

I have also tested feeding the value as a string '20200212' and even transcribing the value in the dropdown 'February 12, 2020'.

The only thing that has worked is the force_setFilter function from djay presented in #26. However, I cannot choose 'May 25, 2020' from the dropdown menu (not sure why it is only this one value).

Any suggestions on how to set the filter would be much appreciated!

bertrandmartel commented 2 years ago

@janPesl Hello!

That's interesting. What is your Python version and OS? It seems to fail when getting the index of the integer 20200212 in a list. Can you try this in a Python interpreter:

>>> test=[20200212, 20200331]
>>> test.index(20200212)

Does it work?

There are cases where the filter value is passed as JSON in the API, so I must keep the original format (int) if possible. But I could add another property that holds the string conversion and get the index from that.
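The fallback described above could be sketched like this. Note that find_filter_index is a hypothetical helper for illustration, not part of the tableau-scraping library:

```python
def find_filter_index(values, target):
    """Look up target in values; fall back to comparing string
    representations so 20200212 and "20200212" both match."""
    try:
        return values.index(target)
    except ValueError:
        # value not found as-is; retry on string conversions
        return [str(v) for v in values].index(str(target))

values = [20200212, 20200331, 20200422]
print(find_filter_index(values, 20200212))    # -> 0
print(find_filter_index(values, "20200331"))  # -> 1
```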

janPesl commented 2 years ago

Thank you for your swift reply!

That actually worked, and it seems I have managed to partially resolve the issue. At this stage, it boils down to specific filter values. I am running Python 3.9.7 on Windows 10.

from tableauscraper import TableauScraper as TS
import tableauscraper

url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
ws = ts.getWorksheet("Prosjektoversikt")

filters = ws.getFilters()
print(filters)

filters = [20200525, 20210210]

wb = ws.setFilter('MDY(Søknadsfrist dato)', 20200525)

And I am getting this warning: 2021-10-03 13:44:03,321 - tableauScraper - WARNING - no data dictionary present in response

With the second filter value 20210210, I am not getting any data either, but strangely enough, I am not getting the same error message. This behaviour is the same whether I pass the filter as an int or as a string using the force_setFilter function.

Do you have any idea why these two particular filter values won't work?

Thank you!

bertrandmartel commented 2 years ago

@janPesl I see it now, the data is rendered on server side: https://github.com/bertrandmartel/tableau-scraping#server-side-rendering

Server-side rendering means that only images are shown on the screen, not the actual data, so we can't get the data unless we find a specific filter value that works with client-side rendering (for each value, since dates use filter-delta in your dashboard). In your case, some filter values are rendered client-side and others server-side, which makes it difficult to use the date filter to download all the data.

Getting the data under server-side rendering is always a challenge; check out similar issues: https://github.com/bertrandmartel/tableau-scraping/issues?q=is%3Aissue+label%3Aserver-side-rendering+
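One rough way to spot which filter values fall back to server-side rendering is to apply each value and check whether the worksheet comes back empty. This is only a heuristic (a legitimately empty worksheet looks the same), and the surrounding tableauscraper calls are shown as comments since they hit the network:

```python
import pandas as pd

def looks_server_rendered(df: pd.DataFrame) -> bool:
    """Heuristic: when the response carries no data dictionary
    (server-side rendering), the worksheet's .data is an empty frame."""
    return df.empty

# usage sketch against tableauscraper (network calls):
# wb = ws.setFilter("MDY(Søknadsfrist dato)", date)
# df = wb.getWorksheet("Prosjektoversikt").data
# if looks_server_rendered(df):
#     print(date, "likely rendered server-side")

print(looks_server_rendered(pd.DataFrame()))            # -> True
print(looks_server_rendered(pd.DataFrame({"a": [1]})))  # -> False
```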

janPesl commented 2 years ago

Good catch!

It seems that once the number of items is higher than some threshold x, the worksheet is rendered on the server side.

Is there a way to apply multiple filters at the same time, rather than step-wise?

If I start with wb = ws.setFilter('MDY(Søknadsfrist dato)', 20200525), it will be rendered on the server side and I won't get any data for further filtering. If I start with wb = ws.setFilter('Utlysning', 'ABALONYX AS'), I get an error because 'ABALONYX AS' is 'not in list' for the default 'MDY(Søknadsfrist dato)' filter.

bertrandmartel commented 2 years ago

@janPesl yes, it's possible using filterDelta=True, but I've tried it and it still returns server-side data when some dates are included:

wb = ws.setFilter('MDY(Søknadsfrist dato)', [20200525, 20210210], filterDelta=True)
bertrandmartel commented 2 years ago

@janPesl I fail to understand how the tooltip values are retrieved:

[screenshot: filter tooltip showing the value 24114]

It doesn't seem to be an image but actual data, since I can copy/paste it. The value 24114 from the tooltip above is present somewhere in the API result for May 25. It may be a lead. The table tooltips contain similar data.

janPesl commented 2 years ago

> @janPesl yes it's possible using filterDelta=True but I've tried this and it still returns server side data when some dates are included
>
> wb = ws.setFilter('MDY(Søknadsfrist dato)', [20200525, 20210210], filterDelta=True)

I can retrieve chunks of the data this way:

wb = ws.setFilter('Sektor', 'Annet')
wb = ws.setFilter('MDY(Søknadsfrist dato)', 20200525)
df = wb.getWorksheet("Prosjektoversikt").data

The obvious limitation is that I cannot select any values for the Sektor filter other than ['Annet', 'Helseforetak', 'Instituttsektor', 'Næringsliv', 'Øvrige', 'UoH-sektor']. As a result, I won't get the data for Ukjent and Utlandet.

bertrandmartel commented 2 years ago

@janPesl I didn't find a way to get that data. The only thing I could achieve was to use the render-tooltip server API to get the tooltip HTML output for the sheet Antall prosjekter og søknader:

from tableauscraper import TableauScraper as TS
import pandas as pd

url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
rootWs = ts.getWorksheet("Prosjektoversikt")

dates = [
    t
    for t in rootWs.getFilters()
    if t["column"] == "MDY(Søknadsfrist dato)"
][0]["values"]
print(dates)

for date in dates:
    print(f"get date {date}")
    wb = rootWs.setFilter('MDY(Søknadsfrist dato)', date)
    ws = wb.getWorksheet("Antall prosjekter og søknader")

    tooltipHtml = ws.renderTooltip(x=143, y=3)
    df = pd.read_html(tooltipHtml)[0]
    print(df)

    tooltipHtml = ws.renderTooltip(x=342, y=11)
    df = pd.read_html(tooltipHtml)[0]
    print(df)

repl.it: https://replit.com/@bertrandmartel/TableauForskningsr

It's possible to do the same with the tooltip in the table, but that tooltip doesn't include the value from the column named Søkt beløp, mill kroner, only the values from the other columns:

[screenshot: table tooltip]

janPesl commented 2 years ago

My solution is not elegant, but it gets the job done without having to scrape the tooltip.

There were two tables that were rendered on the server side: the applications from 25 May 2020 and 10 Feb 2021. This solution worked for both.

from tableauscraper import TableauScraper as TS
import pandas as pd
import tableauscraper
import json
import time
import requests

#filter function from issue 26
def force_setFilter(wb, ws_name, columnName, values):
    "setFilter but ignore the listed filter options. also gets around wrong ordinal value which makes index value incorrect"

    scraper = wb._scraper
    tableauscraper.api.delayExecution(scraper)
    ws = next(ws for ws in wb.worksheets if ws.name == ws_name)

    filter = next(
        {
            "globalFieldName": t["globalFieldName"],
        }
        for t in ws.getFilters()
        if t["column"] == columnName
    )

    payload = (
        ("dashboard", scraper.dashboard),
        ("globalFieldName", (None, filter["globalFieldName"])),
        ("qualifiedFieldCaption", (None, columnName)),
        ("membershipTarget", (None, "filter")),
        ("exclude", (None, "false")),
        ("filterValues", (None, json.dumps(values))),
        ("filterUpdateType", (None, "filter-replace"))
    )
    try:
        r = scraper.session.post(
            f'{scraper.host}{scraper.tableauData["vizql_root"]}/sessions/{scraper.tableauData["sessionid"]}/commands/tabdoc/dashboard-categorical-filter',
            files=payload,
            verify=scraper.verify
        )
        scraper.lastActionTime = time.time()

        if r.status_code >= 400:
            raise requests.exceptions.RequestException(r.content)
        resp = r.json()
        errors = [
            res['commandReturn']['commandValidationPresModel']['errorMessage']
            for res in resp['vqlCmdResponse']['cmdResultList']
            if not res['commandReturn'].get('commandValidationPresModel', {}).get('valid', True)
        ]
        if errors:
            wb._scraper.logger.error(str(", ".join(errors)))
            raise tableauscraper.api.APIResponseException(", ".join(errors))

        wb.updateFullData(resp)
        return tableauscraper.dashboard.getWorksheetsCmdResponse(scraper, resp)
    except ValueError as e:
        scraper.logger.error(str(e))
        return tableauscraper.TableauWorkbook(
            scraper=scraper, originalData={}, originalInfo={}, data=[]
        )
    except tableauscraper.api.APIResponseException as e:
        wb._scraper.logger.error(str(e))
        return tableauscraper.TableauWorkbook(
            scraper=scraper, originalData={}, originalInfo={}, data=[]
        )

# since most of the data entries are concentrated in a single sector, I start by scraping all of the other sectors
pdList = []
url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
ws = ts.getWorksheet("Prosjektoversikt")
wb = ws.setFilter('Sektor', 'Annet')  # I choose this sector as an init
wb = ws.setFilter('MDY(Søknadsfrist dato)', 20210210) # then I select the relevant date

#set the filter to all the sectors of interest
filtered = force_setFilter(workbook, "Prosjektoversikt", 'Sektor', ['Annet', 'Helseforetak', 'Instituttsektor', 'Næringsliv', 'Ukjent'])
df = filtered.getWorksheet("Prosjektoversikt").data
pdList.append(df)

#next I focus only on the remaining items in the UoH-sektor
#I create a list of lists with names of applicants, the goal is to keep the chunks under 800 items
filters = [['AHO', 'BI', 'Centre for the Science of Learning and Technology', 'CK', 'DMMH', 'FHS', 'Forskningsavdelingen PHS', 'HIB', 'HIHM', 'HIL', 'HIM', 'HINN', 'HIØ', 'HVL', 'HVO', 'MF', 'NHH', 'NIH'],
           ['Njord - Senter for studier av jordens fysikk', 'NMBU', 'NORD', 'NTNU', 'OSLOMET', 'PHS', 'SAMISKHS', 'Senter for bioinformatikk'],
           ['UIA', 'UIB', 'UIN', 'UIS', 'UIT', 'UNIS', 'USN', 'VID']
          ]

for filter in filters:
    url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

    print(filter)

    ts = TS()
    ts.loads(url)
    workbook = ts.getWorkbook()
    ws = ts.getWorksheet("Prosjektoversikt")
    wb = ws.setFilter('Prosjektansvarlig', 'UIO') #I use UIO as an init here
    wb = ws.setFilter('MDY(Søknadsfrist dato)', 20210210)

    filtered = force_setFilter(workbook, "Prosjektoversikt", 'Prosjektansvarlig', filter)

    df = pd.DataFrame(filtered.getWorksheet("Prosjektoversikt").data)
    pdList.append(df)
    time.sleep(5)  #Not sure if this is necessary but I have had more reliable results with the delay

#Lastly, even the init value must be included
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()
ws = ts.getWorksheet("Prosjektoversikt")
wb = ws.setFilter('Prosjektansvarlig', 'UIO')
wb = ws.setFilter('MDY(Søknadsfrist dato)', 20210210)

df = wb.getWorksheet("Prosjektoversikt").data
pdList.append(df)

result = pd.concat(pdList, ignore_index=True)
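The hardcoded applicant sublists above exist only to keep each response under the roughly 800-item threshold observed earlier in the thread. A small chunking helper could build such sublists from the full filter value list; this is a hypothetical sketch, and the right chunk size depends on how many rows each value actually yields:

```python
def chunk_values(values, max_size):
    """Split a list of filter values into sublists of at most max_size,
    intended for one setFilter call per sublist."""
    return [values[i:i + max_size] for i in range(0, len(values), max_size)]

applicants = ['AHO', 'BI', 'CK', 'DMMH', 'FHS']
print(chunk_values(applicants, 2))  # -> [['AHO', 'BI'], ['CK', 'DMMH'], ['FHS']]
```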
bertrandmartel commented 2 years ago

@janPesl I'm glad you've found out how to get all the data!

If you're using the latest release, there is no need for the force_setFilter function. You just need to get the worksheet from the previous workbook before running setFilter, because the filter list changes when you apply other filters.

Also, you can configure the delay in milliseconds between calls using ts = TS(delayMs=1000) # 1 second delay between calls

from tableauscraper import TableauScraper as TS
import pandas as pd

# since most of the data entries are concentrated in a single sector, I start by scraping all of the other sectors
pdList = []
url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

ts = TS()
ts.loads(url)
rootWs = ts.getWorksheet("Prosjektoversikt")
wb = rootWs.setFilter('Sektor', 'Annet')  # I choose this sector as an init
# then I select the relevant date
wb = rootWs.setFilter('MDY(Søknadsfrist dato)', 20210210)

# get the worksheet again since the filter list has changed
ws = wb.getWorksheet("Prosjektoversikt")
print(ws.getFilters())

# set the filter to all the sectors of interest
filtered = ws.setFilter('Sektor', [
    'Annet', 'Helseforetak', 'Instituttsektor', 'Næringsliv', 'Ukjent'])
df = filtered.getWorksheet("Prosjektoversikt").data
pdList.append(df)

# next I focus only on the remaining items in the UoH-sektor
# I create a list of lists with names of applicants, the goal is to keep the chunks under 800 items
filters = [['AHO', 'BI', 'Centre for the Science of Learning and Technology', 'CK', 'DMMH', 'FHS', 'Forskningsavdelingen PHS', 'HIB', 'HIHM', 'HIL', 'HIM', 'HINN', 'HIØ', 'HVL', 'HVO', 'MF', 'NHH', 'NIH'],
           ['Njord - Senter for studier av jordens fysikk', 'NMBU', 'NORD',
               'NTNU', 'OSLOMET', 'PHS', 'SAMISKHS', 'Senter for bioinformatikk'],
           ['UIA', 'UIB', 'UIN', 'UIS', 'UIT', 'UNIS', 'USN', 'VID']
           ]

for filter in filters:
    url = "https://public.tableau.com/views/SknadertilForskningsrdet-oppsummering/Enkeltsknader"

    print(filter)
    ts = TS(delayMs=1000)  # 1 second delay between calls
    ts.loads(url)
    ws = ts.getWorksheet("Prosjektoversikt")
    # I use UIO as an init here
    wb = ws.setFilter('Prosjektansvarlig', 'UIO')
    wb = ws.setFilter('MDY(Søknadsfrist dato)', 20210210)

    ws = wb.getWorksheet("Prosjektoversikt")

    filtered = ws.setFilter('Prosjektansvarlig', filter)

    df = pd.DataFrame(filtered.getWorksheet("Prosjektoversikt").data)
    pdList.append(df)

# Lastly, even the init value must be included
ts = TS()
ts.loads(url)
ws = ts.getWorksheet("Prosjektoversikt")
wb = ws.setFilter('Prosjektansvarlig', 'UIO')
wb = ws.setFilter('MDY(Søknadsfrist dato)', 20210210)

df = wb.getWorksheet("Prosjektoversikt").data
pdList.append(df)

result = pd.concat(pdList, ignore_index=True)

print(result)

repl.it: https://replit.com/@bertrandmartel/TableauSoknaderServerSideRendering

millievn commented 1 year ago

> @janPesl I see it now, the data is rendered on server side: https://github.com/bertrandmartel/tableau-scraping#server-side-rendering
>
> Server side rendering means that only images are shown on the screen, not the actual data which means we can't get the data unless we find a specific filter that works with client rendering only (for each value since dates use filter-delta in your dashboard). In your case, some filters are client side rendered and others are server side rendered which makes it difficult to use that filter (the date) to download all data.
>
> Getting the data for server side rendering is always a challenge, checkout similar issues https://github.com/bertrandmartel/tableau-scraping/issues?q=is%3Aissue+label%3Aserver-side-rendering+

The data image URLs are listed in the response. I tried to scrape all the images but failed. The image request URL has a geometry field, and it's too difficult to understand its logic.

By the way, is there any API or method to scrape all the images or to calculate the geometry?