bertrandmartel / tableau-scraping

Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
MIT License
126 stars 20 forks source link

Filter grabs incorrect data (Similar to Issue #6) #12

Closed jbadelson closed 3 years ago

jbadelson commented 3 years ago

First, thanks for putting this together. It's likely this problem is either user error on my part or weirdness in how the worksheet I'm trying to scrape is set up.

I'm trying to get filtered data from the https://analytics.la.gov/t/LDH/views/covid19_hosp_vent_reg/Hosp_vent_c. The options listed on the dropdown on the visualization are "(All)" and then the nine regions. Using getFilters() returns just the nine regions and when using setFilters() the data returned is for the prior region in the index.

For example. This code:

url = 'https://analytics.la.gov/t/LDH/views/covid19_hosp_vent_reg/Hosp_vent_c'
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

sheets = workbook.getSheets()
ws = ts.getWorksheet('Hospitalization and Ventilator Usage')
wb = ws.setFilter('Region', '2 - Baton Rouge')
regionWs = wb.getWorksheet('Hospitalization and Ventilator Usage')
print(regionWs.data)

Returns the results for '1 - New Orleans' which is in the prior index position.

Setting the filter to the first region ('1 - New Orleans') results in KeyError: 'dataDictionary' when trying to access its data.

The issue appears to be similar to one addressed in Issue #6, except with filters rather than selections. In this case, getSelectableItems() returns the data displayed on the chart when "(All)" is selected.

I'm using tableauscraper version 0.1.8.

bertrandmartel commented 3 years ago

@jbadelson Hello, the website seems to be only available in the US. I've managed to run it in repl.it (based in US) and extract all metadata for analysis

After investigating, the problem comes from the filter index, but my guess is that table.schema[].ordinal field seems to indicate the start filter index:

"schema": [{
    "caption": "Region",
    "collation": {
        "f": 0,
        "l": 4294967295
    },
    "column_class": 1,
    "dataType": "s",
    "family_name": "Extract",
    "fieldType": "N",
    "hidden": false,
    "name": ["sqlproxy.10q6ok10kzhj6r13vx7j91uvn4mv", "Region"],
    "ordinal": 1, <==================================================
    "role": "d"
}],

I didn't have any example showing that this field was important until now. The 3 following setFilter usecases have all ordinal set to 0:

https://replit.com/@bertrandmartel/TableauCovidNewHampshire https://replit.com/@bertrandmartel/TableauCovidSouthCarolina https://replit.com/@bertrandmartel/TableauFilter

I will implement the parsing of the ordinal field in schema object


I think it's more related to #5 because it deals with how filterJson parsing is implemented. Indexing issues are major challenges in this library since the indexing is sparsed in different places. For example, filter and select have completly different indexing mecanism:

bertrandmartel commented 3 years ago

@jbadelson I've released v0.1.9 which fix the index issue. But there is still something amiss when working with this url. I don't know if it's related to the filters, but it seems it doesn't return data (KeyError: 'dataDictionary') if I perform more than X setFilter in a row. For instance:

from tableauscraper import TableauScraper as TS

url = 'https://analytics.la.gov/t/LDH/views/covid19_hosp_vent_reg/Hosp_vent_c'
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

ws = ts.getWorksheet('Hospitalization and Ventilator Usage')

regions = next(iter([
    t["values"]
    for t in ws.getFilters()
    if t["column"] == "Region"
]))
print(regions)

for region in regions:
  print(region)
  wb = ws.setFilter('Region', region)
  ws = wb.getWorksheet('Hospitalization and Ventilator Usage')
  print(ws.data)

It fails at 5 - Southwest (or sometimes 8 - Monroe) but if I restart the session (with a new TS object) I get the result for 5 - Southwest so the filter query seems correct. Something is wrong with the session and I think it's related to how the dataDictionary is managed between calls (maybe tableau server specific). I still need to investigate but you can workaround by re-instantiating the TS object for the moment if you need to iterate all the regions

This seems particularly unusual since https://replit.com/@bertrandmartel/TableauCovidNewHampshire#main.py for NewHampshire works well with more than 10 setFilter in a row

bertrandmartel commented 3 years ago

@jbadelson Everything is fixed in v0.1.10. The following get all regions:

from tableauscraper import TableauScraper as TS

url = 'https://analytics.la.gov/t/LDH/views/covid19_hosp_vent_reg/Hosp_vent_c'
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

ws = ts.getWorksheet('Hospitalization and Ventilator Usage')

regions = next(iter([
    t["values"]
    for t in ws.getFilters()
    if t["column"] == "Region"
]))
print(regions)

for region in regions:
  print(region)
  wb = ws.setFilter('Region', region)
  regionWs = wb.getWorksheet('Hospitalization and Ventilator Usage')
  print(regionWs.data)

repl.it: https://replit.com/@bertrandmartel/TableauCovid19Louisiana

Note that I've added a warning in case the response doesn't return any data. In this case, it's normal behaviour but most of the time, it would mean that something is wrong in the input or that there is a bug

jbadelson commented 3 years ago

Works perfectly now, thank you so much!

With regard to the sessions issue: I'm guessing that's something specifically related to how this agency set up their server. I hadn't thought about this until just now, but this chart and others hosted on the same server will often reset themselves when switching between filters in a browser.

Thank you again for coming up with a solution for both issues and for doing it so quickly.