bertrandmartel / tableau-scraping

Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
MIT License
126 stars 20 forks source link

How to scrape the interactive map? #5

Closed jluo41 closed 3 years ago

jluo41 commented 3 years ago

Hi, thanks for this awesome toolkit. I used your library to scrape an interactive map data. But I encounter a problem of getting insufficient data.

I want to scrape every county's vaccine demographic information show in this page: https://www.dhs.wisconsin.gov/covid-19/vaccine-data.htm#day

I used the following code to get the sheet.

from tableauscraper import TableauScraper as TS
url = 'https://bi.wisconsin.gov/t/DHS/views/VaccinesAdministeredtoWIResidents/VaccinatedWisconsin-County'
ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()
state = 'Wisconsin'
for t in dashboard.worksheets:
    #show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    #show dataframe for this worksheet
    # print(t.data)
    t.data.to_csv('TMP/'+state + '_' + t.name.replace('/', '_') + '.csv')

It returned a sheet named : Wisconsin_Race vax_unvax county.csv. However, this sheet only contains the total state level information. How could I get each county's information by your package?

Thanks.

bertrandmartel commented 3 years ago

@floydluo Hello, on the tableau dashboard, I didn't find race per county, only for the state. Maybe, it's just not there in the dashboard

jluo41 commented 3 years ago

Hi, it's me again.

If run the above code, the race worksheet only returns the state level race data. However, if we open the link in chrome and click the county in the map, the dashboard will be updated to that county's race vaccine information. I have tried to simulated the POST method to send tableau's /select URL to get the data. Then I got a list of values without keys. So I tried to map the values to keys based on indexes, which was not hundred percent accurate.

Here is another state problem by running your code, for the state South Carolina:


from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/NCDHHS_COVID-19_Dashboard_Vaccinations/Summary"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

#show dropdown input name
inputNames = dashboard.getDropdownInputs()
print(inputNames)

#show dropdown values for a given input name
values = dashboard.getDropdownValues("County-Map")
print(values)

dashboard = dashboard.setDropdown("County-Map", "Alamance County")

Then it report the error:

['County-Map', 'Param.Metrics Map']
['North Carolina', 'Alamance County', 'Alexander County', 'Alleghany County', 'Anson County', 'Ashe County', 'Avery County', 'Beaufort County', 'Bertie County', 'Bladen County', 'Brunswick County', 'Buncombe County', 'Burke County', 'Cabarrus County', 'Caldwell County', 'Camden County', 'Carteret County', 'Caswell County', 'Catawba County', 'Chatham County', 'Cherokee County', 'Chowan County', 'Clay County', 'Cleveland County', 'Columbus County', 'Craven County', 'Cumberland County', 'Currituck County', 'Dare County', 'Davidson County', 'Davie County', 'Duplin County', 'Durham County', 'Edgecombe County', 'Forsyth County', 'Franklin County', 'Gaston County', 'Gates County', 'Graham County', 'Granville County', 'Greene County', 'Guilford County', 'Halifax County', 'Harnett County', 'Haywood County', 'Henderson County', 'Hertford County', 'Hoke County', 'Hyde County', 'Iredell County', 'Jackson County', 'Johnston County', 'Jones County', 'Lee County', 'Lenoir County', 'Lincoln County', 'Macon County', 'Madison County', 'Martin County', 'Mcdowell County', 'Mecklenburg County', 'Mitchell County', 'Montgomery County', 'Moore County', 'Nash County', 'New Hanover County', 'Northampton County', 'Onslow County', 'Orange County', 'Pamlico County', 'Pasquotank County', 'Pender County', 'Perquimans County', 'Person County', 'Pitt County', 'Polk County', 'Randolph County', 'Richmond County', 'Robeson County', 'Rockingham County', 'Rowan County', 'Rutherford County', 'Sampson County', 'Scotland County', 'Stanly County', 'Stokes County', 'Surry County', 'Swain County', 'Transylvania County', 'Tyrrell County', 'Union County', 'Vance County', 'Wake County', 'Warren County', 'Washington County', 'Watauga County', 'Wayne County', 'Wilkes County', 'Wilson County', 'Yadkin County', 'Yancey County']
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-87-2e4d4aabf8bf> in <module>
     17 print(values)
     18 
---> 19 dashboard = dashboard.setDropdown("County-Map", "Alamance County")

~/opt/miniconda3/envs/ox/lib/python3.9/site-packages/tableauscraper/TableauDashboard.py in setDropdown(self, inputName, value)
    113             self._scraper, parameterNames[0], value
    114         )
--> 115         return tableauscraper.dashboard.getWorksheetsCmdResponse(self._scraper, r)

~/opt/miniconda3/envs/ox/lib/python3.9/site-packages/tableauscraper/dashboard.py in getWorksheetsCmdResponse(TS, data)
     91     output = []
     92     for selectedZone in zonesWithWorksheet:
---> 93         frameData = utils.getWorksheetCmdResponse(selectedZone, dataFull)
     94 
     95         if frameData is None:

~/opt/miniconda3/envs/ox/lib/python3.9/site-packages/tableauscraper/utils.py in getWorksheetCmdResponse(selectedZone, dataFull)
    252                 ]
    253             if len(index["aliasIndices"]) > 0:
--> 254                 frameData[f'{index["fieldCaption"]}-alias'] = [
    255                     onDataValue(it, t, cstring) for it in index["aliasIndices"]
    256                 ]

~/opt/miniconda3/envs/ox/lib/python3.9/site-packages/tableauscraper/utils.py in <listcomp>(.0)
    253             if len(index["aliasIndices"]) > 0:
    254                 frameData[f'{index["fieldCaption"]}-alias'] = [
--> 255                     onDataValue(it, t, cstring) for it in index["aliasIndices"]
    256                 ]
    257 

~/opt/miniconda3/envs/ox/lib/python3.9/site-packages/tableauscraper/utils.py in onDataValue(it, value, cstring)
    157 
    158 def onDataValue(it, value, cstring):
--> 159     return value[it] if (it >= 0) else cstring[abs(it) - 1]
    160 
    161 

IndexError: list index out of range

It seems that the code cannot match values with keys. Do you have some suggestions?

Thank you,

Junjie

bertrandmartel commented 3 years ago

@floydluo There is a bug in the library

For the 1st case you can use the following to get the race data for a specific county in the Map worksheet

from tableauscraper import TableauScraper as TS

url = "https://bi.wisconsin.gov/t/DHS/views/VaccinesAdministeredtoWIResidents/VaccinatedWisconsin-County"
ts = TS()
ts.loads(url)

dashboard = ts.getWorksheet("Map").select("County", "Waukesha County")
print(dashboard.getWorksheet("Race vax/unvax county").data)

Note that the following will return same error as above, I have to investigate this :

from tableauscraper import TableauScraper as TS

url = "https://bi.wisconsin.gov/t/DHS/views/VaccinesAdministeredtoWIResidents/VaccinatedWisconsin-County"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()
for t in dashboard.worksheets:
    print(f"WORKSHEET NAME : {t.name}")

# show selectable columns
columns = ts.getWorksheet("Map").getSelectableColumns()
print(columns)

# show values by column name
values = ts.getWorksheet("Map").getValues("County")
print(values)

# select that value
dashboard = ts.getWorksheet("Map").select("County", "Waukesha County")
print(dashboard.getWorksheet("Race vax/unvax county").data)

The 2nd case error may be related to the former one, but I have to check

bertrandmartel commented 3 years ago

@floydluo I've narrowed down the problem. The data dictionary that is retrieved from the dashboard looks like :

{
   "0": {
       "cstring": [....],
       .....
   }
}

But when we perform the select request, sometimes it returns the whole data + the additional one:

{
   "0": {
       "cstring": [....],
       .....
   },
  "1": {
       "cstring": [....],
       .....
   }
}

And sometimes it only returns the missing data:

{
  "1": {
       "cstring": [....],
       .....
   }
}

As the indexing is extending all the objects in the data dictionary, there is an index error.

I would need to record the data between requests. I need to perform more tests to see if it has impact when performing multiple select requests in the same session

bertrandmartel commented 3 years ago

@floydluo This is fixed in v0.0.8

1st usecase: https://replit.com/@bertrandmartel/TableauCovidWisconsin

from tableauscraper import TableauScraper as TS

url = "https://bi.wisconsin.gov/t/DHS/views/VaccinesAdministeredtoWIResidents/VaccinatedWisconsin-County"

ts = TS()
ts.loads(url)

worksheet = ts.getWorksheet("Map")

dashboard = worksheet.select("County", "Waukesha County")
print(dashboard.getWorksheet("Race vax/unvax county").data)

dashboard = worksheet.select("County", "Forest County")
print(dashboard.getWorksheet("Race vax/unvax county").data)

2nd usecase: https://replit.com/@bertrandmartel/TableauCovidNCDHHS

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/NCDHHS_COVID-19_Dashboard_Vaccinations/Summary"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

countyDashboard = dashboard.setDropdown("County-Map", "Alamance County")
print(countyDashboard.getWorksheet("Total Card Summary").data)

metricDashboard = dashboard.setDropdown(
    "Param.Metrics Map",
    "People Fully Vaccinated"
)
print(metricDashboard.getWorksheet("Total Card Summary").data)
jluo41 commented 3 years ago

@bertrandmartel That's great! Thank you so much. I will look your source code to further understand it.

bertrandmartel commented 3 years ago

@floydluo Hello I've just fixed an issue regarding the same data recording issue. You may want to upgrade to v0.0.9 if needed

jluo41 commented 3 years ago

@bertrandmartel Hello, thanks for your great work. It works well for Wisconsin. The selection function is perfect to deal with this state.

However, when I turn to the South Carolina, I find the selection function doesn't work. It will return the same df when selecting different counties.

For example

from tableauscraper import TableauScraper as TS

url = 'https://public.tableau.com/views/COVIDVaccineDashboard/RECIPIENTVIEW'
ts = TS()
ts.loads(url)
county1 = 'Greenville'
dashboard = ts.getWorksheet("Vaccine Map By SC residents PEOPLE").select('Recipient County for maps', county1)
df1 = dashboard.getWorksheet('Final Age xSex x Race REC').data

url = 'https://public.tableau.com/views/COVIDVaccineDashboard/RECIPIENTVIEW'
ts = TS()
ts.loads(url)
county2 = 'Horry'
dashboard = ts.getWorksheet("Vaccine Map By SC residents PEOPLE").select('Recipient County for maps', county2)
df2 = dashboard.getWorksheet('Final Age xSex x Race REC').data

Here df1 and df2 are exactly the same.

Besides, there is also a problem in dealing with Ohio data (I have scraped Ohio by simulating tableau's API with my dirty codes, just to share this problem with you).

Ohio needs to first Select the county and then Drop set the value Race. However, the drop method didn't work here.

You can try the following code:

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/VaccineAdministrationMetricsDashboard/PublicCountyDash"
ts = TS()
ts.loads(url)

dashboard = ts.getWorksheet("New Map").select("county", 'Wyandot')

dashboard = dashboard.setDropdown('Key Metrics', "Race")

df = dashboard.getWorksheetname('New Demographics').data

Here df is still the Age table.

Thanks!

bertrandmartel commented 3 years ago

Hello, for the 1st case, is the chart you are working with is the one titled Rate of Vaccine Recipients by County per 10,000 Residents ?

jluo41 commented 3 years ago

Hi, @bertrandmartel, the chart I am interested in is this one, which contains the race information.

image

In the code, the associated worksheet is Final Age xSex x Race REC.

All sheets are as follows:


['Doses Admin KPI number',
 'Interpretation',
 'DOSES SC RES',
 'Gauge Partial',
 'Final Age xSex x Race REC',
 'Custom Axis REC',
 'Ethnicity Pie REC',
 "Average Age Table REC'D",
 'People for 1.1',
 'Vaccine Map By SC residents PEOPLE',
 'Vaccine Map By SC residents PEOPLE (3)',
 'County Table People Sc Residents',
 'People for 1.1 Filter',
 'title 1.1',
 '1/2 Doses FOR 1.1 (2)',
 '1/2 Doses FOR 1.1 (3)',
 '% of Phase 1A',
 'Gauge 1 Text box',
 'Gauge Compllete',
 'Gauge 2 Text Box',
 'People Completed for 1.1',
 'Census link',
 '% of Doses Admin SC to SC residents']
bertrandmartel commented 3 years ago

Oh I see, so this is not when you click on a county on the map that it changes the data. But when using the dropdown. In fact, I called it setDropDown but it's actually a /set-parameter-value API call

In this case it calls /categorical-filter-by-index which is not implemented

jluo41 commented 3 years ago

OK, got it. And it is also hard to simulate the /categorical-filter-by-index in python. I tried but failed. Instead, I can try the zip code selection, which can be clicked in the map. I will try later and report the results to you.

Thanks for your great work.

bertrandmartel commented 3 years ago

For Ohio, the problem is just that the select doesn't return back the dropdown list values (fieldCaption with isAutoSelect). This is not really a bug, you can use a distinct variable for each dashboard:

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/VaccineAdministrationMetricsDashboard/PublicCountyDash"
ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

countryDashboard = ts.getWorksheet("New Map").select("county", 'Wyandot')

raceDashboard = dashboard.setDropdown('Key Metrics', "Race")

df = raceDashboard.getWorksheet('New Demographics').data
print(df)

Note that I use dashboard.setDropdown and not countryDashboard.setDropdown.

I'm not sure if I could record the presModel between each call, this would be hard to code/maintain. For now, using separate dashboard object is sufficient I think

The code above is supposed to work but it doesn't, there is a bug with the index of tuple which is easy to fix

This is first time I've noticed a paneIndices and columnIndices of length 2:

{
'fn': '[system:visual].[tuple_id]',
'fnDisagg': '',
'formatStrings': [],
'isAutoSelect': False,
'paneIndices': [0, 1],
'columnIndices': [0, 0]
}

This means that this feature targets both index (0,0) and index (1,0) but in the original code I take only first index (0,0) which is not correct in this case. This is annoying because it means that for a single field you could have multiple set of values. I will see if that's easy to drop the bad index and choose the correct one

bertrandmartel commented 3 years ago

Ok the Ohio issue is fixed, I will try to deal with /categorical-filter-by-index

bertrandmartel commented 3 years ago

I've investigated for the category filter, the filter values are located in the info JSON (not data), a custom parsing is needed. The values are indexed from 0 (those are not like the select values that are indexed using a dynamic list of tupleIds).

This should be easy to implement. Tomorrow (sunday), I will release a new version with all the fix

jluo41 commented 3 years ago

Hi, @bertrandmartel, the code can handle Ohio now. For the South Carolina, I have tried to select the Zipcode instead of the County, but it also returns the same results. Maybe we have to turn to category filter. I appreciate your quick response to the issue and I do learn a lot from your project. Cheers!

For my own experience of simulating category filter, I can't compose the request properly and only get bad response with one short line. Look forward to your methods.

bertrandmartel commented 3 years ago

@floydluo I've made a release which implements the filter API (v0.1.0). This release also renamed a bunch of methods in order to keep things more close to the underlying Tableau API.

Additional method are: getFilters() and setFilter on the worksheet class

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/VaccineAdministrationMetricsDashboard/PublicCountyDash"
ts = TS()
ts.loads(url)
wb = ts.getWorkbook()
ws = ts.getWorksheet("New Map")

counties = [
    t["values"] for t in ws.getSelectableItems()
    if t["column"] == "county"
][0]
print(counties)

wb.setParameter('Key Metrics', "Race")

for county in counties:
    print(county)
    wb = ws.select("county", county)
    demograhics = wb.getWorksheet('New Demographics')
    print(demograhics.data)
from tableauscraper import TableauScraper as TS
import pandas as pd
url = 'https://public.tableau.com/views/COVIDVaccineDashboard/RECIPIENTVIEW'
ts = TS()
ts.loads(url)

ws = ts.getWorksheet("Vaccine Map By SC residents PEOPLE")
print(ws.data)

# show filter columns and values
filters = ws.getFilters()
print(filters)

# get all counties values
counties = [
    t["values"]
    for t in filters
    if t["column"] == "[Recipient County for maps]"
][0]

for county in counties:
    print(county)
    wb = ws.setFilter('[Recipient County for maps]', county)
    countyWs = wb.getWorksheet("Final Age xSex x Race REC")
    print(countyWs.data)
from tableauscraper import TableauScraper as TS

url = "https://bi.wisconsin.gov/t/DHS/views/VaccinesAdministeredtoWIResidents/VaccinatedWisconsin-County"

ts = TS()
ts.loads(url)

worksheet = ts.getWorksheet("Map")

dashboard = worksheet.select("County", "Waukesha County")
print(dashboard.getWorksheet("Race vax/unvax county").data)

dashboard = worksheet.select("County", "Forest County")
print(dashboard.getWorksheet("Race vax/unvax county").data)
from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/NCDHHS_COVID-19_Dashboard_Vaccinations/Summary"

ts = TS()
ts.loads(url)
dashboard = ts.getWorkbook()

countyDashboard = dashboard.setParameter("County-Map", "Alamance County")
print(countyDashboard.getWorksheet("Total Card Summary").data)

metricDashboard = dashboard.setParameter(
    "Param.Metrics Map",
    "People Fully Vaccinated"
)
print(metricDashboard.getWorksheet("Total Card Summary").data)
jluo41 commented 3 years ago

Great work! I will follow it up. Thank you.

bertrandmartel commented 3 years ago

@floydluo you're welcome. I've found this NYC Tableau vaccine tracker and I've setup a repo scraping the data every day with Github Actions : https://github.com/bertrandmartel/covid19-nyc-vaccine-tracker

I've implemented the goToSheet api command for this usecase, in case you need it later, the usage is here

also this one for Idaho uses the goToSheet command (big dashboard button)

jluo41 commented 3 years ago

Hi, @bertrandmartel, great work!

Here I just want to make clear about the conceptions used in Tableau and Tableau Scraper. What the relationships among Dashboard, Workbook, Worksheet? When we initialized a ts, what does it refer to as? What is the difference between SetParameter and SelectValue? Does Workbook SelectValue and Worksheet SetParameter?

Thanks!

bertrandmartel commented 3 years ago

@floydluo Hello, dashboard is the same as workbook. ts is just TableauScraper object that is used to extract the initial data. SetParameter calls set-parameter-value API here and SelectValue calls select API here

The setParameter doesn't depends on the worksheet, see the payload, it just sets parameterName/parameterValue

The worksheet depends on the worksheet, see the payload

jluo41 commented 3 years ago

@bertrandmartel Hi, sorry to bother again. But the code for South Caroline is dead again. It seems only one county is in the [Recipient County for maps] now.

bertrandmartel commented 3 years ago

@floydluo I see, I've identified the problem. It seems the quickFilterDisplay is irrelevant and I should detect filtersJson instead which is another complex objects with schema fields and tuples fields

bertrandmartel commented 3 years ago

@floydluo this is fixed in v1.0.3, the column name is a bit different :

https://replit.com/@bertrandmartel/TableauCovidSouthCarolina

from tableauscraper import TableauScraper as TS

url = 'https://public.tableau.com/views/COVIDVaccineDashboard/RECIPIENTVIEW'
ts = TS()
ts.loads(url)

ws = ts.getWorksheet("Vaccine Map By SC residents PEOPLE")
print(ws.data)

# show filter columns and values
filters = ws.getFilters()
print(filters)

# get all counties values
counties = next(iter([
    t["values"]
    for t in filters
    if t["column"] == "Recipient County for maps"
]))
print(counties)

for county in counties:
    print(county)
    wb = ws.setFilter('Recipient County for maps', county)
    countyWs = wb.getWorksheet("Final Age xSex x Race REC")
    print(countyWs.data)
jluo41 commented 3 years ago

Got it! Thanks. It works perfectly.