bertrandmartel / tableau-scraping

Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
MIT License
130 stars 21 forks source link

get Integer value from _dict instead of real #24

Closed az-data-guru closed 3 years ago

az-data-guru commented 3 years ago

Considering this code block: `from tableauscraper import TableauScraper as TS import pandas as pd import time

data_list = [] url = "https://tableau.azdhs.gov/views/EMResourceBeds/InpatientBedUsageAvailability?%3Aembed=y&"

ts = TS() ts.loads(url) workbook = ts.getWorkbook()

for t in workbook.worksheets: data_list.append(t.data) print(data_list)`

TableauScraper is only offering the real (percentage in this case) number. How can I get the integer instead?

Thanks as always BM

bertrandmartel commented 3 years ago

@az-data-guru You could format the data like this to get the percentage value:

from tableauscraper import TableauScraper as TS
import pandas as pd

url = "https://tableau.azdhs.gov/views/EMResourceBeds/InpatientBedUsageAvailability"

ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

for t in workbook.worksheets:
    df = t.data
    if "SUM(Number)-alias" in df:
        df['SUM(Number)-alias'] = pd.Series([
            "{0:.2f}%".format(val * 100)
            for val in df['SUM(Number)-alias']
        ], index=df.index)

        df['SUM(Number)-value'] = pd.Series([
            "{0:.2f}%".format(val * 100)
            for val in df['SUM(Number)-value']
        ], index=df.index)

    print(df)

which gives:

     Date Pulled-value Date Pulled-alias                        Med Surgbed-alias SUM(Number)-alias SUM(Number)-value
0           2021-07-18         7/18/2021  Inpatient Beds In Use by COVID Patients             9.40%             9.40%
1           2021-07-17         7/17/2021  Inpatient Beds In Use by COVID Patients             8.95%             8.95%
2           2021-07-16         7/16/2021  Inpatient Beds In Use by COVID Patients             8.60%             8.60%
3           2021-07-15         7/15/2021  Inpatient Beds In Use by COVID Patients             8.29%             8.29%
4           2021-07-14         7/14/2021  Inpatient Beds In Use by COVID Patients             7.97%             7.97%
...                ...               ...                                      ...               ...               ...
1419        2020-03-30         3/30/2020                 Inpatient Beds Available            33.90%            33.90%
1420        2020-03-29         3/29/2020                 Inpatient Beds Available            35.58%            35.58%
1421        2020-03-28         3/28/2020                 Inpatient Beds Available            33.13%            33.13%
1422        2020-03-27         3/27/2020                 Inpatient Beds Available            32.07%            32.07%
1423        2020-03-26         3/26/2020                 Inpatient Beds Available            32.39%            32.39%

In fact, the integer values are calculated in Javascript, the real values are the raw data retrieved from the API. In some dashboard, the worksheet also contains the integer value but it depends on the dashboard

However, the format is specified in the json data but it's not straightforward to parse:

"vizDataColumns": [{
    "fn": "[system:visual].[tuple_id]",
    "fnDisagg": "",
    "formatStrings": [],
    "isAutoSelect": false,
    "paneIndices": [0],
    "columnIndices": [0]
}, {
    "fn": "[federated.1w7kli510gv45110nyi201m2xlpg].[none:DatePulled:qk]",
    "fnDisagg": "",
    "localBaseColumnName": "[DatePulled]",
    "baseColumnName": "[federated.1w7kli510gv45110nyi201m2xlpg].[DatePulled]",
    "fieldCaption": "Date Pulled",
    "formatStrings": ["*dd/MM/yyyy", "fr_FR", "", "HH:mm:ss", "0", "1", "4", "0"],
    "datasourceCaption": "federated.1w7kli510gv45110nyi201m2xlpg",
    "dataType": "date",
    "fieldRole": "dimension",
    "isAutoSelect": false,
    "paneIndices": [0],
    "columnIndices": [1]
}, {
    "fn": "[federated.1w7kli510gv45110nyi201m2xlpg].[none:MED_SURGBED:nk]",
    "fnDisagg": "",
    "localBaseColumnName": "[MED_SURGBED]",
    "baseColumnName": "[federated.1w7kli510gv45110nyi201m2xlpg].[MED_SURGBED]",
    "fieldCaption": "Med Surgbed",
    "formatStrings": [],
    "datasourceCaption": "federated.1w7kli510gv45110nyi201m2xlpg",
    "dataType": "cstring",
    "stringCollation": {
        "name": "LEN_RUS_S2",
        "charsetId": 0
    },
    "fieldRole": "dimension",
    "isAutoSelect": true,
    "paneIndices": [0],
    "columnIndices": [2]
}, {
    "fn": "[federated.1w7kli510gv45110nyi201m2xlpg].[sum:Number:qk]",
    "fnDisagg": "",
    "localBaseColumnName": "[Number]",
    "baseColumnName": "[federated.1w7kli510gv45110nyi201m2xlpg].[Number]",
    "fieldCaption": "SOMME(Number)",
    "formatStrings": ["", "-1", "t", "", "", "", "0", ",", " ", "3;0"],
    "datasourceCaption": "federated.1w7kli510gv45110nyi201m2xlpg",
    "dataType": "integer",
    "aggregation": "sum",
    "fieldRole": "measure",
    "isAutoSelect": false,
    "paneIndices": [0],
    "columnIndices": [3]
}, {
    "fn": "[federated.1w7kli510gv45110nyi201m2xlpg].[pcto:sum:Number:qk:2]",
    "fnDisagg": "",
    "localBaseColumnName": "[Number]",
    "baseColumnName": "[federated.1w7kli510gv45110nyi201m2xlpg].[Number]",
    "fieldCaption": "SOMME(Number)",
    "formatStrings": ["p", "0", "f", "", "", "", "2", ",", " ", "3;0"],
    "datasourceCaption": "federated.1w7kli510gv45110nyi201m2xlpg",
    "dataType": "real",
    "aggregation": "sum",
    "fieldRole": "measure",
    "isAutoSelect": false,
    "paneIndices": [0],
    "columnIndices": [4]
}]

I think it would need to understand how the formatStrings is interpreted in Javascript:

"formatStrings": ["", "-1", "t", "", "", "", "0", ",", " ", "3;0"]

But if the code above is returning correct % value as expected, it may be sufficient for your needs

az-data-guru commented 3 years ago

I may not have been clear, instead of the percentage of inpatient I need the actual bed count. It exists in the data_dictionary.

For example, the 2021-07-18 entry in the dictionary has an integer value of 814 and a real value of .09404. I need the 814.

bertrandmartel commented 3 years ago

@az-data-guru this is fixed in v0.1.18. There was multiple columns with the same caption text. Thus, when the column name already exists, I've appended the function name to it.

from tableauscraper import TableauScraper as TS

url = "https://tableau.azdhs.gov/views/EMResourceBeds/InpatientBedUsageAvailability"

ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

for t in workbook.worksheets:
    print(t.name)
    print(t.data)

Output

    Date Pulled-value Date Pulled-alias                        Med Surgbed-alias SUM(Number)-alias SUM(Number)-value SUM(Number)-[federated.1w7kli510gv45110nyi201m2xlpg].[pcto:sum:Number:qk:2]-alias
0          2021-07-18         7/18/2021  Inpatient Beds In Use by COVID Patients               814         0.0940497                                          0.0940497                               
1          2021-07-17         7/17/2021  Inpatient Beds In Use by COVID Patients               775         0.0895023                                          0.0895023                               
2          2021-07-16         7/16/2021  Inpatient Beds In Use by COVID Patients               745         0.0860277                                          0.0860277                               
3          2021-07-15         7/15/2021  Inpatient Beds In Use by COVID Patients               719         0.0829201                                          0.0829201                               
4          2021-07-14         7/14/2021  Inpatient Beds In Use by COVID Patients               689         0.0796716                                          0.0796716                               
..                ...               ...                                      ...               ...               ...                                                ...                               
232        2021-05-05          5/5/2021                 Inpatient Beds Available               915          0.106457                                           0.106457                               
233        2021-05-04          5/4/2021                 Inpatient Beds Available              1061          0.123473                                           0.123473                               
234        2021-05-03          5/3/2021                 Inpatient Beds Available              1120          0.130354                                           0.130354                               
235        2021-05-02          5/2/2021                 Inpatient Beds Available              1093          0.127464                                           0.127464                               
236        2021-05-01          5/1/2021                 Inpatient Beds Available              1034          0.120471                                           0.120471                               

[237 rows x 6 columns]
az-data-guru commented 3 years ago

Thank you!