bertrandmartel / tableau-scraping

Tableau scraper python library. R and Python scripts to scrape data from Tableau viz
MIT License

enhancement: workbook_iterate and workbook_flatten #28

Open djay opened 2 years ago

djay commented 2 years ago

I wrote two higher-level functions that could be useful to others if included in your library:

def workbook_iterate(url, **selects):
    "generates combinations of workbooks from combinations of parameters, selects or filters"
def workbook_flatten(wb, date=None, **mappings):
    """return a single DataFrame from a workbook flattened according to mappings
    mappings is worksheetname=columns
    if columns is type str puts a single value into column
    if columns is type dict will map worksheet columns to defined dataframe columns
    if those column names are in turn dicts then the worksheet will be pivoted and the values mapped to columns
    e.g.
    worksheet1="Address", 
    worksheet2=dict(ws_phone="phone", ws_state="State"), 
    worksheet3=dict(ws_state=dict(NSW="State: New South Wales", ...))
    """
    # TODO: generalise what to index by and default value for index

Here is the code, along with examples of how I'm using it.

Used in combination, they let you reliably scrape a lot of data with very little code, at least for use cases similar to mine.

bertrandmartel commented 2 years ago

@djay Thank you, that's great work!

I'm very interested in the workbook_iterate function, since there are many use cases where we need to iterate over parameters/filters (server-side rendering, getting data for every region/county/province, etc.). This addition would greatly reduce boilerplate.

workbook_flatten seems quite advanced, maybe too advanced for most people who will use this library, but I may be wrong.

Do you think you could provide a PR with sample usage for one or both of these features?

djay commented 2 years ago

I've only really used it on my one use case for now.

*(screenshot of the dashboard being scraped)*

For this use case, iterate and flatten work together to reduce the code to:

        dates = reversed(pd.date_range("2021-02-01", today() - relativedelta(hours=7)).to_pydatetime())
        for get_wb, idx_value in workbook_iterate(url, param_date=dates, D2_Province="province"):
            date, province = idx_value
            if province is None:
                continue
            province = get_province(province)
            if skip_valid(df, (date, province), allow_na):
                continue
            if (wb := get_wb()) is None:
                continue
            row = workbook_flatten(
                wb,
                date,
                D2_Vac_Stack={
                    "DAY(txn_date)-value": "Date",
                    "vaccine_plan_group-alias": {
                        "1": "1 Cum",
                        "2": "2 Cum",
                        "3": "3 Cum",
                    },
                    "SUM(vaccine_total_acm)-value": "Vac Given",
                },
                D2_Walkin="Cases Walkin",
                D2_Proact="Cases Proactive",
                D2_Prison="Cases Area Prison",
                D2_NonThai="Cases Imported",
                D2_New="Cases",
                D2_NewTL={
                    "AGG(stat_count)-alias": "Cases",
                    "DAY(txn_date)-value": "Date"
                },
                D2_Lab2={
                    "AGG(% ติดเฉลี่ย)-value": "Positive Rate Dash",
                    "DAY(txn_date)-value": "Date"
                },
                D2_Lab={
                    "AGG(% ติดเฉลี่ย)-alias": "Positive Rate Dash",
                    "ATTR(txn_date)-alias": "Date",
                },
                D2_Death="Deaths",
                D2_DeathTL={
                    "AGG(num_death)-value": "Deaths",
                    "DAY(txn_date)-value": "Date"
                },
            )

which results in:

Date,Province,Cases,Cases Area Prison,Cases Imported,Cases Proactive,Cases Walkin,Deaths,Hospitalized Severe,Positive Rate Dash,Tests,Vac Given 1 Cum,Vac Given 2 Cum,Vac Given 3 Cum
...
2021-09-30,Trang,85.0,0.0,0.0,0.0,85.0,0.0,0.0,,,674308.0,433842.0,26718.0
2021-09-30,Trat,76.0,2.0,0.0,0.0,74.0,0.0,91.0,,,268796.0,193600.0,8403.0
2021-09-30,Ubon Ratchathani,178.0,0.0,0.0,0.0,178.0,1.0,0.0,,,512693.0,332288.0,24226.0
2021-09-30,Udon Thani,123.0,0.0,0.0,0.0,123.0,2.0,0.0,,,496948.0,286579.0,16244.0
2021-09-30,Uthai Thani,25.0,0.0,0.0,1.0,24.0,0.0,1.0,,,124028.0,74468.0,5276.0
2021-09-30,Uttaradit,10.0,0.0,0.0,0.0,10.0,0.0,0.0,,,178581.0,111562.0,5184.0
2021-09-30,Yala,30.0,0.0,0.0,2.0,28.0,2.0,0.0,,,374607.0,227236.0,9466.0
2021-09-30,Yasothon,563.0,2.0,0.0,0.0,561.0,1.0,0.0,,,254452.0,150356.0,5982.0
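The `D2_Vac_Stack` mapping above pivots the worksheet so that each `vaccine_plan_group-alias` value becomes its own column. A small pandas sketch of just that pivot step (column names taken from the example above; the data row is borrowed from the Trang output line):

```python
import pandas as pd

# toy worksheet data shaped like the D2_Vac_Stack worksheet
ws = pd.DataFrame({
    "DAY(txn_date)-value": ["2021-09-30"] * 3,
    "vaccine_plan_group-alias": ["1", "2", "3"],
    "SUM(vaccine_total_acm)-value": [674308, 433842, 26718],
})

# pivot: one row per date, one column per vaccine dose group
pivoted = ws.pivot_table(
    index="DAY(txn_date)-value",
    columns="vaccine_plan_group-alias",
    values="SUM(vaccine_total_acm)-value",
)

# rename pivoted columns per the dict mapping {"1": "1 Cum", ...}
pivoted = pivoted.rename(columns={"1": "1 Cum", "2": "2 Cum", "3": "3 Cum"})
```

In the full `workbook_flatten` call these then get prefixed via the `"SUM(vaccine_total_acm)-value": "Vac Given"` mapping, yielding the `Vac Given 1 Cum` etc. columns in the CSV.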

Maybe you could point me to a more permanent and simpler workbook that I could scrape and use as an example instead? One thing I might have to do is generalise flatten to work for non-timeseries data; it currently assumes a time series, which makes it less useful. Do you know of a good example that's indexed by something other than date?

bertrandmartel commented 2 years ago

@djay For something not indexed by date, maybe:

djay commented 2 years ago

Yeah, maybe the top5leagues one, with a row per player.

It doesn't show off the merging of an embedded graph inside the workbook, but chances are the use case for that is only ever going to be time series.

djay commented 2 years ago

@bertrandmartel maybe workbook_flatten would be more useful if it automatically tried to return a single DataFrame from one workbook. You could then rename columns yourself afterwards to clean it up. So the example above would become:

df = wb.flatten(datetime.now())
df = df.rename({"D2_Vac_Stack: vaccine_plan_group-alias: 1": "Vac Given 1",...

However, for that to work it needs to know the index of every table inside a workbook and assume they all share the same index, and also that single-value tables will have the index value passed in. I'm not yet sure where the information about an internal plot's index lives.

djay commented 2 years ago

@bertrandmartel actually, in that example it would never work for pivoting an internal table/plot, since it wouldn't know which column to pivot on. Maybe the simpler way would be to add an exclude param to flatten and have the user do that pivot manually and combine the results themselves.

In addition, the code I had also dealt with combining internal plots and single values that represented the same data (but potentially different dates). That would also have to be done manually.

So the example would be:

df = wb.flatten(datetime.now(), exclude=["D2_Vac_Stack"])
df = df.rename(columns=dict(D2_New="Cases", D2_NewTL="Cases2", D2_Death="Deaths",...))
df = df.combine_first(df["Cases2"].to_frame("Cases")).drop(columns="Cases2")
...
vac = wb.getWorksheet("D2_Vac_Stack").pivot_table(....
df = df.combine_first(vac)

I'm not sure if the end result saves more work or not...
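The `combine_first` step above merges two columns that report the same quantity from different worksheets, preferring the first and filling its gaps from the second. A minimal pandas sketch with invented values (the column names `Cases`/`Cases2` follow the rename in the example):

```python
import pandas as pd

idx = pd.Index(["2021-09-29", "2021-09-30"], name="Date")

# "Cases" comes from a single-value worksheet (only today's number),
# "Cases2" from the timeline worksheet (full history)
df = pd.DataFrame({"Cases": [None, 85.0], "Cases2": [80.0, 85.0]}, index=idx)

# fill gaps in "Cases" from the timeline column, then drop the duplicate
merged = df.combine_first(df["Cases2"].to_frame("Cases")).drop(columns="Cases2")
```

`combine_first` keeps the left DataFrame's values wherever they are non-NaN, so today's single value wins and the timeline only back-fills missing dates.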

ghost commented 2 years ago

@bertrandmartel inclusion of workbook_iterate would be very helpful, especially if my understanding is correct that it accepts a parameter column and then iterates over its values. Though I suppose doing this manually is fairly easy, it would save a lot of time.

I am presently unable to use Python (due to system restrictions), but it does seem like your library would be very helpful for getting scraped data into Microsoft Power BI, since it supports running Python scripts that return a dataframe. For the time being, many of your Stack Overflow posts have helped me get a solution working in M (Power Query). Though it is much less advanced, it does the job for now.

Thanks for your work.