[Feature] Refactoring for Improved Visualization Notebook Readability and Performance

Paulooh007 commented 6 months ago

Problem

I'll like to refactor visualization_engineering.ipynb due to the following reasons.

Notebook generates unnecessary warnings in cell outputs, affecting readability.

C:\Users\brand\AppData\Local\Temp\ipykernel_20136\1045467410.py:43: UserWarning: FixedFormatter should only be used together with FixedLocator
  google_ax.set_xticklabels([-int(x) for x in google_ax.get_xticks().tolist()])
C:\Users\brand\AppData\Local\Temp\ipykernel_20136\1045467410.py:44: UserWarning: FixedFormatter should only be used together with FixedLocator
  google_ax.set_yticklabels(['{:,}'.format(int(x)) for x in google_ax.get_yticks().tolist()])

The presence of long dictionaries and lengthy functions detracts from the notebook's clarity. Here's an example

def generate_diagram_one(
    row_data, license_name, milestone_unit, annot=True, graph_option=None
):
    google_time_all_series = pd.Series(row_data)
    google_time_all_series.index = google_time_all_series.index * -1 + 6
    google_time_all_series_figs = plt.subplots(figsize=(15, 8), dpi=80)
    google_time_all_toplot = google_time_all_series[::-1].cumsum()[1:]
    google_ax = google_time_all_series_figs[1]
    sns.lineplot(google_time_all_toplot, ax=google_time_all_series_figs[1])
    if annot:
        google_time_all_milestone = [
            find_closest_neighbor(google_time_all_toplot, i)
            for i in range(
                min(google_time_all_toplot)
                // (milestone_unit)
                * (milestone_unit),
                max(google_time_all_toplot),
                milestone_unit,
            )
        ] + [google_time_all_toplot.index[-1]]
        sns.scatterplot(
            google_time_all_toplot.loc[google_time_all_milestone], ax=google_ax
        )
        if graph_option == "a":
            for i in google_time_all_milestone:
                plt.annotate(
                    f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                    (
                        i - 6,
                        google_time_all_toplot.loc[i] + milestone_unit * 0.2,
                    ),
                    ha="center",
                    bbox=dict(
                        boxstyle="square,pad=0.3", fc="grey", alpha=0.5, lw=2
                    ),
                )
        elif graph_option == "b" or graph_option == "c":
            for i in google_time_all_milestone:
                if i == google_time_all_toplot.index[-1]:
                    plt.annotate(
                        f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                        (
                            i - 8,
                            google_time_all_toplot.loc[i]
                            + milestone_unit * 0.1,
                        ),
                        ha="center",
                        bbox=dict(
                            boxstyle="square,pad=0.3",
                            fc="grey",
                            alpha=0.5,
                            lw=2,
                        ),
                    )
                else:
                    plt.annotate(
                        f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                        (
                            i - 8,
                            google_time_all_toplot.loc[i]
                            - milestone_unit * 0.1,
                        ),
                        ha="right",
                        bbox=dict(
                            boxstyle="square,pad=0.3",
                            fc="grey",
                            alpha=0.5,
                            lw=2,
                        ),
                    )

The reliance on explicit loops for data manipulation leads to less efficient and less readable code. Here's one of many example.

for country in google_country_data.index:
    if country in country_codes_data["name"].values:
        cur_country_code = country_codes_data[
            country_codes_data["name"] == country
        ]["alpha-3"].iloc[0]
        country_codes.append(cur_country_code)
    else:
        country_codes.append(None)

which can be refactored to be:

country_name_to_alpha3 = country_codes_data.set_index('name')['alpha-3'].to_dict()

# Use the map function to vectorize the conversion process
country_codes = google_country_data.index.map(lambda country: country_name_to_alpha3.get(country, None)).tolist()

Description

Suppressing unnecessary warnings in cell outputs.
Moving long dictionaries and lengthy functions to a separate utilities file, which will help declutter the notebook, making it easier to navigate and understand.
Replacing explicit loops with vectorized operations, such as .map() and list comprehensions, to improve performance and conciseness.

Implementing these changes will make the notebook more efficient, easier to maintain, and also reduces the length of the notebook.

Implementation

[x] I would be interested in implementing this feature.

TimidRobot commented 6 months ago

Thank you for writing up this issue!

Ultimately, I expect this project won't have any Jupyter Notebooks per:

22

However, I think there's still value in refining the code:

Suppressing unnecessary warnings in cell outputs.

Yes, good

~~Moving long dictionaries and lengthy functions to a separate utilities file, which will help declutter the notebook, making it easier to navigate and understand.~~

(Strikethrough added) One of the strengths of Jupyter Notebooks is showing the narrative, code, and figures all together. This change would "hide" some of the code. I don't think it is worth pursuing.

Replacing explicit loops with vectorized operations, such as .map() and list comprehensions, to improve performance and conciseness.

Yes, good

Paulooh007 commented 6 months ago

Suppressing unnecessary warnings in cell outputs.

Moving long dictionaries and lengthy functions to a separate utilities file, which will help declutter the notebook, making it easier to navigate and understand.

Replacing explicit loops with vectorized operations, such as .map() and list comprehensions, to improve performance and conciseness.

Thanks, I'll submit a PR for this asap.

creativecommons / quantifying