creativecommons / quantifying

quantify the size and diversity of the commons--the collection of works that are openly licensed or in the public domain
MIT License
23 stars 31 forks source link

[Feature] Refactoring for Improved Visualization Notebook Readability and Performance #80

Closed Paulooh007 closed 6 months ago

Paulooh007 commented 6 months ago

Problem

I'll like to refactor visualization_engineering.ipynb due to the following reasons.

  1. Notebook generates unnecessary warnings in cell outputs, affecting readability.

    C:\Users\brand\AppData\Local\Temp\ipykernel_20136\1045467410.py:43: UserWarning: FixedFormatter should only be used together with FixedLocator
      google_ax.set_xticklabels([-int(x) for x in google_ax.get_xticks().tolist()])
    C:\Users\brand\AppData\Local\Temp\ipykernel_20136\1045467410.py:44: UserWarning: FixedFormatter should only be used together with FixedLocator
      google_ax.set_yticklabels(['{:,}'.format(int(x)) for x in google_ax.get_yticks().tolist()])
  2. The presence of long dictionaries and lengthy functions detracts from the notebook's clarity. Here's an example

    def generate_diagram_one(
        row_data, license_name, milestone_unit, annot=True, graph_option=None
    ):
        google_time_all_series = pd.Series(row_data)
        google_time_all_series.index = google_time_all_series.index * -1 + 6
        google_time_all_series_figs = plt.subplots(figsize=(15, 8), dpi=80)
        google_time_all_toplot = google_time_all_series[::-1].cumsum()[1:]
        google_ax = google_time_all_series_figs[1]
        sns.lineplot(google_time_all_toplot, ax=google_time_all_series_figs[1])
        if annot:
            google_time_all_milestone = [
                find_closest_neighbor(google_time_all_toplot, i)
                for i in range(
                    min(google_time_all_toplot)
                    // (milestone_unit)
                    * (milestone_unit),
                    max(google_time_all_toplot),
                    milestone_unit,
                )
            ] + [google_time_all_toplot.index[-1]]
            sns.scatterplot(
                google_time_all_toplot.loc[google_time_all_milestone], ax=google_ax
            )
            if graph_option == "a":
                for i in google_time_all_milestone:
                    plt.annotate(
                        f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                        (
                            i - 6,
                            google_time_all_toplot.loc[i] + milestone_unit * 0.2,
                        ),
                        ha="center",
                        bbox=dict(
                            boxstyle="square,pad=0.3", fc="grey", alpha=0.5, lw=2
                        ),
                    )
            elif graph_option == "b" or graph_option == "c":
                for i in google_time_all_milestone:
                    if i == google_time_all_toplot.index[-1]:
                        plt.annotate(
                            f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                            (
                                i - 8,
                                google_time_all_toplot.loc[i]
                                + milestone_unit * 0.1,
                            ),
                            ha="center",
                            bbox=dict(
                                boxstyle="square,pad=0.3",
                                fc="grey",
                                alpha=0.5,
                                lw=2,
                            ),
                        )
                    else:
                        plt.annotate(
                            f"At {-i} month before: \nReach {google_time_all_toplot.loc[i]:,} Documents",
                            (
                                i - 8,
                                google_time_all_toplot.loc[i]
                                - milestone_unit * 0.1,
                            ),
                            ha="right",
                            bbox=dict(
                                boxstyle="square,pad=0.3",
                                fc="grey",
                                alpha=0.5,
                                lw=2,
                            ),
                        )
  3. The reliance on explicit loops for data manipulation leads to less efficient and less readable code. Here's one of many example.

    for country in google_country_data.index:
        if country in country_codes_data["name"].values:
            cur_country_code = country_codes_data[
                country_codes_data["name"] == country
            ]["alpha-3"].iloc[0]
            country_codes.append(cur_country_code)
        else:
            country_codes.append(None)

    which can be refactored to be:

    country_name_to_alpha3 = country_codes_data.set_index('name')['alpha-3'].to_dict()
    
    # Use the map function to vectorize the conversion process
    country_codes = google_country_data.index.map(lambda country: country_name_to_alpha3.get(country, None)).tolist()

Description

Implementing these changes will make the notebook more efficient, easier to maintain, and also reduces the length of the notebook.

Implementation

TimidRobot commented 6 months ago

Thank you for writing up this issue!

Ultimately, I expect this project won't have any Jupyter Notebooks per:

However, I think there's still value in refining the code:

  • Suppressing unnecessary warnings in cell outputs.

Yes, good

  • Moving long dictionaries and lengthy functions to a separate utilities file, which will help declutter the notebook, making it easier to navigate and understand.

(Strikethrough added) One of the strengths of Jupyter Notebooks is showing the narrative, code, and figures all together. This change would "hide" some of the code. I don't think it is worth pursuing.

  • Replacing explicit loops with vectorized operations, such as .map() and list comprehensions, to improve performance and conciseness.

Yes, good

Paulooh007 commented 6 months ago
  • Suppressing unnecessary warnings in cell outputs.
  • Moving long dictionaries and lengthy functions to a separate utilities file, which will help declutter the notebook, making it easier to navigate and understand.
  • Replacing explicit loops with vectorized operations, such as .map() and list comprehensions, to improve performance and conciseness.

Thanks, I'll submit a PR for this asap.