QuantEcon / lecture-python-intro

An Undergraduate Lecture Series for the Foundations of Computational Economics
https://intro.quantecon.org/

[inequality] Update exercise 3 #498

Closed longye-tian closed 22 hours ago

longye-tian commented 4 days ago

Hi Matt @mmcky ,

I have updated exercise 3 of the inequality lecture using your code in #410 and added the simulation part below your solution.

What do you think about this version of the solution?

Best, Longye

netlify[bot] commented 4 days ago

Deploy Preview for taupe-gaufre-c4e660 ready!

Latest commit: 6e2d53ef9ccc1b2c0b43214f624dbd06887c2ab0
Latest deploy log: https://app.netlify.com/sites/taupe-gaufre-c4e660/deploys/668774623d4e3f00080ac468
Deploy Preview: https://deploy-preview-498--taupe-gaufre-c4e660.netlify.app

mmcky commented 4 days ago

thanks @longye-tian this is looking really good.

I had a quick look and I think it would be great to add timing comparisons between the two functions to demonstrate how much quicker the vectorized code runs.

longye-tian commented 4 days ago

> thanks @longye-tian this is looking really good.
>
> I had a quick look and I think it would be great to add timing comparisons between the two functions to demonstrate how much quicker the vectorized code runs.

Hi Matt @mmcky ,

I just updated the solution and main text by adding the %%time to show the computation time.

What do you think about this comparison?

Best, Longye

mmcky commented 4 days ago

@longye-tian just have an undefined label

longye-tian commented 4 days ago

> @longye-tian just have an undefined label
>
> * [ ] /home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient

Thank you very much Matt!

I just added the label:

(code:gini-coefficient)=

I hope this will pass the checks 👍
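For reference, the MyST target syntax has no space before the `=`, and the label can then be cross-referenced later in the lecture. A minimal sketch (the surrounding text is illustrative, not from the lecture):

```md
(code:gini-coefficient)=

... the labeled code cell ...

As defined in {ref}`code:gini-coefficient`, the Gini coefficient ...
```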

Best, Longye

github-actions[bot] commented 4 days ago

🚀 Deployed on https://668776713756a8db55160eea--taupe-gaufre-c4e660.netlify.app

mmcky commented 3 days ago

@longye-tian the kernel is dying when testing against the Google Colab environment.

Would you mind running the notebook version of this PR on Google Colab to see if you can replicate this issue?

You can use jupytext to convert between the two formats.

https://manual.quantecon.org/writing/converting.html#myst-markdown-md-to-jupyter-notebook-ipynb

longye-tian commented 3 days ago

Hi Matt @mmcky ,

Thank you for this information. Google Colab returns the following error:

For the code:

!pip install quantecon
import quantecon as qe
import numpy as np
import pandas as pd
import random as rd

varlist = ['n_wealth',   # net wealth
           't_income',   # total income
           'l_income']   # labor income

df = df_income_wealth   # defined earlier in the lecture

# create a dict to store the Gini series for each inequality measure
results = {}

for var in varlist:
    # create a list to store the Gini for each year
    gini_yr = []
    for year in years:
        # repeat the observations according to their weights
        counts = list(round(df[df['year'] == year]['weights']))
        y = df[df['year'] == year][var].repeat(counts)
        y = np.asarray(y)

        rd.shuffle(y)    # shuffle the sequence

        # calculate and store Gini
        gini = qe.gini_coefficient(y)
        gini_yr.append(gini)

    results[var] = gini_yr

# convert to DataFrame and save
results = pd.DataFrame(results, index=years)
results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

It returns

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-20-a7cb0b0c20b0> in <cell line: 32>()
     30 # Convert to DataFrame
     31 results = pd.DataFrame(results, index=years)
---> 32 results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

4 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/common.py in check_parent_directory(path)
    598     parent = Path(path).parent
    599     if not parent.is_dir():
--> 600         raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
    601 
    602 

OSError: Cannot save file into a non-existent directory: '_static/lecture_specific/inequality'
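One possible workaround on Colab (a sketch, not part of the PR; the toy DataFrame and the name example.csv are mine): DataFrame.to_csv does not create missing parent directories, so create them first with os.makedirs.

```python
import os
import pandas as pd

outdir = "_static/lecture_specific/inequality"
os.makedirs(outdir, exist_ok=True)   # create missing parent directories first

# now to_csv can write without raising OSError
# (toy DataFrame standing in for the Gini results)
df = pd.DataFrame({"gini": [0.41, 0.43]}, index=[2016, 2019])
df.to_csv(os.path.join(outdir, "example.csv"), index_label="year")
```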

What do you think about this issue?

Best, Longye

mmcky commented 2 days ago

thanks @longye-tian. That makes sense.

Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method. Otherwise, my thought is:

  1. we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
  2. remove the to_csv method in the lecture code.
longye-tian commented 2 days ago

> thanks @longye-tian. That makes sense.
>
> Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method. Otherwise, my thought is:
>
> 1. we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
> 2. remove the to_csv method in the lecture code.

Hi Matt @mmcky ,

I will look into it ! Thank you for the idea.

Best, Longye

longye-tian commented 2 days ago

Hi Matt @mmcky ,

Just updated the data.ipynb and the code.

What do you think?

Best, Longye

mmcky commented 2 days ago

thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

longye-tian commented 1 day ago

> thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

Hi Matt @mmcky ,

In the same lecture, just after the save-to-CSV code, we have:

ginis = pd.read_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_col='year')

which also raises an error in Google Colab, so I changed that code too.

Another issue I found in this lecture is that when I use Google Colab to run the updated version, some code, such as

gini_coefficient(data.n_wealth.values)

in the exercise section, takes 10 minutes to run and then crashes my Colab page after using all available RAM.

Best, Longye

mmcky commented 1 day ago

thanks @longye-tian, let me dive into this PR after lunch. I suspect the initial thought was to fetch the raw data from GitHub directly (and there is probably a no-execute tag on the data-generating cell). I'll review this.

mmcky commented 1 day ago

@longye-tian I have made some adjustments and simplifications now that data.ipynb has been committed. The code block that contained the data-generating code had skip-execution as a tag, but Google Colab doesn't understand that context, so I have switched to a download role.

This PR won't execute as the data is not yet available on GitHub, but if you are happy with everything else I can merge and test.

mmcky commented 1 day ago

@longye-tian can you test out this latest version of the lecture on Google Colab? The Jupyter kernel is dying in our tests, but I'm not sure what is driving that now. Thanks @longye-tian

longye-tian commented 1 day ago

> @longye-tian can you test out this latest version of the lecture on Google Colab? The Jupyter kernel is dying in our tests, but I'm not sure what is driving that now. Thanks @longye-tian

Hi Matt @mmcky ,

I just tested in Google Colab; I think the build failure with Google Colab is because the dataset used in the exercise is too large.

data.n_wealth.values

When running the code in Google Colab, this exhausts the available RAM and leads to a crash.

I have changed the code so that it does not compute over the whole dataset (30,000+ observations) but only 3,000 observations, using the following code:

gini_coefficient(data.n_wealth.values[1:3000])

gini(data.n_wealth.values[1:3000])

This resolves the current build failure. What do you think about this change?

Best, Longye.

jstac commented 1 day ago

Many thanks @longye-tian for your hard work on this. @mmcky , I'll leave this one for you to merge when ready. There's a small comment above.

I recommend that we go for the simplest options at each stage, focusing on ease of maintenance, not the most ambitious.

mmcky commented 1 day ago

@longye-tian the source data file is only 31 MB, so I don't understand why Google Colab would not be able to process it.

A general rule of thumb for pandas is that you need 3x the RAM for the amount of data you are processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

mmcky commented 1 day ago

@longye-tian I am currently running a

%prun gini_coefficient(data.n_wealth.values)

and will post the results here when they come in.

longye-tian commented 1 day ago

> @longye-tian the source data file is only 31 MB, so I don't understand why Google Colab would not be able to process it.
>
> A general rule of thumb for pandas is that you need 3x the RAM for the amount of data you are processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

Hi Matt @mmcky ,

The length of data.n_wealth.values is 31,240.

And the gini function uses vectorization, which creates a matrix of size 31240 × 31240; at 8 bytes per float64, that requires around 8 GB of memory.

I think that could be a potential reason.
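A quick back-of-the-envelope check of this estimate:

```python
n = 31_240                    # number of observations in data.n_wealth
bytes_needed = n * n * 8      # one float64 entry per (i, j) pair
print(f"{bytes_needed / 1e9:.1f} GB")   # roughly 7.8 GB for the full matrix
```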

Best, Longye

mmcky commented 1 day ago

> we use vectorization which creates a matrix of size 31240 × 31240 with 8 bytes per float64

Nice investigative work @longye-tian. Spot on -- that's a big matrix!

[Screenshot 2024-07-05 at 10:53:52 AM]

I had just profiled the non-vectorized code, which looked fine from a memory perspective.

[Screenshot 2024-07-05 at 10:52:19 AM]

and memory is

peak memory: 266.78 MiB, increment: 1.45 MiB

This is a really nice example of the tradeoffs between compute and memory :-)
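To illustrate the tradeoff, here is a sketch of the two approaches (the names gini_loop and gini_vectorized are mine, not the lecture's): the loop version keeps memory at O(n) but pays in time, while the vectorized version materializes the full n × n matrix of pairwise differences.

```python
import numpy as np

def gini_loop(y):
    # O(n^2) time, O(n) memory: accumulate pairwise absolute differences row by row
    n = len(y)
    total = 0.0
    for i in range(n):
        total += np.sum(np.abs(y[i] - y))
    return total / (2 * n**2 * np.mean(y))

def gini_vectorized(y):
    # O(n^2) memory: allocate the full n x n float64 matrix of pairwise differences
    n = len(y)
    diffs = np.abs(y[:, None] - y[None, :])
    return diffs.sum() / (2 * n**2 * np.mean(y))

y = np.array([1.0, 2.0, 3.0, 4.0])
print(gini_loop(y), gini_vectorized(y))   # both give the same coefficient
```

For n = 31,240 the loop version stays within a few hundred MiB, consistent with the profile above, while the vectorized version needs the ~8 GB matrix.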

mmcky commented 1 day ago

@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is whether the data is ordered -- should we take a random sample of 3000 or the first 3000 observations?

Thanks for digging into this with me. It's great we understand the problem now.

longye-tian commented 1 day ago

> @longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is whether the data is ordered -- should we take a random sample of 3000 or the first 3000 observations?
>
> Thanks for digging into this with me. It's great we understand the problem now.

Hi Matt @mmcky ,

Thank you for running the test. I think the dataset we got is not ordered. Here is a screenshot of the first 30 or so observations:

[Screenshot 2024-07-05 at 11:47:26 AM]

What do you think about this?

Best, Longye

mmcky commented 1 day ago

thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.
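A minimal sketch of that suggestion (the toy DataFrame and the seed value 42 are placeholders):

```python
import numpy as np
import pandas as pd

# toy DataFrame standing in for df_income_wealth
rng = np.random.default_rng(0)
df = pd.DataFrame({"n_wealth": rng.normal(size=10_000)})

# random sample of 3000 rows; random_state pins the seed for reproducibility
sample = df.sample(n=3000, random_state=42)
print(len(sample))   # 3000
```

Unlike slicing with [1:3000], this draws rows from across the whole dataset regardless of any ordering.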

longye-tian commented 22 hours ago

> thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.

Hi Matt @mmcky ,

Thank you for this suggestion! I just committed a change to incorporate it.

Best, Longye

mmcky commented 22 hours ago

thanks @longye-tian -- love your work. I think the sampling approach is good and tidy.