Closed: longye-tian closed this pull request 22 hours ago
| Name | Link |
|---|---|
| Latest commit | 6e2d53ef9ccc1b2c0b43214f624dbd06887c2ab0 |
| Latest deploy log | https://app.netlify.com/sites/taupe-gaufre-c4e660/deploys/668774623d4e3f00080ac468 |
| Deploy Preview | https://deploy-preview-498--taupe-gaufre-c4e660.netlify.app |
thanks @longye-tian this is looking really good.
I had a quick look and I think it would be great to add timing comparisons between the two functions to demonstrate how much quicker the vectorized code runs.
Hi Matt @mmcky ,
I just updated the solution and main text by adding `%%time` to show the computation time.
What do you think about this comparison?
Best, Longye
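For reference, the kind of loop-versus-vectorized timing comparison being discussed could be sketched as follows. The function names and test data here are illustrative, not the lecture's actual code:

```python
import time
import numpy as np

def gini_loop(y):
    # O(n^2) double loop over pairwise absolute differences
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += abs(y[i] - y[j])
    return total / (2 * n**2 * y.mean())

def gini_vectorized(y):
    # same pairwise sum, built as an (n, n) broadcasted matrix
    n = len(y)
    diffs = np.abs(y[:, None] - y[None, :])
    return diffs.sum() / (2 * n**2 * y.mean())

y = np.random.default_rng(0).exponential(size=2000)

t0 = time.perf_counter()
g1 = gini_loop(y)
t1 = time.perf_counter()
g2 = gini_vectorized(y)
t2 = time.perf_counter()

print(f"loop:       {t1 - t0:.3f}s, gini={g1:.4f}")
print(f"vectorized: {t2 - t1:.3f}s, gini={g2:.4f}")
```

Both functions return the same value; only the wall-clock times differ.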
@longye-tian just have an undefined label
- [ ] /home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient
Thank you very much Matt!
I just added the label:
(code:gini-coefficient)=
I hope this will pass the checks 👍
Best, Longye
@longye-tian the kernel is dying when testing against the Google Colab environment.
Would you mind running the notebook version of this PR on Google Colab to see if you can replicate this issue?
You can use jupytext to convert between the two formats.
https://manual.quantecon.org/writing/converting.html#myst-markdown-md-to-jupyter-notebook-ipynb
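A minimal sketch of the jupytext round trip. The `demo.md` file below is made up for illustration; for this PR the input would be `lectures/inequality.md`:

```shell
pip install jupytext
# a minimal markdown file to demonstrate the round trip
printf '# Demo\n\nSome text.\n' > demo.md
# markdown -> Jupyter notebook
jupytext --to ipynb demo.md        # writes demo.ipynb
# ...and back to MyST markdown after editing
jupytext --to myst demo.ipynb
```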
Hi Matt @mmcky ,
Thank you for this information. Google Colab returns the following error:
For the code:

```python
!pip install quantecon
import quantecon as qe

varlist = ['n_wealth',    # net wealth
           't_income',    # total income
           'l_income']    # labor income

df = df_income_wealth

# create lists to store Gini for each inequality measure
results = {}

for var in varlist:
    # create lists to store Gini
    gini_yr = []
    for year in years:
        # repeat the observations according to their weights
        counts = list(round(df[df['year'] == year]['weights']))
        y = df[df['year'] == year][var].repeat(counts)
        y = np.asarray(y)
        rd.shuffle(y)    # shuffle the sequence

        # calculate and store Gini
        gini = qe.gini_coefficient(y)
        gini_yr.append(gini)

    results[var] = gini_yr

# Convert to DataFrame
results = pd.DataFrame(results, index=years)
results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')
```
It returns:

```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-20-a7cb0b0c20b0> in <cell line: 32>()
     30 # Convert to DataFrame
     31 results = pd.DataFrame(results, index=years)
---> 32 results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

4 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/common.py in check_parent_directory(path)
    598     parent = Path(path).parent
    599     if not parent.is_dir():
--> 600         raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
    601
    602

OSError: Cannot save file into a non-existent directory: '_static/lecture_specific/inequality'
```
What do you think about this issue?
Best, Longye
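One generic workaround for this class of error (a sketch only; whether it is the right fix here depends on how the lecture ends up handling the static data) is to create the parent directory before writing:

```python
import os
import pandas as pd

out_dir = "_static/lecture_specific/inequality"
os.makedirs(out_dir, exist_ok=True)   # no-op if the directory already exists

# illustrative stand-in for the lecture's `results` DataFrame
results = pd.DataFrame({"n_wealth": [0.85, 0.86]}, index=[1950, 1953])
results.to_csv(os.path.join(out_dir, "example-gini.csv"), index_label="year")
```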
thanks @longye-tian. That makes sense.
Do we import that data later in the lecture, or in another lecture? If not, we could remove the `to_csv` method; otherwise my thought is:
- we should have the code that generates that static data as a notebook in `_static/lecture_specific/inequality/data.ipynb`
- remove the `to_csv` method from the lecture code.
Hi Matt @mmcky ,
I will look into it! Thank you for the idea.
Best, Longye
Hi Matt @mmcky ,
Just updated the data.ipynb and the code.
What do you think?
Best, Longye
thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?
Hi Matt @mmcky ,
In the same lecture, just after the CSV-saving code, we have

```python
ginis = pd.read_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_col='year')
```

which also reports an error in Google Colab. So I changed that code too.
Another issue I found in this lecture: when I run the updated version in Google Colab, some code in the exercise section, such as `gini_coefficient(data.n_wealth.values)`, takes 10 minutes to run and eventually crashes my Colab page after using all available RAM.
Best, Longye
thanks @longye-tian let me dive into this PR after lunch. I suspect the initial thought was to fetch the raw data from GitHub directly (and there is probably a `no-execute` tag on the data-generating cell). I'll review this.
@longye-tian I have made some adjustments and simplifications now that the `data.ipynb` has been committed. The code block that contained the data-generating code had a `skip-execution` tag, but Google Colab doesn't understand that context, so I have switched to a download role.
This PR won't execute as the data is not yet available on GitHub. But if you are happy with everything else I can merge and test.
@longye-tian can you test out this latest version of the lecture on Google Colab? The Jupyter kernel is dying in our test but I'm not sure what would be driving that now. Thanks @longye-tian
Hi Matt @mmcky ,
I just tested in Google Colab; I think the build failure with Google Colab is because the dataset used in the exercise, `data.n_wealth.values`, is too large.
When running the code in Google Colab, this exhausts the RAM and leads to a crash.
I have changed the code to compute with only 3,000 observations rather than the whole dataset (30,000+ obs):

```python
gini_coefficient(data.n_wealth.values[1:3000])
gini(data.n_wealth.values[1:3000])
```

This resolves the current build failure. What do you think about this change?
Best, Longye.
Many thanks @longye-tian for your hard work on this. @mmcky , I'll leave this one for you to merge when ready. There's a small comment above.
I recommend that we go for the simplest options at each stage, focusing on ease of maintenance, not the most ambitious.
@longye-tian the source data file is only 31 MB so I don't understand why Google Colab would not be able to process it.
A general rule of thumb for `pandas` is that you need 3x RAM for the amount of data you are processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.
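As a quick sanity check on the raw data size, a DataFrame of roughly the lecture's shape (assumed here: 31,240 rows and three float columns; the real SCF frame has more columns) can report its own memory footprint:

```python
import numpy as np
import pandas as pd

# illustrative frame roughly matching the lecture's dimensions
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(31_240, 3)),
                  columns=["n_wealth", "t_income", "l_income"])

mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.2f} MB")   # well under a megabyte per float64 column
```

This confirms the in-memory frame itself is tiny; the blow-up must come from something the code constructs, not from the data.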
@longye-tian I am currently running a `%prun gini_coefficient(data.n_wealth.values)` and will post the results here when they come in.
Hi Matt @mmcky ,
The data length for `data.n_wealth.values` is 31240.
And the Gini function uses vectorization, which creates a matrix of size (31240 x 31240); at 8 bytes per float64, that requires around 8 GB of memory.
I think that could be a potential reason.
Best, Longye
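The back-of-the-envelope arithmetic checks out:

```python
n = 31_240                       # length of data.n_wealth.values
bytes_per_float64 = 8
matrix_gb = n * n * bytes_per_float64 / 1e9
print(f"{matrix_gb:.1f} GB")     # size of the (n, n) pairwise-difference matrix
```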
> we use vectorization which create a matrix of size (31240*31240) with 8 bytes per float64
Nice investigative work @longye-tian. Spot on -- that's a big matrix!
I had just profiled the non-vectorized code, which looked fine from a memory perspective:

```
peak memory: 266.78 MiB, increment: 1.45 MiB
```
this is a really nice example of the tradeoffs between compute and memory :-)
@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs?
Thanks for digging into this with me. It's great we understand the problem now.
Hi Matt @mmcky ,
Thank you for running the test. I think the dataset we got is not ordered. Here is a screenshot of the first 30ish observations:
What do you think about this?
Best, Longye
thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.
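The suggestion amounts to something like the following. The DataFrame here is a synthetic stand-in for the lecture's SCF data, and the seed value is arbitrary:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the lecture's dataset
rng = np.random.default_rng(1234)
data = pd.DataFrame({"n_wealth": rng.exponential(scale=1e5, size=31_240)})

# seeded random sample of 3,000 rows, reproducible across runs
sample = data.sample(n=3000, random_state=1234)
```

Passing `random_state` makes the sampled rows identical on every build, so the lecture's reported Gini values stay stable.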
Hi Matt @mmcky ,
Thank you for this suggestion! Just committed to incorporate this.
Best, Longye
thanks @longye-tian -- love your work. I think the sample approach is good and tidy.
Hi Matt @mmcky ,
I have updated exercise 3 of the inequality lecture using your code in #410 and added the simulation part below your solution.
What do you think about this version of the solution?
Best, Longye