WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

pandas v0.24 performance issues #892

Open jacksonjreed opened 5 years ago

jacksonjreed commented 5 years ago

We noticed a serious time difference in our stats routines between the server and our local copies. As it turns out, this is a result of a pandas version difference. Lines 197 and 203 of stats_model.py run as much as 10x slower on pandas v0.24.2 (which is on our dev boxes) than on v0.23.4 (which is on our servers). Compare that to line 200, which runs relatively fast on both versions: the addition of eq() or ne() is the key difference.

Here's some timing data (in seconds) that I grabbed on both versions:

0.24.2
getting labels: 2.9769787788391113
setting up data frame: 0.010141849517822266
tokens that appear once: 24.94732117652893
total # tokens: 1.0076451301574707
distinct # tokens: 21.80290699005127
average: 0.0008835792541503906

0.23.4
getting labels: 2.9242076873779297
setting up data frame: 0.005937099456787109
tokens that appear once: 2.862314224243164
total # tokens: 1.5265724658966064
distinct # tokens: 1.0599994659423828
average: 0.01018071174621582
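
To put numbers on the gap, the per-step slowdown can be computed directly from the timings above. This is only a small arithmetic sketch: the values are the reported timings rounded to three decimals, and the dictionary layout is made up for illustration. It suggests roughly 9x for "tokens that appear once" and roughly 20x for "distinct # tokens", while "total # tokens" is not slower on 0.24.2.

```python
# Timings in seconds, copied from the comment above and rounded to three decimals.
timings = {
    "tokens that appear once": {"0.24.2": 24.947, "0.23.4": 2.862},
    "total # tokens": {"0.24.2": 1.008, "0.23.4": 1.527},
    "distinct # tokens": {"0.24.2": 21.803, "0.23.4": 1.060},
}

# Ratio > 1 means the step is slower on 0.24.2 than on 0.23.4.
slowdown = {
    label: seconds["0.24.2"] / seconds["0.23.4"]
    for label, seconds in timings.items()
}

for label, ratio in slowdown.items():
    print(f"{label}: {ratio:.1f}x")
```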

czhang03 commented 5 years ago

Can you give me the links to these two lines? I will try to look into it.

czhang03 commented 5 years ago

Okay, I have looked into it. It would be nice to post the link to the line instead of just the line number.

Would you please test whether the performance hit is caused by the eq and ne methods or by the sum method?
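
One way to separate the two costs is to time the comparison and the reduction independently. The sketch below is an assumption-heavy stand-in, not the Lexos code: `dtm` is a hypothetical random document-term matrix, `timed` is a helper invented here, and the real statements in stats_model.py may differ.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for the document-term matrix:
# rows are documents, columns are tokens, values are counts.
rng = np.random.default_rng(0)
dtm = pd.DataFrame(rng.integers(0, 5, size=(50, 20000)))


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Step 1: the comparison alone (no sum involved).
mask, eq_seconds = timed(lambda: dtm.eq(1))
# Step 2: the sum over the boolean frame produced by the comparison.
hapax, mask_sum_seconds = timed(lambda: mask.sum(axis=1))
# Baseline: the sum over the original integer frame.
total, plain_sum_seconds = timed(lambda: dtm.sum(axis=1))

print(f"eq(1) only:       {eq_seconds:.4f}s")
print(f"sum of bool mask: {mask_sum_seconds:.4f}s")
print(f"sum of counts:    {plain_sum_seconds:.4f}s")
```

Running this under both pandas versions would show whether the extra time is spent in the comparison itself or in summing the boolean result, which the line 200 comparison alone cannot distinguish.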

jacksonjreed commented 5 years ago

https://github.com/WheatonCS/Lexos/blob/c439954d1c6697cbeb32cb8c126fdece0a051164/lexos/models/stats_model.py#L193-L207

Here's the link, sorry about that! My reason for assuming that sum() is not responsible for the performance hit is that the call on line 200, which does include a sum() but not an eq() or ne(), runs fast on both versions.

czhang03 commented 5 years ago

Reasonable. We need to replace the eq and ne then.

According to the documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eq.html#pandas.DataFrame.eq eq is the same as ==, so maybe try changing it to the operator?

If this does not help, we will try taking the values out of the dataframe to see whether the slowdown is caused by numpy.
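
The three variants discussed here can be sketched side by side. This assumes the slow statements have the shape `frame.eq(1).sum(axis=1)` and `frame.ne(0).sum(axis=1)`, which is inferred from the thread rather than copied from stats_model.py; `dtm` is a tiny made-up matrix. The sketch only shows that the variants produce the same counts; whether any of them is faster on 0.24.2 still has to be measured.

```python
import pandas as pd

# Tiny made-up document-term matrix: rows are documents, columns are tokens.
dtm = pd.DataFrame(
    [[1, 0, 2], [0, 1, 1], [3, 0, 0]],
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)

# Tokens that appear exactly once per document, three ways.
via_method = dtm.eq(1).sum(axis=1)        # flexible method form
via_operator = (dtm == 1).sum(axis=1)     # plain operator form
via_numpy = pd.Series(                    # bypass pandas for the comparison
    (dtm.values == 1).sum(axis=1), index=dtm.index
)

# Distinct tokens per document, using the raw numpy values.
distinct_via_numpy = pd.Series((dtm.values != 0).sum(axis=1), index=dtm.index)

print(via_method.tolist(), via_operator.tolist(), via_numpy.tolist())
print(distinct_via_numpy.tolist())
```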

czhang03 commented 5 years ago

Another way to go is to use np.ones instead of 1. But make it very clear in the comment that this is to resolve the performance hit in pandas 0.24.2.
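
One possible reading of this suggestion is to compare against an array of ones with the same shape as the frame instead of the scalar 1. The sketch below is illustrative only: `dtm` is a made-up matrix, and it has not been checked whether this form actually avoids the slowdown on 0.24.2.

```python
import numpy as np
import pandas as pd

# Made-up document-term matrix for illustration.
dtm = pd.DataFrame([[1, 0, 2], [0, 1, 1], [3, 0, 0]])

# NOTE: comparing against np.ones instead of the scalar 1 is a workaround for
# the eq()/ne() performance hit observed on pandas 0.24.2. It is not a style
# choice and can be reverted once the slowdown is resolved.
ones = np.ones(dtm.shape, dtype=dtm.values.dtype)
hapax = dtm.eq(ones).sum(axis=1)

print(hapax.tolist())
```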

jacksonjreed commented 5 years ago

An update on this issue (which I still plan on looking into some more): Topwords seems to be broken on v0.23.*, so we will definitely want to update the server, which makes investigating the performance hit even more important.

mleblanc321 commented 5 years ago

Jackson, why doesn't Caleb update the server to 0.24.x?

jacksonjreed commented 5 years ago

I think it is on 0.24 now.