Closed by GoogleCodeExporter 9 years ago
Use this diff instead; there was one line that was off in the previous one.
Original comment by mrj...@gmail.com
on 17 Aug 2011 at 1:36
Attachments:
After just glancing at -- but not yet running -- the changes, a lot of the
patch looks pretty good for performance improvements. For the benefit of
others, it looks like the patch:
* replaces most lists with sets, since they're just used for membership tests
* moves methods out of classes to avoid method resolution
* ditches silly string formatting code that uses `locals()`
* replaces `has_key()` membership tests with `in` tests
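For illustration, the list-to-set switch amounts to something like the following (the names here are hypothetical, not taken from the actual diff):

```python
# Hypothetical before/after illustrating the list-to-set change.
# Membership tests on a list are O(n); on a set they average O(1).

# Before: a list scanned linearly on every lookup
ACCEPTABLE_ELEMENTS_LIST = ['a', 'abbr', 'acronym', 'b', 'blockquote']

# After: a set with constant-time membership tests
# (set([...]) rather than a set literal, to stay Python 2.4 compatible)
ACCEPTABLE_ELEMENTS = set(['a', 'abbr', 'acronym', 'b', 'blockquote'])

def is_acceptable(tag):
    return tag in ACCEPTABLE_ELEMENTS
```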
Removing the `locals()` calls was already on my todo list, but I'm really
interested in the list/set switch. I also have a few thoughts pertaining to
compatibility:
`any()` was only introduced in Python 2.5. I expect it would be better than
using the current `filter()` code, but it will be necessary to conditionally
create a pure-Python fallback for Python 2.4. It's a simple addition, though,
so I'm delighted with the more readable code!
If I recall correctly the `has_key()` garbage is necessary for one or two of
the older supported Python versions' HTTP headers or XML attributes objects;
the objects are dict-alikes that don't actually support the `in` tests. This
absurdity was fixed in later Python versions, but older Pythons require
`has_key()` membership tests (and calls to `keys()` for that matter). That's my
memory of why I didn't replace the `has_key()` calls, anyway.
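Had such dict-alikes turned out to be real, a defensive membership helper could have bridged the gap (purely illustrative; no such helper is in the patch):

```python
# Illustrative helper: prefer the `in` operator, but fall back to
# has_key() for hypothetical dict-alikes that predate `in` support.
def contains_key(mapping, key):
    try:
        return key in mapping
    except TypeError:
        return mapping.has_key(key)
```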
Finally, I'm not immediately sold on moving the `unique()`, `normalize()`, and
`toISO8601()` methods into the global namespace, but I won't argue if it gives
performance-conscious developers ideas about how to further speed up feedparser!
I'm grateful for the patch, and I'm looking forward to testing it!
Original comment by kurtmckee
on 17 Aug 2011 at 6:39
Great, thanks for the quick response. I agree with your comments about older
Python versions.
One question for you, though -- on the "polluting the namespace" issue, is that
to handle the `from feedparser import *` case? If so, can't you just define
`__all__ = ["parse"]`?
Original comment by mrj...@gmail.com
on 17 Aug 2011 at 7:01
I've incrementally applied most of your patch, culminating in r568.
Of note, I skipped the addition of `any()`, and didn't move the
microformat-related functions into the global namespace. My reasoning was that
the microformat code relies on BeautifulSoup, which isn't Python 3 compatible
and probably will never be. The microformat code is also a major source of
crashes and unicode problems, and I want to replace it. Consequently, it seems
better to keep those functions with the class that will eventually be revamped
or completely replaced.
I also discovered that the `has_key()` concerns I noted above were unfounded. I
guess my memory is of something else.
I do have a question for you, though! How are you measuring performance
improvements? I run the unit tests across Python 2.4 through Python 3.2 and see
wild variations if I'm watching a video, visiting a JavaScript-heavy webpage, or
stopping my music player. cProfile reveals algorithmic issues, but isn't very
useful for measuring performance improvements.
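One common way to isolate improvements from background noise is the stdlib's timeit module, keeping the minimum of several repeated runs (a generic sketch, not tied to feedparser's test suite):

```python
import timeit

# Take the minimum of several repetitions: the best run is the one
# least disturbed by background load (videos, music players, etc.).
def benchmark(stmt, setup='pass', repeat=5, number=10000):
    timer = timeit.Timer(stmt, setup=setup)
    return min(timer.repeat(repeat=repeat, number=number))

# Example: compare list vs set membership tests.
list_time = benchmark("'x' in data", setup="data = list('abcdefgx')")
set_time = benchmark("'x' in data", setup="data = set('abcdefgx')")
```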
And finally, in the commit messages I've credited your Google username; if you
provide your real name I'd be delighted to give you better credit!
Original comment by kurtmckee
on 28 Aug 2011 at 5:58
Hi Kurt,
Thanks for applying the changes. Have you looked at the locals() change also?
I was inspired by a post about performance of version 5.0
(http://blog.yjl.im/2011/02/python-universal-feed-parser-5-speed.html). For
testing it, I used those Blogger test files he uses as well as a few feeds like
Techcrunch. I also made sure the unit tests didn't run slower (albeit those
aren't really set up for performance testing).
If you'd like to credit my real name, you can mention "John Benediktsson".
By the way, I've gotten a 45% speedup using lxml for parsing microformats
instead of BeautifulSoup. It passes all the unit tests, too. I'm tinkering
with a full lxml port of feedparser that should take about 5-10% of the time
the current one does, although I'm struggling with one possible bug in lxml
right now. Are you interested?
Best,
John.
Original comment by mrj...@gmail.com
on 28 Aug 2011 at 6:06
I definitely applied the locals() changes. :)
Thanks for linking me to that blog entry; the author's Python 2.5 vs 2.6
numbers indicate that he very likely has BeautifulSoup in his Python 2.6 path,
which is probably why he's seeing quadrupled run times, but I'll post a better
analysis on my own blog one of these days. The ponderous feed files will be
great to run through cProfile, too.
Performance is important to me, and is only second to readable, maintainable
code, so yes, I'm interested in your lxml work, and keep them patches a-comin'!
Original comment by kurtmckee
on 28 Aug 2011 at 7:33
Original issue reported on code.google.com by
mrj...@gmail.com
on 17 Aug 2011 at 1:31
Attachments: