aarddict / tools

Tools for Aard Dictionary
GNU General Public License v3.0
14 stars 13 forks source link

We already use the separate cssselector module #39

Closed ShadowKyogre closed 9 years ago

ShadowKyogre commented 9 years ago

So why not use it? It also prevents function errors when one tries to use the :contains selector to remove elements that don't have classes nor IDs. This also prevents mwcouch from spewing a load of unicode errors.

itkach commented 9 years ago

"Why not use it" is a pretty weak argument. Can you elaborate on what problems does this change solve? ":contains" selector, while not present in final CSS3, is mentioned as supported in cssselect docs. If it doesn't work then the issue should be reported to cssselect devs. Can you also provide some details on "spewing a load of unicode errors"? I fail to see how/why this change would address something like this. If it's an issue with cssselect it should be reported to cssselect. If it's an issue with mwcouch I'd like to understand it better. Finally, .aar format is being replaced by https://github.com/itkach/slob, with https://github.com/itkach/mwscrape2slob being a replacement for mwcouch converter, consider hacking on mwscrape2slob instead. Issues you bring up most likely equally apply to mwscrape2slob which is why I would like to understand them better.

ShadowKyogre commented 9 years ago

What I meant by "why not use it" was that before I edited the code, it was using the cssselect from lxml, not from the separate cssselect module as specified in the setup.py's requirements.I was working with a large data set (more specifically, the list of cards on http://yugioh.wikia.com ) as part of an attempt to see if I could lay out the pages on there in an offline, mobile friendly format after picking up a copy of the old version of Aarddict from F-Droid (due to a recent craze I had to find all the neat open source applications that I could for my phone). To help reduce the amount of clutter on the rendered pages gathered with mwscrape, I used this filter set with the mwcouch compiler.

.sysop-show
.cardtablespanrow:contains("sets")
.cardtablespanrow:contains("Card appearances")
tr:contains("Anime") + tr
tr:contains("Anime")
tr:contains("Video game statuses") + tr
tr:contains("Video game statuses")
table.collapsible.hslit:contains("pages")

Without the changes in the pull request, only four of the ~9000 cards (those include anime exclusive cards) were able to make it through the compilation process because of two errors: the lower-case XPath function not being defined (when I used the :contains selector) and the unicode problems mentioned earlier. All of the unicode errors which were some variation of invalid start byte. I suspect this is due to the all the unicode characters in the card pages.

If needed, I can do a recompilation with the unmodified version and mwscrape2slob and give you the logs for the precise error messages. A quick peak at the mwscrape2slob source code makes me think that this issue would also appear there due to it using the internal lxml.cssselect and not the external cssselect module.

On 02/15/2015 08:00 PM, itkach wrote:

"Why not use it" is a pretty weak argument. Can you elaborate on what problems does this change solve? ":contains" selector, while not present in final CSS3, is mentioned as supported in cssselect docs. If it doesn't work then the issue should be reported to cssselect devs. Can you also provide some details on "spewing a load of unicode errors"? I fail to see how/why this change would address something like it. If it's an issue with cssselect it should be reported to cssselect. It it's an issue with mwcouch I'd like to understand it better. Finally, .aar format is being replaced by https://github.com/itkach/slob, with https://github.com/itkach/mwscrape2slob being a replacement for mwcouch converter, consider hacking on mwscrape2slob instead. Issues you bring up most likely equally apply to mwscrape2slob which is why I would like to understand them better.

— Reply to this email directly or view it on GitHub https://github.com/aarddict/tools/pull/39#issuecomment-74457348.

itkach commented 9 years ago

lxml.cssselect is a thin wrapper around cssselect which basically does what you do "manually" in this change - converts css selectors to xpath. :contains() selector appears to be working just fine in mwscrape2slob. mwscrape2slob is written in Python 3 while aardtools/mwcouch is Python 2 so unicode errors in mwcouch may be due to differences in unicode handling. Please try mwscrape2slob and report errors, if any, in mwscrape2slob issue tracker.