problem with HTML entities

caigner commented 5 years ago

I create themed crossword puzzles. For that I need wordlists. So I use CeWL to gather words from websites. I use the German language with special characters äöüß.

I noticed that when retrieving words from a website with German words (which were written as HTML entities) CeWL split the words at the HTML entities, removing the HTML entities.

Example: The (Austrian) German word for potatoes is

Erdäpfel or
Erd&auml;pfel in HTML entity notation

CeWL retrieved the word and split it:

Erd
pfel

Please add an option to convert HTML entities, so that words in languages other than English can also be retrieved correctly.
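To make the request concrete, here is a minimal sketch of decoding named entities before splitting text into words. It is not CeWL's code: `GERMAN_ENTITIES`, `decode_entities` and `extract_words` are hypothetical names, and the table covers only the German special characters mentioned above. A real implementation would more likely rely on an HTML parser for the full entity set.

```ruby
# Hypothetical helper, not part of CeWL: decode a handful of named
# German entities before extracting words.
GERMAN_ENTITIES = {
  "auml" => "ä", "ouml" => "ö", "uuml" => "ü",
  "Auml" => "Ä", "Ouml" => "Ö", "Uuml" => "Ü",
  "szlig" => "ß"
}.freeze

# Replace known named entities; leave unknown ones untouched.
def decode_entities(text)
  text.gsub(/&(\w+);/) { GERMAN_ENTITIES.fetch($1, $&) }
end

# Decode first, then collect runs of letters (Unicode-aware).
def extract_words(text)
  decode_entities(text).scan(/[[:alpha:]]+/)
end

p extract_words("Erd&auml;pfel aus &Ouml;sterreich")
# => ["Erdäpfel", "aus", "Österreich"]
```

Decoding before splitting is the important part: once the entity has become a letter, the word is no longer broken at the `&` and `;` characters.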

digininja commented 5 years ago

Do you have an example of a site that has some words on that cause a problem so I can do some testing?

caigner commented 5 years ago

You can try www.creative-geocaching.at

cewl -v -c -d 1 -m 3 -w wordlist.txt http://www.creative-geocaching.at

There is the word "Österreich" which gets truncated to "sterreich", and the word "Datenschutzbehörde", which gets split into "Datenschutzbeh" and "rde".

caigner commented 5 years ago

I just looked, and the text on this website contains no HTML entities. So the problem might be that CeWL can't handle the UTF-8 charset? Could that be it?
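The symptoms are what an ASCII-only word pattern would produce on UTF-8 text. The sketch below only illustrates that idea. Both regular expressions and the method names are assumptions, not taken from CeWL, and the length filter simply drops fragments shorter than three characters.

```ruby
# Illustration only: these patterns are assumptions, not CeWL's code.

# ASCII-only letter class: umlauts and ß act as word separators.
def ascii_words(text, min_length = 3)
  text.scan(/[a-zA-Z]+/).select { |w| w.length >= min_length }
end

# POSIX bracket class: Unicode-aware in Ruby, so words stay whole.
def unicode_words(text, min_length = 3)
  text.scan(/[[:alpha:]]+/).select { |w| w.length >= min_length }
end

text = "Österreich Datenschutzbehörde Hauptstraße"

p ascii_words(text)
# => ["sterreich", "Datenschutzbeh", "rde", "Hauptstra"]
p unicode_words(text)
# => ["Österreich", "Datenschutzbehörde", "Hauptstraße"]
```

The first result matches the truncated and split words described above, which is why a character-class or encoding problem looks plausible.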

digininja commented 5 years ago

That is very likely the problem.

I'm British and in the UK, so I don't have to worry about anything other than basic ASCII, which means most of my tools break when they encounter any other type of encoding. This has come up before and I thought it had been fixed, but I'll have another look. What I really need is a developer who understands encoding, can probably fix it all in a couple of lines of code, and can then explain to me what they've done so I can reproduce it in the future.

caigner commented 5 years ago

Unfortunately I know nothing about Ruby. :-(

digininja commented 5 years ago

I keep trying that excuse but no one believes me :)

caigner commented 5 years ago

Right now I am running CeWL with the -v option. And on the screen the text is absolutely correct with all äöüÄÖÜß in the right places. So something must happen when you process the words.
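One way text can look right on screen and still be split wrongly is when the string holds correct UTF-8 bytes but is tagged as binary. The terminal then renders the bytes correctly while the regular expression sees only ASCII letters. This is a guess at the mechanism, not a diagnosis of CeWL. `words_from_bytes` and `words_from_utf8` are hypothetical names.

```ruby
# Hypothetical illustration, not CeWL's code.

# Scan a string as-is. On a binary (ASCII-8BIT) string, bytes above 127
# are not treated as letters, so multi-byte characters split the word.
def words_from_bytes(raw)
  raw.scan(/[[:alpha:]]+/)
end

# Re-tag the same bytes as UTF-8 before scanning. No bytes are changed.
def words_from_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scan(/[[:alpha:]]+/)
end

raw = "Österreich".b  # same bytes, but tagged as binary

p words_from_bytes(raw)
# => ["sterreich"]
p words_from_utf8(raw)
# => ["Österreich"]
```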

digininja commented 5 years ago

I just tried your command from earlier and got Hauptstraße in my wordlist. Do you get that?

caigner commented 5 years ago

Your output looks ok. My output looks like this:

Daten, 6
Sie, 6
uns, 5
Ihre, 5
der, 5
Geocache, 4
Christian, 4
Aigner, 4
Innovatives, 3
mit, 3
die, 3
und, 3
Creative, 2
Geocaching, 2
Design, 2
rung, 2
Ihrer, 2
ist, 2
Wir, 2
Rahmen, 2
wir, 2
Kontakt, 2
Wenn, 2
bei, 2
das, 2
Verarbeitung, 2
rde, 2
Geocaches, 1
Wow, 1
Faktor, 1
Einzelunternehmer, 1
Hauptstra, 1
Kaltenleutgeben, 1
UID, 1
ATU, 1
Datenschutzerkl, 1
Erkl, 1

digininja commented 5 years ago

I assume from the command line you gave that you aren't using the GitHub version. Check that out and try it.

caigner commented 5 years ago

Ok, will do that.

caigner commented 5 years ago

Previously I had installed cewl-5.4.3 from the Gentoo repository. Now I have downloaded it from GitHub, and it works! At first it complained that the mime gem was missing, so I installed it. I am happy! :-)

digininja commented 5 years ago

I've not been a Gentoo user for years so don't keep track of their versions.

The quick way to install all the gems is to run

bundle install

from the CeWL directory. That will take care of all the dependencies.

Glad it is working and just as glad I don't have to go fighting with encoding again!

caigner commented 5 years ago

Thanks for your quick help! :-)

digininja / CeWL

problem with HTML entities #44