Closed caigner closed 4 years ago
Do you have an example of a site that has some words on that cause a problem so I can do some testing?
On Thu, 29 Nov 2018 at 11:49, Christian Aigner notifications@github.com wrote:
I create themed crossword puzzles. For that I need wordlists. So I use CeWL to gather words from websites. I use the German language with special characters äöüß.
I noticed that when retrieving words from a website with German words (which were written as HTML entities) CeWL split the words at the HTML entities, removing the HTML entities.
Example: The (Austrian) German word for potatoes is
Erdäpfel, or
Erd&auml;pfel in HTML entity notation.
CeWL retrieved the word and split it:
Erd
pfel
Please add an option to convert HTML entities, so that words in languages other than English can also be retrieved correctly.
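The requested behaviour can be sketched roughly as follows. This is a minimal illustration in Python, not CeWL's actual (Ruby) code: the function name `extract_words` and the "naive" reproduction of the split are assumptions made only for this example. The idea is to decode entities first and split into words afterwards.

```python
import html
import re

RAW = "Erd&auml;pfel"

# Hypothetical reproduction of the reported symptom: the entity is
# dropped before the text is split, so the word falls apart.
naive = re.sub(r"&#?\w+;", " ", RAW).split()

def extract_words(markup_text):
    """Decode HTML entities first, then return runs of Unicode letters."""
    decoded = html.unescape(markup_text)  # "Erd&auml;pfel" -> "Erdäpfel"
    # [^\W\d_] matches a word character that is neither a digit nor an
    # underscore, i.e. a letter, including ä, ö, ü and ß.
    return re.findall(r"[^\W\d_]+", decoded)

print(naive)               # ['Erd', 'pfel']
print(extract_words(RAW))  # ['Erdäpfel']
```

Decoding before tokenising keeps the letter inside the word, which is why the order of the two steps matters.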
You can try www.creative-geocaching.at
cewl -v -c -d 1 -m 3 -w wordlist.txt http://www.creative-geocaching.at
There is the word "Österreich" which gets truncated to "sterreich", and the word "Datenschutzbehörde", which gets split into "Datenschutzbeh" and "rde".
I just looked, and the text on this website contains no HTML entities. So the problem might be that CeWL can't handle the UTF-8 charset? Could that be it?
That is very likely the problem.
I'm British and in the UK, so I don't have to worry about anything other than basic ASCII, which means most of my tools break when they encounter any other type of encoding. This has come up before and I thought it had been fixed, but I'll have another look. What I really need is a developer who understands encoding, can probably fix it all in a couple of lines of code, and can then explain to me what they've done so I can reproduce it in the future.
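One way this kind of truncation can arise is sketched below. It is written in Python for illustration only and is an assumption about the mechanism, not a description of CeWL's code. An ASCII-only letter class treats Ö, ö and ß as separators, while a Unicode-aware class keeps the words whole. The minimum length of 3 mirrors the `-m 3` option used in the command above.

```python
import re

text = "Österreich Datenschutzbehörde Hauptstraße"
MIN_LENGTH = 3  # mirrors -m 3 from the command line in this thread

# ASCII-only letters: every non-ASCII letter ends the current word.
ascii_words = [w for w in re.findall(r"[A-Za-z]+", text)
               if len(w) >= MIN_LENGTH]

# Unicode-aware letters: [^\W\d_] is "word character, but not a digit
# or underscore", which covers ä, ö, ü, ß and their capitals.
unicode_words = [w for w in re.findall(r"[^\W\d_]+", text)
                 if len(w) >= MIN_LENGTH]

print(ascii_words)    # ['sterreich', 'Datenschutzbeh', 'rde', 'Hauptstra']
print(unicode_words)  # ['Österreich', 'Datenschutzbehörde', 'Hauptstraße']
```

The ASCII result matches the fragments reported in this thread ("sterreich", "Datenschutzbeh", "rde"); the trailing "e" of "Hauptstraße" is dropped by the length filter.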
Unfortunately I know nothing about Ruby. :-(
I keep trying that excuse but no one believes me :)
Right now I am running CeWL with the -v option. And on the screen the text is absolutely correct with all äöüÄÖÜß in the right places. So something must happen when you process the words.
I just tried your command from earlier and got Hauptstraße in my wordlist, do you get that?
Your output looks ok. My output looks like this:
Daten, 6
Sie, 6
uns, 5
Ihre, 5
der, 5
Geocache, 4
Christian, 4
Aigner, 4
Innovatives, 3
mit, 3
die, 3
und, 3
Creative, 2
Geocaching, 2
Design, 2
rung, 2
Ihrer, 2
ist, 2
Wir, 2
Rahmen, 2
wir, 2
Kontakt, 2
Wenn, 2
bei, 2
das, 2
Verarbeitung, 2
rde, 2
Geocaches, 1
Wow, 1
Faktor, 1
Einzelunternehmer, 1
Hauptstra, 1
Kaltenleutgeben, 1
UID, 1
ATU, 1
Datenschutzerkl, 1
Erkl, 1
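For readers unfamiliar with this output format, the following Python sketch shows how a "word, count" list of this shape could be produced. The helper `count_words` and the sample sentence are hypothetical and only illustrate counting with a minimum word length; they are not taken from CeWL.

```python
from collections import Counter
import re

def count_words(text, min_length=3):
    """Return (word, count) pairs, most frequent first."""
    words = [w for w in re.findall(r"[^\W\d_]+", text)
             if len(w) >= min_length]
    # most_common() sorts by count; ties keep first-seen order.
    return Counter(words).most_common()

sample = "Daten und Daten, Sie und uns: Daten"
for word, count in count_words(sample):
    print(f"{word}, {count}")
```

Because counting happens after the text is split, a word that is split in the wrong place shows up as separate fragments with their own counts, as with "rde" and "Hauptstra" above.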
I assume from the command line you gave that you aren't using the GitHub version. Check that out and try it.
Ok, will do that.
Previously I had installed cewl-5.4.3 from the Gentoo repository. Now I downloaded it from GitHub. And it works! At first it complained that the mime gem was missing, so I installed it. I am happy! :-)
I've not been a Gentoo user for years so don't keep track of their versions.
The quick way to install all the gems is to run
bundle install
from the CeWL directory; that will take care of all the dependencies.
Glad it is working and just as glad I don't have to go fighting with encoding again!
Thanks for your quick help! :-)