Closed stahnirockt closed 4 years ago
Are you saying 5.4.2 isn't getting words off the front page?
On Sun, 4 Mar 2018, 16:42 stahnirockt, notifications@github.com wrote:
When using the preinstalled cewl (version 5.3) on Kali, I can use -d 0 to get results only from the webpage I want. After cloning version 5.4.2 from GitHub, I get no entries with -d 0, only with -d 1, and then I get results only from the "subpages", not from the page I wanted.
Yes, that's what I wanted to say. I've tried a bit more and the problem does not occur on all websites. For example, https://github.com gives the same result in both versions. But https://en.wikipedia.org/wiki/Computer does not give front-page results in version 5.4.2, although it does in version 5.3. I get the same result with every Wikipedia entry. Any idea what the problem could be? I tried this on Mac and Linux, with the same results.
I've not got a Kali box to try it on, but I'll make sure that the depth feature works as expected on the GitHub master.
It will probably be a couple of days before I can look at it, though. If I don't get back to you by the end of the week, give me a nudge.
Today I tried a little further. If I create a website with a link to a Wikipedia entry and append '-d 1' and '-o', I get the results of the wanted page.
Also, commenting out line 706 seems to solve the problem for me:
# The spider doesn't work properly if there isn't a / on the end
if url !~ /\/$/
# url = "#{url}/"
end
It was also commented out in version 5.3.
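To make the effect of that line concrete, here is a minimal sketch of the same check as a standalone method. The name `ensure_trailing_slash` is hypothetical and only for illustration; it is not how CeWL structures its code.

```ruby
# Hypothetical helper that mirrors the check quoted above.
# It is an illustration only, not CeWL's actual code layout.
def ensure_trailing_slash(url)
  # Append "/" only when the URL does not already end with one.
  url !~ /\/$/ ? "#{url}/" : url
end

# A bare host: the slash only makes the root path explicit.
puts ensure_trailing_slash("https://github.com")

# A URL with a path: the slash changes the path that is requested.
puts ensure_trailing_slash("https://en.wikipedia.org/wiki/Computer")
```

For https://github.com the result still points at the site root, while for the Wikipedia entry the requested path becomes /wiki/Computer/ instead of /wiki/Computer. That difference may be why only some sites are affected, but this is a guess and not something confirmed in this thread.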
That change went in because another issue was raised saying it was stopping spidering from working with it there. I'll have to do some proper digging. My guess is that it has something to do with sites that automatically redirect from URLs without a trailing slash to ones with a slash.
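One possible middle ground, sketched here purely as an assumption and not as the fix that went into CeWL, would be to add the slash only when the URL has no path at all. The method name `normalize_start_url` is hypothetical, and whether this would also satisfy the other issue mentioned above is unknown.

```ruby
require "uri"

# Hypothetical alternative: add "/" only when the URL has no path,
# so URLs that already name a page are left untouched.
def normalize_start_url(url)
  uri = URI.parse(url)
  # An empty path means the URL is just scheme and host.
  uri.path = "/" if uri.path.nil? || uri.path.empty?
  uri.to_s
end

puts normalize_start_url("https://github.com")
puts normalize_start_url("https://en.wikipedia.org/wiki/Computer")
```

With this sketch the bare host gains a trailing slash, while the Wikipedia URL is passed through unchanged. Sites that redirect a slashless path to a slashed one would still rely on the spider following that redirect.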
I assume it is all working now, as I've not had any recent complaints.