asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Scraping iframes, base64, VB scripts #328

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

How do I extract iframes, base64, VB scripts, and other self-executing scripts 
using crawler4j?

What is the expected output? What do you see instead?
I want to get the above-mentioned embedded code for a given HTML page.

What version of the product are you using?

Please provide any additional information below.
What I have done so far:

I have gone through almost all the classes in the source code and found the 
HtmlContentHandler class. I have created a new class (HtmlContentHandler) in the 
Controller class. From the visit method I create an HtmlContentHandler object, 
and up to here everything is fine. The problem is with the startElement method 
of the HtmlContentHandler class: I do not understand what parameter values I 
should pass to it from the visit method. As sample code, I have attached a file 
with the Controller class. Any help will be appreciated!

Original issue reported on code.google.com by yenumula...@gmail.com on 17 Dec 2014 at 10:06

Attachments:

GoogleCodeExporter commented 9 years ago
Can you provide a URL that contains the things you want to scrape?

Original comment by avrah...@gmail.com on 17 Dec 2014 at 10:26

GoogleCodeExporter commented 9 years ago
Thanks for the reply, avrah! My aim is to crawl the domain 
"http://www.sakshi.com" and extract all the iframe code, base64 code, etc., 
but only if they are present. I am quite sure that this domain contains iframes, 
but I am not sure about the rest (base64, embed codes).

Original comment by yenumula...@gmail.com on 17 Dec 2014 at 12:06

GoogleCodeExporter commented 9 years ago
The way you are trying to do it, it seems that it will take the iframe URLs and 
put them into the list of URLs for the page. That seems OK, but I am not sure 
this is what you want.
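
For context on why this happens: startElement is a SAX callback, so the parser calls it once per opening tag and supplies the arguments itself; it is not normally invoked by hand from visit(). The sketch below illustrates the callback pattern with the JDK's own SAX parser on a small well-formed snippet. The class name IframeSrcHandler and the parse helper are hypothetical and not part of crawler4j, and real-world HTML usually needs a more tolerant parser than the strict XML one used here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the src attribute of every <iframe> it sees.
public class IframeSrcHandler extends DefaultHandler {
    private final List<String> sources = new ArrayList<>();

    // Called by the SAX parser for each opening tag; never called manually.
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("iframe".equalsIgnoreCase(qName)) {
            String src = attributes.getValue("src");
            if (src != null) {
                sources.add(src);
            }
        }
    }

    public List<String> getSources() {
        return sources;
    }

    // Assumption: the input is well-formed XML/XHTML; malformed HTML will fail here.
    public static List<String> parse(String xhtml) {
        IframeSrcHandler handler = new IframeSrcHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.getSources();
    }
}
```

The parser drives the handler, which is why there are no sensible values to pass to startElement from visit().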

I think the best way for you to do it (if I understand your requirement) is to 
use the visit() method, where you have the HTML code of every visited page, 
and extract the iframe code from the HTML string.
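
A minimal sketch of that string-based approach, using only java.util.regex. It assumes the page HTML is already available as a String (in crawler4j this would typically come from HtmlParseData.getHtml() inside visit()); the class IframeExtractor is a hypothetical helper. Regular expressions are fragile on malformed markup, so treat this as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls complete <iframe ...>...</iframe> elements out of an HTML string.
public class IframeExtractor {
    // Case-insensitive; DOTALL lets the element span several lines.
    private static final Pattern IFRAME = Pattern.compile(
            "<iframe\\b[^>]*>.*?</iframe\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractIframes(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IFRAME.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

Each returned string is the full iframe markup, which can then be written to a file or inspected for its src attribute.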

Does this help?

Original comment by avrah...@gmail.com on 17 Dec 2014 at 12:26

GoogleCodeExporter commented 9 years ago
Exactly! Extracting iframes from the HTML string is what I tried before 
posting the issue, and I have attached the code that extracts iframes and saves 
the iframe code to a text file. The problem is that I know an iframe starts with 
an <iframe tag and ends with an </iframe> tag, but for base64 code, VB scripts, 
and embed codes I do not understand how they start and end in HTML. That is 
why I am trying the HtmlContentHandler class. Can you please help with that?
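
As a rough guide, assuming the usual HTML conventions: base64 payloads most often appear inside data: URIs (for example data:image/png;base64,...), VBScript lives in <script> elements whose opening tag names vbscript in its language or type attribute, and embedded objects use <embed> tags or <object>...</object> elements. The sketch below searches an HTML string for those three patterns with java.util.regex. EmbeddedCodeFinder and its method names are hypothetical, and the patterns are deliberately simple, so they will miss obfuscated or unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds common "embedded code" constructs in an HTML string.
public class EmbeddedCodeFinder {
    // data: URIs carrying a base64 payload, e.g. data:image/png;base64,iVBORw0...
    private static final Pattern DATA_URI = Pattern.compile(
            "data:[\\w.+-]+/[\\w.+-]+(?:;[\\w=-]+)*;base64,[A-Za-z0-9+/]+=*",
            Pattern.CASE_INSENSITIVE);

    // <script> elements whose opening tag mentions vbscript (language= or type=).
    private static final Pattern VBSCRIPT = Pattern.compile(
            "<script\\b[^>]*vbscript[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // <embed ...> tags and <object ...>...</object> elements.
    private static final Pattern EMBED = Pattern.compile(
            "<embed\\b[^>]*>|<object\\b[^>]*>.*?</object\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static List<String> findAll(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static List<String> base64DataUris(String html) {
        return findAll(DATA_URI, html);
    }

    public static List<String> vbScripts(String html) {
        return findAll(VBSCRIPT, html);
    }

    public static List<String> embeds(String html) {
        return findAll(EMBED, html);
    }
}
```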

Original comment by yenumula...@gmail.com on 17 Dec 2014 at 12:44

Attachments:

GoogleCodeExporter commented 9 years ago
To parse iframes, use these:
http://stackoverflow.com/questions/13646163/how-to-get-body-holding-the-content-of-iframe-in-java

http://stackoverflow.com/questions/26515383/jsoup-not-parsing-iframe-out-of-html

To parse anything else I need a solid example scenario: give me a URL with 
that code and I will see how to parse it.

Without an example you can't even check whether it works.

Original comment by avrah...@gmail.com on 17 Dec 2014 at 12:49

GoogleCodeExporter commented 9 years ago
Invalid, as the discussion stopped and the need is probably gone.

Original comment by avrah...@gmail.com on 22 Jan 2015 at 11:42