Hopefully these answers will help:
I provide the crawler with a list of URLs in a text file.
<startURLs>
<urlsFile>/path/to/file/with/one/url/per/line.txt</urlsFile>
</startURLs>
The crawler will crawl the URLs, extract the text content, remove all special characters from the text (allowing only alphanumeric characters), and minify the text into a single line.
You can use the ReplaceTransformer to keep only alphanumeric characters and reduce the content to a single line.
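For example, here is a minimal sketch of such a transformer, placed in the importer section of the crawler config. The regexes are my assumption of what "special characters" means for you, and this assumes the 2.x syntax where fromValue is treated as a regular expression:

<importer>
  <postParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <!-- Drop everything that is not a letter, digit, or whitespace. -->
      <replace>
        <fromValue>[^a-zA-Z0-9\s]</fromValue>
        <toValue></toValue>
      </replace>
      <!-- Collapse all whitespace (including line breaks) into single spaces. -->
      <replace>
        <fromValue>\s+</fromValue>
        <toValue> </toValue>
      </replace>
    </transformer>
  </postParseHandlers>
</importer>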
The committer will write the single-line text with the reference URL into a single CSV file.
You'll have to write your own ICommitter implementation for this. It will give you all the control you need to save each document exactly how and where you want.
Are your questions answered?
Thanks for the answer. For the URL list and transformer, I already tested them and they work like a charm. For ICommitter, could you give me a very simple implementation for my reference?
Thanks wiseliu
Sure, you can look at the FileSystemCommitter for ideas, but in reality your case should be relatively simple. You should only have to worry about implementing the add method:
public void add(String reference, InputStream content, Properties metadata)
Since you made the content one line, you should be able to read the "content" argument into a String, then append that string along with the reference argument (your URL) to a CSV file you have created.
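For example, here is a minimal sketch of such a committer. The class name, file path, and CSV quoting are mine, not part of the API, and it assumes the Committer Core 2.x ICommitter interface (add, remove, commit):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.commons.io.IOUtils;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

public class CsvFileCommitter implements ICommitter {

    // Hypothetical output location; point this wherever you want the CSV.
    private final Path csvFile = Paths.get("crawl-output.csv");

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        try {
            // The transformer already reduced the content to a single line.
            String text = IOUtils.toString(content, StandardCharsets.UTF_8);
            String line = "'" + text.replace("'", "''") + "'," + reference + "\n";
            // Append one CSV row per document. Not yet thread-safe; see below.
            Files.write(csvFile, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException("Could not write CSV row for " + reference, e);
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Deletions are not relevant to a one-shot crawl; ignore them.
    }

    @Override
    public void commit() {
        // Each add() writes straight to the file, so there is nothing to flush.
    }
}

You would then reference it from your crawler config with something like:

<committer class="com.example.CsvFileCommitter" />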
Thanks for the pointer. However, since I set my crawler to run with 8 threads, will the committer be able to write the file? The result will be only one CSV file, so will there be a deadlock, or very heavy I/O?
Thanks wiseliu
Unless your crawl is super aggressive, a quick way around this issue is to mark the add method "synchronized". Then only one thread at a time will write to that file. If it becomes too much of a concern for you, you may want to consider using a database or similar instead of a flat file.
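With the hypothetical sketch above, that is just a one-word change to the method:

@Override
public synchronized void add(
        String reference, InputStream content, Properties metadata) {
    // ...same body as before; only one thread can append at a time.
}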
Thank you for your help and pointers; everything now works as intended. All the best to the dev team.
Thanks wiseliu
Dear Dev Team
I'm new to the Norconex HTTP Collector, and will use it for my thesis. The scenario for my crawling is:

1. I provide the crawler with a list of URLs in a text file.
2. The crawler will crawl the URLs, get the text content, remove all special characters from the text (allowing only alphanumeric characters), and minify the text into a single line.
3. The committer will write the single-line text with the reference URL into a single CSV file.

At the end of crawling, the result I want is one big CSV file with all the content and reference URLs. An example of the CSV file is:
'this is one liner string',www.example.com
'this is the second one liner',www.anotherexample.com
'this is last one liner',www.lastexample.com
Could you help me create a crawler configuration, or give me a hint on how to integrate it into a Java program, that suits my scenario? I already tried the example config provided in the tutorial section, but the result format is not suitable for my project. I have also read the docs, but I don't know where to start. Your help will be much appreciated.
The system info for my crawler machine: Norconex http-collector 2.4.1-SNAPSHOT, Windows 8.1 Pro.
Thank you, wiseliu