Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Store crawl result in single csv file with reference URL #248

Closed. wiseliu closed this issue 8 years ago.

wiseliu commented 8 years ago

Dear Dev Team

I'm new to the Norconex HTTP Collector and will be using it for my thesis. The scenario for my crawl is:

  1. I provide the crawler with a list of URLs in a text file.
  2. The crawler crawls the URLs, extracts the text content, removes all special characters (keeping only alphanumeric characters), and collapses the text into a single line.
  3. The committer writes the single-line text, along with its reference URL, into a single CSV file.

At the end of the crawl, I should get one big CSV file with all the content and reference URLs. An example of the CSV file:

'this is one liner string',www.example.com
'this is the second one liner',www.anotherexample.com
'this is last one liner',www.lastexample.com

Could you help me create a crawler configuration, or give me a hint on how to integrate it into a Java program to suit this scenario? I already tried the example config provided in the tutorial section, but the result format is not suitable for my project. I have also read the documentation, but I don't know where to start. Your help will be much appreciated.

System info for my crawler machine: Norconex HTTP Collector 2.4.1-SNAPSHOT, Windows 8.1 Pro.

Thank you,
wiseliu

essiembre commented 8 years ago

Hopefully these answers will help:

i provide the crawler with list of URLs in a text file

<startURLs>
    <urlsFile>/path/to/file/with/one/url/per/line.txt</urlsFile>
</startURLs>

the crawler will crawl the URLs, get the text content, remove all special character from text(only allow alphanumeric characters), and minify the text into single line text

You can use the ReplaceTransformer to keep only alphanumeric characters and collapse the content into a single line.
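
For instance, something along these lines should do it, assuming the Importer 2.x configuration schema (the regular expressions here are illustrative, not tested):

<importer>
    <postParseHandlers>
        <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
            <!-- Collapse line breaks and runs of whitespace into single spaces. -->
            <replace>
                <fromValue>\s+</fromValue>
                <toValue> </toValue>
            </replace>
            <!-- Strip everything that is not alphanumeric or a space. -->
            <replace>
                <fromValue>[^a-zA-Z0-9 ]</fromValue>
                <toValue></toValue>
            </replace>
        </transformer>
    </postParseHandlers>
</importer>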

committer will write the single line text with reference URL into a single csv file

You'll have to write your own ICommitter implementation for this. That will give you all the control you need to save each document exactly how and where you want.

Are your questions answered?

wiseliu commented 8 years ago

Thanks for the answer. I already tested the URL list and the transformer, and they work like a charm. For ICommitter, could you give me a very simple implementation for reference?

Thanks,
wiseliu

essiembre commented 8 years ago

Sure, you can look at the FileSystemCommitter for ideas. But in reality, your case should be relatively simple. You should only have to worry about implementing the add method:

public void add(String reference, InputStream content, Properties metadata)

Since you made the content a single line, you should be able to read the "content" argument into a String, then append that string, along with the reference argument (your URL), to a CSV file you have created.
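
To make that concrete, here is a minimal sketch, assuming the Committer Core 2.x ICommitter interface and Apache Commons IO on the classpath (the class name and output path are made up for illustration):

import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

public class CsvFileCommitter implements ICommitter {

    // Hypothetical output location; change to wherever you want the file.
    private static final String CSV_FILE = "C:\\crawl\\output.csv";

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        // Open in append mode so each crawled document adds one CSV row.
        try (PrintWriter out = new PrintWriter(new FileWriter(CSV_FILE, true))) {
            // The Importer already reduced the content to a single line.
            String text = IOUtils.toString(content, StandardCharsets.UTF_8);
            out.println("'" + text + "'," + reference);
        } catch (IOException e) {
            throw new RuntimeException("Could not write to CSV file.", e);
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Nothing to do: rows are never deleted from the flat file.
    }

    @Override
    public void commit() {
        // Nothing buffered: each add() writes straight to disk.
    }
}

You would then point your crawler configuration at it with something like <committer class="com.example.CsvFileCommitter"/> (package name hypothetical), making sure the compiled class is on the Collector's classpath.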

wiseliu commented 8 years ago

Thanks for the pointer. However, since I set my crawler to run with 8 threads, will the committer be able to write to the file? The result will be only one CSV file, so will there be a deadlock, or very heavy I/O?

Thanks,
wiseliu

essiembre commented 8 years ago

Unless your crawl is super aggressive, a quick way around this issue is to mark the add method "synchronized". Then only one thread at a time will write to that file. If it becomes too much of a concern, you may want to consider using a database or similar instead of a flat file.
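
Applied to the sketch above, that is a one-keyword change:

// Only one crawler thread at a time can now append to the shared CSV file.
@Override
public synchronized void add(String reference, InputStream content, Properties metadata) {
    // ... same body as in the earlier sketch ...
}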

wiseliu commented 8 years ago

Thank you for your help and pointers; everything now works as intended. All the best to the dev team.

Thanks,
wiseliu