Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Dynamically adding pages to the "seeds" (startup pages) & "frontier" (list of pages to crawl) #256

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

Hi,

Since I got very good answers to my previous questions, I'll ask a few more :)

I'm faced with another problem: I have a site that acts like a modern search engine, dynamically generating its HTML results pages from the JSON responses of an API. What the site does behind the scenes, and what you have to reproduce to crawl it, is:

My questions with respect to crawling such a site are:

  1. Is it possible to dynamically generate the startup pages? I know it is possible to read them from a file, but I would prefer to create a generator that reads a file with a bunch of queries, then generates the seeds/start URLs on the fly, like http://api.domain.com/api/script?query=&xxx. Which class should I overload in this case?
  2. Is there a native JSON parser/importer to help me extract the "homepage" attributes of the returned JSON object, or should I write one? (According to what I see here: https://www.norconex.com/collectors/importer/latest/apidocs/ there is none...)
  3. Is it possible to re-introduce the extracted "homepage" URLs into the list of URLs still to be crawled?
  4. Is it possible to create such an interlinked two-step crawler, the first step being the crawling & extraction from the JSON results of "search pages", which inserts URLs for the second step, the crawling & extraction of normal HTML pages?
liar666 commented 8 years ago

Concerning extraction from JSON, I found this post: https://www.norconex.com/how-to-crawl-facebook/ I don't really like the regex option for parsing JSON; I would prefer to work directly with the JSON object. Since this is exemplified in the second piece of code, I should be able to do what I want based on that post. So you can ignore question 2 :)
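
For reference, working with the JSON object directly (rather than with regexes) can be as small as the sketch below, using the org.json library. The "results" and "homepage" field names are only assumptions based on the description above, not the actual API payload:

```java
import java.util.ArrayList;
import java.util.List;

import org.json.JSONArray;
import org.json.JSONObject;

// Sketch: pull the "homepage" value out of each entry of a JSON API response.
// The "results" and "homepage" field names are assumptions; adjust them to
// the actual payload returned by the API.
public class JsonHomepageSketch {

    public static List<String> extractHomepages(String jsonBody) {
        List<String> homepages = new ArrayList<>();
        JSONObject root = new JSONObject(jsonBody);
        JSONArray results = root.optJSONArray("results");
        if (results == null) {
            return homepages;
        }
        for (int i = 0; i < results.length(); i++) {
            JSONObject entry = results.getJSONObject(i);
            String homepage = entry.optString("homepage", null);
            if (homepage != null) {
                homepages.add(homepage);
            }
        }
        return homepages;
    }
}
```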

essiembre commented 8 years ago

Question 1: Not sure I understand. You can have a seed file with ANY URLs in it, so those URLs can definitely be queries with arguments. Are you saying the list itself is not known upfront and you want to generate it before a crawl? Right now you cannot have a custom "start URL generator", but you can easily achieve the equivalent in a few different ways. You can create a dynamic web page that produces your list of dynamic URLs and have that new page be your start URL. You can also use an external process of your own to create the seed file with the URLs you want before launching the crawler (you can automate running one before the other via simple shell scripting).
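
As an illustration of the "external process" option, a minimal sketch (plain Java, no collector API involved) could turn a list of queries into a seed file before the crawler starts. The file names and the query parameter are placeholders:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Reads one query per line from queries.txt and writes one API URL per line
// to seeds.txt, which the crawler can then use as its seed file.
// File names and the "query" parameter are placeholders.
public class SeedFileGenerator {
    public static void main(String[] args) throws IOException {
        Path queries = Paths.get("queries.txt");
        Path seeds = Paths.get("seeds.txt");
        List<String> lines = Files.readAllLines(queries, StandardCharsets.UTF_8);
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(seeds, StandardCharsets.UTF_8))) {
            for (String query : lines) {
                if (query.trim().isEmpty()) {
                    continue;
                }
                out.println("http://api.domain.com/api/script?query="
                        + URLEncoder.encode(query.trim(), "UTF-8"));
            }
        }
    }
}
```

A small shell script can then run this generator and launch the collector one after the other.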

Question 2: Ignored :-)

Question 3: Yes, but URL extraction is done as a separate step from "importing" (since sometimes you want to extract and follow URLs in a document you otherwise don't want to import). For this, you can implement your own link extractor. By default, GenericLinkExtractor is used, but you can create your own ILinkExtractor implementation. You can define multiple link extractors in the same configuration (e.g., when you want to extract links from both HTML and JSON content).
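
For illustration, a bare-bones skeleton of such a JSON-oriented extractor could look like the sketch below. The class and package names are those of the 2.x HTTP Collector, the exact ILinkExtractor method signatures may differ slightly between versions, and the API URL pattern is only a placeholder:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import com.norconex.collector.http.url.ILinkExtractor;
import com.norconex.collector.http.url.Link;
import com.norconex.commons.lang.file.ContentType;

// Skeleton of a link extractor dedicated to JSON API responses.
// Signatures follow the 2.x ILinkExtractor interface; adjust as needed
// for the collector version in use.
public class JsonApiLinkExtractor implements ILinkExtractor {

    @Override
    public boolean accepts(String url, ContentType contentType) {
        // Only handle the JSON API pages; let GenericLinkExtractor
        // keep handling regular HTML pages. Placeholder URL pattern.
        return url.startsWith("http://api.domain.com/api/script");
    }

    @Override
    public Set<Link> extractLinks(InputStream input, String reference,
            ContentType contentType) throws IOException {
        Set<Link> links = new HashSet<>();
        // Parse the JSON body here and add one Link per extracted URL, e.g.:
        // links.add(new Link(someExtractedUrl));
        return links;
    }
}
```

Registered alongside the default GenericLinkExtractor, this extractor would only be asked to handle the API responses, while HTML pages keep going through the generic one.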

Question 4: Yes it is possible, see previous answer.

liar666 commented 8 years ago

Hi again,

  1. If I mix your answers 1 & 3 as follows: write an implementation of ILinkExtractor (as your answer 3 suggests) that does not extract anything but just generates (new) links, wouldn't I obtain a sort of start URL generator like the one I was asking about in question 1?

I've almost finished writing the crawler for the type of site I described initially (i.e. JSON+HTML), but I still have 2 questions:

  1. In fact, the exact working of the site is slightly more complicated than I described initially: the JSON returned in the first step does not contain the full set of results, just a subset, and you have to ask the API for the remaining results by iterating over a URL similar to the initial query, adding "&offset=xxx" to get the next K results after result number xxx (K being fixed to 20 in the API). To do so, in the ILinkExtractor that I wrote, I extracted the "homepage"s from the JSON and added them to the 'links' Set, and I also automatically generated the next API call and added it to the 'links' Set returned by my ILinkExtractor. Unfortunately, I'm not really sure how to make sure the JSON pages are only treated by the ILinkExtractor and the HTML pages only by my (HTML) 'importer' config. For the moment I'm thinking of doing as follows (a small sketch of the offset handling is included at the end of this comment):
    • Add all (JSON+HTML) URLs to the global crawler's
    • Write an accepts() method in my ILinkExtractor that returns true only for JSON 'reference's
    • Add only the (HTML) links to a in my 'importer'. Is that the correct approach?

  2. I've moved the second question to another issue, since it's quite different from the initial question I asked here. See https://github.com/Norconex/collector-http/issues/258
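
Coming back to the offset iteration in point 1, here is a minimal, self-contained sketch of just the pagination step. The page size of 20 and the "offset" parameter name follow the description above; inside extractLinks() of the JSON extractor, the returned URL would simply be added to the links set (e.g. links.add(new Link(nextPageUrl(reference))) with the 2.x Link class):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Helper for the pagination part of the JSON link extractor: given the API
// URL just crawled, build the URL of the next page of results. The page size
// (K=20) and the "offset" parameter name are taken from the description above.
public final class ApiPagination {

    private static final int PAGE_SIZE = 20;
    private static final Pattern OFFSET = Pattern.compile("([?&])offset=(\\d+)");

    private ApiPagination() {
    }

    public static String nextPageUrl(String currentUrl) {
        Matcher m = OFFSET.matcher(currentUrl);
        if (m.find()) {
            int next = Integer.parseInt(m.group(2)) + PAGE_SIZE;
            return m.replaceFirst(m.group(1) + "offset=" + next);
        }
        // First page carried no offset: the next call starts at PAGE_SIZE.
        return currentUrl + "&offset=" + PAGE_SIZE;
    }
}
```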

essiembre commented 8 years ago

The accepts() method on ILinkExtractor is indeed how you tell your extractor to only treat JSON URLs (or whatever pattern of your choice). If you also use the GenericLinkExtractor, it will only handle HTML pages unless you configure it differently.

In the Importer module, it depends on what you are doing. You can filter out the JSON URLs if you do not need to keep them. Then you will only be processing your HTML pages in the Importer module.

Does that answer your questions?

essiembre commented 8 years ago

Closing due to lack of feedback.