Closed liar666 closed 8 years ago
Concerning extraction from JSON, I found this post: https://www.norconex.com/how-to-crawl-facebook/ I don't really like the regex option for parsing JSON; I would prefer to work directly with the JSON object. Since this is exemplified in the second piece of code, I should be able to do what I want based on this post. So you can ignore question 2 :)
Question 1: Not sure I understand. You can have a seed file with ANY URLs in it, so those URLs can definitely be queries with arguments. Are you saying the list itself is not known upfront and you want to generate it before a crawl? Right now you cannot plug in a custom "start URL generator", but you can easily achieve the equivalent a few different ways. You can create a dynamic webpage that produces your list of dynamic URLs and have that page be your start URL. You can also use an external process of your own to create the seed file with the URLs you want before launching the crawler (you can automate running one before the other via simple shell scripting).
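The second approach above can be sketched with a small standalone program run before the crawler. This is a hypothetical example: the endpoint, query parameters, and search terms are made up for illustration, and the seed file is just one URL per line, which a plain-text start-URLs file can consume.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Generates a seed file of parameterized query URLs before a crawl.
// Endpoint and parameter names are invented for the example.
public class SeedFileGenerator {

    static List<String> buildSeedUrls(String[] terms) {
        List<String> urls = new ArrayList<>();
        for (String term : terms) {
            urls.add("https://api.example.com/search?q=" + term + "&page=1");
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // One URL per line, the format a plain-text seed file expects.
        List<String> urls = buildSeedUrls(new String[] {"foo", "bar"});
        Files.write(Paths.get("seeds.txt"), urls);
    }
}
```

You would then point the crawler configuration at `seeds.txt` (or schedule this program ahead of the crawl in a shell script).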
Question 2: Ignored :-)
Question 3: Yes, but URL extraction is done as a separate step from "importing" (since sometimes you want to extract and follow URLs in a document you otherwise don't want to import). For this, you can implement your own link extractor. By default, GenericLinkExtractor is used, but you can create your own ILinkExtractor implementation. You can define multiple link extractors in the same configuration (e.g., to extract links from both HTML and JSON content).
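A custom JSON link extractor might look like the sketch below. The interface here is a local stand-in mirroring the shape of Norconex's ILinkExtractor (an accept-style method plus a link-extraction method); the real interface lives in the collector-http jar and its exact signatures vary by version, so treat this purely as an illustration. A real implementation would also parse the JSON with a proper parser (e.g., Jackson) rather than the naive URL scan used here to keep the example dependency-free.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for Norconex's ILinkExtractor so this sketch compiles alone.
interface LinkExtractor {
    boolean accepts(String url, String contentType);
    Set<String> extractLinks(InputStream content, String url,
            String contentType) throws IOException;
}

// Pulls URL-looking string values out of JSON responses.
public class JsonLinkExtractor implements LinkExtractor {

    private static final Pattern URL_VALUE =
            Pattern.compile("\"(https?://[^\"]+)\"");

    @Override
    public boolean accepts(String url, String contentType) {
        // Handle only JSON; leave HTML to GenericLinkExtractor.
        return "application/json".equals(contentType)
                || url.endsWith(".json");
    }

    @Override
    public Set<String> extractLinks(InputStream content, String url,
            String contentType) throws IOException {
        Set<String> links = new LinkedHashSet<>();
        StringBuilder json = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(content, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
        }
        Matcher m = URL_VALUE.matcher(json);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Registered alongside GenericLinkExtractor, this lets one extractor handle JSON while the other keeps handling HTML.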
Question 4: Yes it is possible, see previous answer.
Hi again,
I've almost finished writing the crawler for the type of site I described initially (i.e. JSON+HTML), but I still have 2 questions:
3) I've moved the second question to another issue, since it's quite different from the initial question I asked here. See https://github.com/Norconex/collector-http/issues/258
The accept() method on ILinkExtractor is indeed how you tell your extractor to handle only JSON URLs (or whatever pattern you choose). If you also use the GenericLinkExtractor, it will only handle HTML pages unless you configure it differently.
In the Importer module, it depends on what you are doing. You can filter out JSON URLs if you do not need to keep them. Then you will only be processing your HTML pages in the Importer module.
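Filtering out JSON references in the Importer module could look roughly like the configuration fragment below. This is a sketch assuming the Importer 2.x reference filter; the class name, handler placement, and regex syntax should be verified against the documentation for your version.

```xml
<importer>
  <preParseHandlers>
    <!-- Exclude JSON responses before parsing so only HTML pages
         are imported. Class name assumed from Importer 2.x. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
      .*\.json.*
    </filter>
  </preParseHandlers>
</importer>
```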
Does that answer your questions?
Closing due to lack of feedback.
Hi,
Since I got very good answers to my previous questions, I'll add some :)
I'm faced with another problem: I have a site that acts like a modern search engine, dynamically generating its HTML results pages from the JSON responses of an API. What the site does behind the scenes, and what you have to reproduce to crawl it, is:
My questions with respect to crawling such a site are: