Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to effectively follow links; Extract contents out of binary files #610

Closed joettt closed 4 years ago

joettt commented 5 years ago

Hello Pascal,

I have a list of about 180,000 user profiles, with each paginated list page displaying 10 user profiles at a time, that I want to index. Each user list navigation page is tagged as noindex,follow. At the bottom of each page there are links to navigate to Page 2, Page 3, etc., each navigation link having an offset value of 10, 20, 30, and so on, for example https://gcconnex.gc.ca/members?offset=10#. The crawler seems to follow all of these navigation pages first before getting to the actual user profile pages that I want to index. Due to some performance issues we are currently facing, it takes forever to index the actual profile pages, and a lot of the crawler's time is spent following the navigation page links and rejecting them, since they are not supposed to be indexed anyway.

Question 1: Can I force the crawler to first index the user profile pages linked from the navigation page https://gcconnex.gc.ca/members?offset=10# before it follows the next navigation page https://gcconnex.gc.ca/members?offset=20#? Is there a more efficient way (other than manually feeding all the profile page links to the crawler) to index the profile page contents?

Please see the attached config file.

Question 2: For web pages, I am cherry-picking the contents out of the DOM. How do I pick up the document properties and contents from PDF, RTF, ZIP, and Microsoft Office documents? config.txt

Thank You in advance for your time.

essiembre commented 5 years ago

Question 1: I could not check your example URLs; they appear to be down when I access them. The crawler performs a breadth-first traversal. For non-rejected pages, each thread extracts the links found on a page and stores them in a queue. When done processing the page, it grabs the next link from the queue (FIFO), and so on. As for the order in which links within a page are followed, it goes with the order in which the links are discovered. If you find it slow, it may be because you have only 2 threads (the default); increase the number of threads. With a delay of 0.1 seconds, though, the performance problems could also be caused by your site not being able to handle that many requests per second.
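
To illustrate, a minimal sketch of the relevant crawler settings in the HTTP Collector XML configuration (the values below are only examples, not recommendations for your site):

```xml
<crawler id="example-crawler">
  <!-- Number of threads crawling in parallel (default is 2). -->
  <numThreads>6</numThreads>
  <!-- Minimum delay between requests, in milliseconds; increase it if the site cannot keep up. -->
  <delay default="1000" />
</crawler>
```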

Question 2: Document properties and content are extracted automatically for any document (text or binary). Are you asking how you can limit what is extracted? If so, for properties, have a look at KeepOnlyTagger. For keeping only parts of the content, have a look at the transformers available in the Importer module, such as StripBetweenTransformer and ReplaceTransformer.
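
For example, a sketch of limiting the extracted properties with KeepOnlyTagger (the field names below are only placeholders):

```xml
<importer>
  <postParseHandlers>
    <!-- Keep only the listed metadata fields; all other extracted fields are dropped. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="title,description,document.reference" />
  </postParseHandlers>
</importer>
```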

Does that answer your questions?

Please, in the future, open separate tickets for separate questions.

joettt commented 5 years ago

Thanks for your response. Point noted, and I will limit myself to one question per ticket in the future.

By increasing the number of threads, crawling is much faster now. Regarding the second question, what I was asking is how to capture the contents and meta tag values from binary files. For text files, I am able to pick up the contents of the "main" selector out of the DOM and save them in the field tbs_content with a DOMTagger configuration. But I don't know the name of a selector for the binary files whose contents I would also like to save in the tbs_content field. I want to do the same for meta tags and save their contents in the appropriate meta tag fields, e.g., the title should be saved in the field tbs_title. I have looked at StripBetweenTransformer examples online, but they are about stripping text from text files. I am looking for an example where I can retrieve the text from binary files and store the data in the desired fields.
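
For reference, the text-file configuration described above would look roughly like this (a sketch assuming the Importer's DOMTagger; the selector "main" and the field tbs_content come from the description, and the exact attributes may differ):

```xml
<importer>
  <preParseHandlers>
    <!-- Copy the text of the "main" element into the tbs_content field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="main" toField="tbs_content" />
    </tagger>
  </preParseHandlers>
</importer>
```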

essiembre commented 5 years ago

I think I get what you are trying to do. Once a document goes through "parsing", it gets transformed into raw text (no formatting) plus metadata. Before parsing, documents are in their native format, and most operations meant to be applied to text only are not really applicable. Since there is no <restrictTo ...> by default, you may need to add one to your text-only handlers. Otherwise, if you apply handlers that deal with text only to binary files, you could corrupt those files and they will not be parsed properly.
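
As a sketch, restricting the DOM-based handler above to HTML pages could look like this (the content-type value is a regular expression and is only an example):

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Only apply this text-only handler to HTML documents. -->
  <restrictTo caseSensitive="false" field="document.contentType">text/html</restrictTo>
  <dom selector="main" toField="tbs_content" />
</tagger>
```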

So for binary files, you are best to handle them in <postParseHandlers>. You will then get all the metadata already extracted, but the content will be plain text without formatting. So DOMTagger is not applicable in this context (the extracted content is not HTML/XML).

So I am afraid you will have to rely on text patterns to extract what you want from binaries.
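
A rough sketch of that approach, assuming the Importer's TextPatternTagger and RenameTagger (the regular expression and field names are only placeholders):

```xml
<importer>
  <postParseHandlers>
    <!-- Pull text matching a pattern out of the parsed (plain-text) content into a field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="tbs_content">(?s)Summary:.*</pattern>
    </tagger>
    <!-- Copy the already-extracted title metadata into the desired field name. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="title" toField="tbs_title" overwrite="true" />
    </tagger>
  </postParseHandlers>
</importer>
```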

essiembre commented 5 years ago

@joettt, I could not find your email in your GitHub profile. Do you mind sharing it with me (you can email it to me using the email on my profile)?

joettt commented 5 years ago

Hi Pascal, I have added my email address to my GitHub profile.