Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Reference filters #707

Closed (bmfirst closed this issue 3 years ago)

bmfirst commented 4 years ago

Hi Pascal,

First off, thank you for the excellent software.

I want to crawl a very large site (10M+ pages) and I want to avoid all the search query links (containing ?, multiple keywords) and all profile links to speed things up. I want only video links (site.com/video*/) to be crawled and saved. I was following your guide from one of the other issues, but without success.

These are my reference filters:

<!-- Before download: -->
<referenceFilters>
    <!-- Include video urls: -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/video*
    </filter>
    <!-- Include only URLs with one segment, no question mark: -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        ^https?:\/\/([^\/]+?)(\/)([^\/\?]+)$
    </filter>
</referenceFilters>

When I run the minimal command nothing is crawled, the system finishes fast (zero results), and I get this error:

REJECTED_FILTER: https://www.site.com (No "include" reference filters matched.)

Can you please help me? What am I doing wrong?

Best regards

bmfirst commented 4 years ago

Corrected code:


<referenceFilters>
    <!-- Include seo urls: -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/video*
    </filter>
    <!-- Include only URLs with one segment, no question mark: -->
</referenceFilters>

essiembre commented 4 years ago

What is your startURL? If it does not have the "video" portion in it, it will be rejected right away and will never get to your videos.

If you want links on all pages to be "followed" but only have the video ones kept, I suggest you replace referenceFilters with documentFilters. See, for links in a page to be extracted and followed, that page has to be downloaded first. Reference filters occur before documents are downloaded.

This flow may give you additional insights: https://norconex.com/collectors/collector-http/flow

bmfirst commented 4 years ago

Dear Pascal, thank you for your reply.

My start URL is below:

<startURLs>
    <url>https://www.dailymotion.com</url>
</startURLs>

Typical video link is https://www.dailymotion.com/video/x7uz46f

I want only videos to be crawled/saved in order to make the process as fast as possible. I read about DocumentFilters but I would like to avoid crawling all pages if that is possible (but if that is necessary in order to index all videos, then I would prefer to index all pages). I used the ExactMatch option but then the process got stuck at the beginning, which should be solved by adding video to the start URL as you wrote. Should it contain * to include all links? I'm not sure why there are so many slashes in the expression.

Also, do you have any advice on how to speed up the process besides reducing the delay time (I do not want to get banned :))? Maybe by using more threads?

Regards

essiembre commented 4 years ago

When setting up those kinds of filters, you have to distinguish between two needs:

  1. The need to "follow" links to get to the pages you are interested in.
  2. The need to only "keep" the pages you are interested in.

You cannot filter everything out via reference filters as you will not make it to your desired pages. So...

To follow links in a page without keeping it: For links to be followed they have to be extracted from a page. That means the page containing the links has to be downloaded. That rules out referenceFilters since those are actioned on URLs before the page is downloaded.

Reference filters are the fastest way to eliminate pages you do not want, assuming you do not need to follow their URLs. They are at their best when you can use them to make crawls much faster by having less content to download.

When following links, you have to use something else to filter, like documentFilters. These will filter out matching URLs after their pages have been downloaded (and links extracted). In your case, it would be a filter rejecting all pages with a URL not containing "/video/".

I hope it makes things a bit clearer. If not, please share what you've tried with a config to reproduce.

bmfirst commented 4 years ago

Hi Pascal, thank you for reply.

Then I need all links to be extracted, so I will go only with documentFilters, as below:

<!-- After download and after links are extracted: -->
<documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/video/.*
    </filter>
</documentFilters>

I have been running into some awkward sitemaps. When I also change the start URL to include /video/ I get:

INFO [StandardSitemapResolver] Resolving sitemap: https://www.dailymotion.com/map-videos-deleted-000023.xml.gz
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://www.dailymotion.com/map-videos-deleted-000023.xml.gz -- Likely an invalid sitemap XML format causing a parsing error (actual error: Invalid UTF-8 start byte 0x8b (at char #2, byte #-1)).

My config file is attached: minimum-config.txt

Can you please point me in the right direction? And could you be kind enough to tell me how to speed up the process (I don't want to be crawling this site for a year :))?

Regards

essiembre commented 4 years ago

Thanks for your config. I was able to reproduce. It turns out the content type returned by the website for the compressed sitemap files is not accurate: it does not indicate gzip compression as it should. I modified the sitemap resolver to also check the URL extension of a sitemap in addition to the content type, and I could confirm it now works fine.

That brought to my attention that the site appears to offer sitemaps that may already contain what you are looking for. Look at the bottom of the https://www.dailymotion.com/robots.txt file and you will see all sitemap indices offered by the site:

https://www.dailymotion.com/map-videos-latest.xml
https://www.dailymotion.com/map-videos-default.xml
https://www.dailymotion.com/map-videos-deleted.xml
https://www.dailymotion.com/map-topics-strategic.xml
https://www.dailymotion.com/map-channels-strategic.xml
https://www.dailymotion.com/map-pages-default.xml

I suggest you download/extract a few and have a look. Do the sitemaps seem reliable to you? If so, you have no need to crawl everything (i.e. extracting URLs and following them). You can limit it to the sitemaps only, with <maxDepth> being 0. You can use <sitemap> instead of <url> in your start URLs. Then you can use the reference filters safely since you do not have to crawl everything just to "discover" those URLs anymore. It can possibly make your job much easier.
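
A minimal sketch of that setup could look like the following (2.x syntax; the sitemap URL is just one of the indices listed above and the regex is illustrative, so adjust both to your needs inside your existing crawler section):

<!-- Start from a sitemap index instead of a regular page. -->
<startURLs>
    <sitemap>https://www.dailymotion.com/map-videos-latest.xml</sitemap>
</startURLs>

<!-- Keep only the sitemap-provided URLs; do not follow extracted links. -->
<maxDepth>0</maxDepth>

<!-- Now safe: every candidate URL comes from the sitemap, not link discovery. -->
<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/video/.*
    </filter>
</referenceFilters>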

I made a new 2.9.1-SNAPSHOT release with the fix for this error. Please give it a try and confirm.

bmfirst commented 4 years ago

Hi Pascal, thank you very much, but we are deviating a bit from the subject.

I want to crawl metacafe.com; a typical link is https://www.metacafe.com/watch/12101289/riding-in-a-boat-on-a-wild-river/

So I will crawl everything and just save the videos. Can you please tell me if the following will do the job?


<documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/watch/.*
    </filter>
</documentFilters>

Can you please tell me how to speed things up?

essiembre commented 4 years ago

Assuming this site does not offer a sitemap.xml, a few pointers I can think of:

bmfirst commented 4 years ago

Hi Pascal, thank you :)

Can you please help me: if I want to crawl a site whose video links look like watchID (where ID is a unique number for each video), would the filter for saving only videos be as below?

<documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/watch*/.*
    </filter>
</documentFilters> 

Also, do you have some quick formula to calculate how long it would take to crawl 1 million pages?

Can you please tell me, is there a way to save only a list of links without any other data or logs (do you recommend the SQL Committer)?

Regards

essiembre commented 4 years ago

About the time it takes, there is no such formula as you usually do not control all the pieces, such as network latency, bandwidth capacity/congestion, website possibly throttling requests, website throughput capacity, varying sizes of files crawled, etc. That being said, you can look at the logs when the crawl is in progress and see the elapsed time so far and how many documents were processed. You can extrapolate to obtain an estimate. Just know the accuracy of the time you get may vary greatly.
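
For example, with purely illustrative numbers: if the logs show about 50,000 documents processed after 2 hours, that is roughly 25,000 documents per hour, so 1 million pages would take on the order of 40 hours at that pace, assuming the crawl rate stays roughly constant.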

For keeping only the fields you want, you can use the Importer handler called KeepOnlyTagger.

bmfirst commented 4 years ago

Thank you very much for the reply and support Pascal.

Do you mind just telling me whether the filter to save only videos is correct in the case of links being /watchID/?

<documentFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*/watch*/.*
    </filter>
</documentFilters> 

I did not find the field name for the link; is it "href"?

<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>href</fields>
</tagger>

Regards

essiembre commented 4 years ago

Your regular expression is not valid for what you want to do. It should rather be: .*/watch.*/.* (you were missing a dot: watch* means "watc" followed by zero or more "h" characters, while watch.* means "watch" followed by anything, such as the video ID). I recommend you test your regular expressions before trying them in a crawl. For example: https://regex101.com/r/t2FYR8/1

You can reference the document URL using document.reference.
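
Putting that together with your earlier snippet, a hedged version of the KeepOnlyTagger using this field would be:

<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <!-- Keep only the document URL; every other extracted field is dropped. -->
    <fields>document.reference</fields>
</tagger>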

Have a look at the DebugTagger. You can insert it anywhere as a pre or post-parse handler in the importer module to print out fields gathered at that point for a document.
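
As a rough sketch (attribute names as I recall them from the 2.x Importer documentation, so double-check them there), used here as a post-parse handler:

<importer>
    <postParseHandlers>
        <!-- Logs the listed fields for each document reaching this point. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                logFields="document.reference" logLevel="INFO" />
    </postParseHandlers>
</importer>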

bmfirst commented 4 years ago

Hi Pascal,

Thank you very much for the reply.

I'm getting the error that stayOnSite="true" is not allowed in startURLs. I also have stayOnDomain="true" defined; is that redundant when I use stayOnSite?

Can you clarify what "threads" relate to? Must my CPU have the same number of threads as I defined in the config?

Also, can the log file save only the document references (the URL links), not all the progress logs?

Regards

essiembre commented 4 years ago

Sorry for the misleading advice: the stayOnSite flag does not exist. I meant to say "make sure you stay on site". Actual options for this are here.
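
For illustration (attribute names as on the 2.x <startURLs> tag; the values and the site are just examples taken from earlier in this thread):

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <url>https://www.metacafe.com</url>
</startURLs>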

Your CPUs do not have to match your number of threads. Obviously though, the more CPUs you have the more threads they should be able to handle.
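
On the speed side of the earlier question, the two knobs discussed in this thread sit in the crawler configuration roughly as below (illustrative values; more threads and a shorter delay both speed things up but are harder on the target site):

<numThreads>4</numThreads>
<!-- Politeness delay in milliseconds between requests to the same site. -->
<delay default="2000" />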

You can control what gets written in the log files by modifying the log levels in the log4j.properties file.
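
As an illustration only, and adapting the logger names to the entries already present in the shipped log4j.properties, per-logger levels follow the usual log4j 1.x syntax:

# Example log4j 1.x entries; adjust loggers and levels to taste.
log4j.logger.com.norconex.collector.http=INFO
log4j.logger.com.norconex.importer=WARN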

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.