Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to store Start URLs #514

Closed HappyCustomers closed 5 years ago

HappyCustomers commented 6 years ago

Hi,

I am using the HTTP Collector and the MySQL Committer along with a document fetcher to crawl and index web pages. Everything is working fine; however, I have one requirement where I need to store the start URLs along with the other fields for each page, as below:

id, Content, Title, Meta Description, Keywords, ImagePath, StartURLs
www.xyz.com/aboutus, aboutus content, xyz company, xyz description, xyz keywords, //imagepath, www.xyz.com
www.xyz.com/careers, careers content, xyz title, xyz description, xyz keywords, //imagepath, www.xyz.com

Thank you

essiembre commented 6 years ago

By start URL, do you mean the domain? If it is not available as an extracted field already, you can derive it from the URL using a regular expression. Have a look at ReplaceTagger.

To only have the fields you want, you may want to use the KeepOnlyTagger.

If you first want to rename them to names of your choice, have a look at the RenameTagger.
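
As an untested sketch only (the field names here are placeholders, and the exact attributes should be double-checked against the tagger documentation for your version), the two taggers could be combined like this:

```xml
<postParseHandlers>
    <!-- Rename extracted fields to names of your choice. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="document.reference" toField="id" overwrite="true"/>
    </tagger>
    <!-- Then keep only the fields you want committed. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="id,title,keywords,description"/>
</postParseHandlers>
```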

Does that answer your question?

HappyCustomers commented 6 years ago

By start URLs, I mean the URLs in the startURLs tag. They can be domains or any other start URLs used to extract web pages. For example:

www.xyz.com
www.abc.com/city

Since we are going to load hundreds of start URLs from a text file, we need to link the start URLs to the indexed URLs.

I tried the ReplaceTagger, but I do not see the start URL available as a fromField.

essiembre commented 6 years ago

That feature is not present. It could prove challenging in some cases, for example if a few different start URLs point to the same page.

One workaround would be to somehow automate launching multiple crawlers instead (admittedly perhaps less practical) and use the ConstantTagger to define which one is which.
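
If you go that route, each crawler's config could tag its documents with a constant, along these lines (untested sketch; the field name startURL and its value are placeholders you would generate per crawler):

```xml
<preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <!-- Hard-code this crawler's start URL as a field on every document. -->
        <constant name="startURL">http://www.xyz.com/city/</constant>
    </tagger>
</preParseHandlers>
```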

Also, do you know if you will always have different domains? If so, the domain approach suggested before can be used to help you identify each.

How many levels deep do you crawl? Since you have hundreds of start URLs, if you go only one level deep, you will get the start URL in a collector.referrer-reference field.

If you crawl deeper and you don't mind getting the start URLs in a post-index SQL query, you can use that field to reconstruct the full crawl path of a document (including the start URL).
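
That reconstruction can be sketched in plain Java (illustrative only, not a Norconex API: the map below stands in for a post-index lookup of each document's collector.referrer-reference value):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrawlPath {

    // Walk the referrer chain from a document up to the URL that has no
    // referrer (the start URL), then reverse so the start URL comes first.
    static List<String> pathTo(String url, Map<String, String> referrers) {
        List<String> path = new ArrayList<>();
        for (String cur = url; cur != null; cur = referrers.get(cur)) {
            path.add(cur);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        Map<String, String> referrers = new HashMap<>();
        referrers.put("http://www.xyz.com/city/aboutus.html",
                      "http://www.xyz.com/city/");
        List<String> path =
                pathTo("http://www.xyz.com/city/aboutus.html", referrers);
        // First element is the start URL: http://www.xyz.com/city/
        System.out.println(path.get(0));
    }
}
```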

If none of the above can work for you, we can make this a feature request.

HappyCustomers commented 6 years ago

All the URLs will be of different domains. Two levels deep is sufficient for crawling. It would be a great feature if we could store the start URL in MySQL against each indexed page. Thank you.

essiembre commented 6 years ago

I am marking this as a feature request to store the start URL or the full crawl path to the document (as opposed to just the direct parent).

That said, if none of your URLs share the same domain and you have stayOnDomain="true", I would look at extracting the domain from the URL using the ReplaceTagger and storing it in a new field, as suggested before. You will then be able to filter by domain quickly.

HappyCustomers commented 6 years ago

Can you please provide me the configuration using the ReplaceTagger?

essiembre commented 6 years ago

Here is an example (untested):

```xml
...
<importer>
    ...
    <preParseHandlers>
        ...
        <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
            <replace fromField="document.reference" toField="MyCustomDomainField"
                     regex="true" wholeMatch="true">
                <fromValue>https?://(.*?)(/.*|:.*|$)</fromValue>
                <toValue>$1</toValue>
            </replace>
        </tagger>
        ...
    </preParseHandlers>
    ...
</importer>
...
```

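To check outside the Collector what that expression captures, here is a small standalone Java sketch (the extractDomain helper is ours for illustration, not a Norconex class) mirroring the wholeMatch replacement:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainExtractor {

    // Same expression as the fromValue above; group 1 is the host.
    private static final Pattern DOMAIN =
            Pattern.compile("https?://(.*?)(/.*|:.*|$)");

    static String extractDomain(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.matches() ? m.group(1) : url; // fall back to the raw URL
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("http://www.xyz.com/aboutus")); // www.xyz.com
        System.out.println(extractDomain("https://www.abc.com:8080/x")); // www.abc.com
    }
}
```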
HappyCustomers commented 6 years ago

Thank you. I will try it and get back to you.

HappyCustomers commented 6 years ago

I tried it, and it works if the start URL is http://www.xyz.com/.

However, if the start URL is http://www.xyz.com/city/, it extracts only http://www.xyz.com/.

Can this also be handled in the above config?

Thanks in advance.

essiembre commented 6 years ago

I tried the exact config snippet and it worked for me. I am getting this value:

MyCustomDomainField = www.xyz.com

You can try adding a DebugTagger just after it to print all fields (or just "MyCustomDomainField"). That may help you troubleshoot.
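
For reference, an untested sketch of that (verify the attribute names against the Importer documentation for your version):

```xml
<preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
        ...
    </tagger>
    <!-- Log the field right after it is set. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="MyCustomDomainField" logLevel="INFO"/>
</preParseHandlers>
```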

If you can't resolve it, please share your config.

HappyCustomers commented 6 years ago

Sorry for not communicating clearly earlier. If the start URL is http://www.xyz.com/city/, the ReplaceTagger extracts only www.xyz.com and not the full start URL, http://www.xyz.com/city/.

The data in the table must look like this:

Page URL                                    Start URL
http://www.xyz.com/city/aboutus.html        http://www.xyz.com/city/
http://www.xyz.com/city/product.html        http://www.xyz.com/city/
http://www.xyz.com/city/contactus.html      http://www.xyz.com/city/

essiembre commented 6 years ago

Ha... it is just a matter of adjusting your regular expression to match exactly what you want, then.

Regular expressions are a very popular way to match text, and you can find plenty of good documentation online if you are not too familiar with them. You can also find various regular expression testers online.

You can try your different text-matching use cases there before trying them in the Collector.

To help you with this one, you can try expanding the first match group by changing the regular expression to:

(https?://.*?)(/.*|:.*|$)
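
Checking these expressions in plain Java (illustrative only; the oneLevelPrefix variant below is our assumption and only applies when every start URL has exactly one directory segment, like http://www.xyz.com/city/):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartUrlRegex {

    // The expanded first group suggested above: keeps the scheme,
    // but still stops at the first "/" after the host.
    static String schemeAndHost(String url) {
        Matcher m = Pattern.compile("(https?://.*?)(/.*|:.*|$)").matcher(url);
        return m.matches() ? m.group(1) : url;
    }

    // Hypothetical variant: keeps exactly one directory level after the
    // host. Only valid if every start URL is of the form http://host/segment/.
    static String oneLevelPrefix(String url) {
        Matcher m = Pattern.compile("(https?://[^/]+/[^/]+/).*").matcher(url);
        return m.matches() ? m.group(1) : url;
    }

    public static void main(String[] args) {
        System.out.println(schemeAndHost("http://www.xyz.com/city/"));
        // http://www.xyz.com
        System.out.println(oneLevelPrefix("http://www.xyz.com/city/aboutus.html"));
        // http://www.xyz.com/city/
    }
}
```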