HappyCustomers closed this issue 5 years ago
By start URL, do you mean the domain? If it is not available as an extracted field already, you can derive it from the URL using regular expression. Have a look at ReplaceTagger.
To only have the fields you want, you may want to use the KeepOnlyTagger.
If you first want to rename them to names of your choice, have a look at the RenameTagger.
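A rough sketch of how the two taggers could be chained (untested; the field names are placeholders and the exact tag syntax can vary between Importer versions):

```xml
<preParseHandlers>
  <!-- Rename fields to names of your choice first... -->
  <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
    <rename fromField="document.reference" toField="pageUrl" overwrite="true"/>
  </tagger>
  <!-- ...then keep only the fields you want sent to your committer. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>pageUrl,title,content</fields>
  </tagger>
</preParseHandlers>
```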
Does that answer?
By startURLs I mean the URLs in the startURL tag. It can be a domain or any other start URL used to extract web pages. Examples:
www.xyz.com
www.abc.com/city
As we are going to load hundreds of start URLs using a text file, we need to link the start URLs to the indexed URLs.
I tried checking the ReplaceTagger, but I don't get the startURL in the fromField.
That feature is not present. It could also prove challenging in some cases, for example if a few different start URLs point to the same page.
One workaround would be to somehow automate launching multiple crawlers instead (admittedly maybe less practical) and use the ConstantTagger to define which one is which.
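For instance (untested; the field name and value below are placeholders), each crawler's importer section could carry its own constant identifying the start URL it was launched with:

```xml
<!-- In the importer section of one specific crawler: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
  <constant name="startUrl">http://www.xyz.com/city/</constant>
</tagger>
```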
Also, do you know if you will always have different domains? If so the domain approach suggested before can be used to help you identify each.
How many levels deep do you crawl? Since you have hundreds of start URLs, if you go only 1 level deep, you will get the start URL in a collector.referrer-reference field.
If you crawl deeper and you don't mind getting the start URLs in a post-index SQL query, you can use that field to reconstruct the full crawl path of a document (including the start URL).
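As an illustration of the idea (this is not Collector code; the sample data and the use of collector.referrer-reference as a parent pointer are assumptions), reconstructing a crawl path amounts to walking the referrer chain until you reach a page with no referrer, which is a start URL:

```python
def crawl_path(url, referrer_of):
    """Return the chain start-URL -> ... -> url, given a mapping of
    page URL -> its collector.referrer-reference (None for a start URL)."""
    path = [url]
    seen = {url}
    # Follow parent references; stop at a start URL or on a cycle.
    while referrer_of.get(url) and referrer_of[url] not in seen:
        url = referrer_of[url]
        seen.add(url)
        path.append(url)
    return list(reversed(path))

# Toy data standing in for rows fetched from the committer's table.
referrers = {
    "http://www.xyz.com/city/": None,  # start URL has no referrer
    "http://www.xyz.com/city/aboutus.html": "http://www.xyz.com/city/",
    "http://www.xyz.com/city/product.html": "http://www.xyz.com/city/aboutus.html",
}

# The first element of the returned path is the start URL.
print(crawl_path("http://www.xyz.com/city/product.html", referrers))
```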
If none of the above can work for you, we can make this a feature request.
All the URLs will be from different domains, and 2 levels deep is sufficient for crawling. It would be a great feature if we could store the startURL in MySQL against each indexed page. Thank you
I am marking this as a feature request to store the start URL or the full crawl path to the document (as opposed to just the direct parent).
That said, if none of your URLs share the same domain and you have "stayOnDomain=true", I would look at extracting the domain from the URL using the ReplaceTagger
and storing it in a new field, as suggested before. You will then be able to filter by domain quickly.
Can you please provide me the configuration using the ReplaceTagger?
Here is an example (untested):
...
<importer>
  ...
  <preParseHandlers>
    ...
    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="document.reference" toField="MyCustomDomainField"
          regex="true" wholeMatch="true">
        <fromValue>https?://(.*?)(/.*|:.*|$)</fromValue>
        <toValue>$1</toValue>
      </replace>
    </tagger>
    ...
  </preParseHandlers>
  ...
</importer>
...
Thank you and I will try and get back to you
I tried it. It is working if the start URL is http://www.xyz.com/;
however, if the start URL is http://www.xyz.com/city/, it is extracted as http://www.xyz.com/.
Can this also be taken care of in the above config?
Thanks in advance
I tried the exact config snippet and it worked for me. I am getting this value:
MyCustomDomainField = www.xyz.com
You can try adding a DebugTagger just after to print all fields (or just "MyCustomDomainField"). That may help you troubleshoot.
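Something along these lines (untested; attribute names may vary by Importer version):

```xml
<!-- Logs the listed fields for each document passing through. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
    logFields="MyCustomDomainField" logLevel="INFO"/>
```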
If you can't resolve it, please share your config.
Sorry for not communicating clearly earlier. If the start URL is http://www.xyz.com/city/, then the ReplaceTagger extracts only www.xyz.com and not the full start URL, http://www.xyz.com/city/.
The data in the table must be like this:

Page URLs                                  Start URL
http://www.xyz.com/city/aboutus.html       http://www.xyz.com/city/
http://www.xyz.com/city/product.html       http://www.xyz.com/city/
http://www.xyz.com/city/contactus.html     http://www.xyz.com/city/
Ha... it is just a matter of adjusting your regular expression to match exactly what you want then.
Regular expressions are a very popular way to match text, and you can find plenty of good documentation online if you are not too familiar with them. You can also find various online regular expression testers.
You can try your different text matching use cases there before trying them in the Collector.
To help you with this one, you can try expanding the first match group by changing the regular expression to:
(https?://.*?)(/.*|:.*|$)
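To illustrate what changed (a quick standalone check, not part of the Collector), here is what group 1 of each expression captures for a typical page URL:

```python
import re

# Original expression from the config above vs. the adjusted one.
old = re.compile(r"https?://(.*?)(/.*|:.*|$)")
new = re.compile(r"(https?://.*?)(/.*|:.*|$)")

url = "http://www.xyz.com/city/aboutus.html"
print(old.fullmatch(url).group(1))  # www.xyz.com
print(new.fullmatch(url).group(1))  # http://www.xyz.com
```

Expanding the first group to include `https?://` keeps the scheme in the captured value; you can tweak the expression further in an online tester until it matches exactly the portion you want.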
Hi,
I am using the HTTP Collector and MySQL Committer along with a document fetcher to crawl and index web pages. Everything is working fine; however, I have one requirement where I need to store startURLs along with the other fields for each page, as below
Thank you