Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Get/Set current dateTime in config.xml #704

Closed: stds1a28 closed this issue 4 years ago

stds1a28 commented 4 years ago

Hi everyone,

I am a beginner with the Norconex HTTP Collector. I modify the HTML, execute run.bat, and use Solr to check the result.

It seems that I have to change the "crawler id" each time, otherwise the data gets overwritten, so I tried to use the current dateTime as the id.

This is what I tried: crawler id="datetime()". But I have no idea how to write this in config.xml; the "${variable}" just becomes a literal String.

After googling, CurrentDateTagger may be an option, but I don't know how to call the API. Can anyone give me an example?

stds1a28 commented 4 years ago

@essiembre Hi Sir, I guess you are the boss here, please save my life!

essiembre commented 4 years ago

I would need a better understanding of what you are trying to do. Typically, you want your initial crawl to crawl everything matching your config settings, then subsequent crawls to only capture additions, modifications, and deletions, updating your Solr index accordingly, making sure there are no duplicates.

It seems you want to do the opposite? I.e., always crawl everything and never overwrite previously crawled documents, effectively causing duplicates (multiple instances of the same document in your index, but crawled at different times).

If I understand you correctly, there are several ways to do this. Here are a couple.

Recrawling everything: The easiest way is to wipe out your "workdir" before you start the HTTP Collector. More specifically, the crawl store. That will make it lose the history of whatever was crawled before. Another option is to disable creating checksums for documents, so the crawler has nothing to compare against on each crawl. This can be done by adding this to your crawler section:

<metadataChecksummer disabled="true"/>
<documentChecksummer disabled="true"/>

Keeping all instances of a document: You started well on that one by making the id dynamic. A more robust approach would be to use a UUID. In your importer section (as pre or post handler), you can use the UUIDTagger. If you want to keep track of when each instance of a document was crawled, I recommend using CurrentDateTagger in a separate field (keep UUID as the document reference).
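For example, here is a minimal sketch of what that could look like in your importer section, based on the Importer 2.x handler syntax. The crawl_date field name is only an example, and pointing UUIDTagger at document.reference is one way to make the UUID act as the committed document ID; adjust if your committer is configured with a different sourceReferenceField.

<importer>
  <postParseHandlers>
    <!-- Give each crawled instance a unique reference so earlier
         instances are never overwritten in the index. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
        field="document.reference" overwrite="true" />
    <!-- Record when this instance was crawled, in a separate field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
        field="crawl_date" format="yyyy-MM-dd'T'HH:mm:ss" overwrite="true" />
  </postParseHandlers>
</importer>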

stds1a28 commented 4 years ago

Hi Sir,

Thanks for your reply, and sorry for the confusion. This is my config.xml:

<crawler id= "1" >  
    <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">                                  
        <url>http://localhost/abc.html</url>
       </startURLs>
       <maxDepth>1</maxDepth>

Let's say I want to add a new URL, def.html; my config would be:

       <crawler id= "2" >
        <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
        <url>http://localhost/abc.html</url>                                    
        <url>http://localhost/def.html</url>
       </startURLs>
       <maxDepth>1</maxDepth>

In this case, I have to change the crawler id each time I make a change (add/update/delete).

Or I can do this:

       <crawler id= "1" >
        <metadataChecksummer disabled="true"/>
        <documentChecksummer disabled="true"/>
        <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
        <url>http://localhost/abc.html</url>                                    
        <url>http://localhost/def.html</url>
        </startURLs>
        <maxDepth>1</maxDepth>

In this case, I don't have to change the crawler id. But if I remove abc.html, its data will also be gone.

So it would be better to do this:

<crawler id="...">  <!-- the current dateTime should go here, but I don't know how to import or call the CurrentDateTagger API -->
  <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
    <url>http://localhost/abc.html</url>
    <url>http://localhost/def.html</url>
  </startURLs>
  <maxDepth>1</maxDepth>

In this case, I don't have to change the crawler id, and even if I remove abc.html, its data will still exist.

Am I thinking right?

essiembre commented 4 years ago

You do not need to change the crawler ID when you add or remove URLs in your config; they will be added to or removed from your index on the next crawl.

stds1a28 commented 4 years ago

Can you just share your code for setting the current datetime as the crawler ID? I tried googling "CurrentDateTagger" before but still don't know how to do it. Please, @essiembre!

essiembre commented 4 years ago

I am trying my best to help, but I am still struggling to understand what you are trying to achieve. The crawler ID is not meant to change, so there is no out-of-the-box way to dynamically update it (with a timestamp or anything else).

The only reason I see why you may want to do this is to start with a "fresh" crawl each time. If so, disabling the checksummers as suggested earlier should do the same for you. Another approach is to wipe out your "workdir" before running the crawler. That will be the same as if you never ran it before.

stds1a28 commented 4 years ago

I just added a complex-config.variables file and it works: <crawler id="${dateTime}">
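For anyone trying the same thing: the Collector loads variables from a file sharing the config file's base name with a .variables (or .properties) extension, so complex-config.variables sits next to complex-config.xml and is picked up automatically. A sketch of the file, with an illustrative value:

dateTime = 2021-01-15_103000

Every ${dateTime} in the XML is replaced with that literal value when the config loads. Since the value is static text, it has to be rewritten (by hand or by a wrapper script) with a fresh timestamp before each run.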

But after I create a new dir and run the bat, the "backup" file is missing; there is only the "latest" file in norconex-collector-http-2.9.0\newFileName-output\complex\logs. Which code handles the backup file location?

@essiembre

essiembre commented 4 years ago

Backup-related code is located here.

stds1a28 commented 4 years ago

thank you, boss <3