AnantLabs / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

Re-crawling of documents because of the way the Change Token is handled #137

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
There are two use cases:

First, when the connector enters incremental crawl mode for the very first
time. At that point, the Change Token received from the first WS call is
used to start the incremental crawl. Apart from other relevant changes,
this Change Token also tracks documents that were added after that first WS
call was made, while the initial crawl was still running. As a result, the
connector re-crawls all such documents, assuming they are newly added to
SharePoint. In such cases the connector should ignore every document that
is tracked with change type ADD and whose ID is lower than the highest
docID that the connector crawled during the initial crawl.
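
The filtering proposed for this first case could look roughly like the sketch below. It is only an illustration: `InitialCrawlAddFilter`, `Change`, `ChangeType` and `dropAlreadyCrawledAdds` are hypothetical names, not classes from the connector. It also assumes that document IDs only ever increase, and it treats an ID equal to the highest crawled docID as already crawled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the connector code base.
class InitialCrawlAddFilter {
    enum ChangeType { ADD, UPDATE, DELETE }

    static final class Change {
        final int docId;
        final ChangeType type;

        Change(int docId, ChangeType type) {
            this.docId = docId;
            this.type = type;
        }
    }

    /**
     * Returns the changes that still need crawling. ADD changes for documents
     * whose ID is not above the highest ID seen during the initial crawl are
     * dropped, on the assumption that the initial crawl already sent them.
     */
    static List<Change> dropAlreadyCrawledAdds(List<Change> changes, int highestCrawledId) {
        List<Change> result = new ArrayList<>();
        for (Change c : changes) {
            boolean alreadyCrawled =
                c.type == ChangeType.ADD && c.docId <= highestCrawledId;
            if (!alreadyCrawled) {
                result.add(c);
            }
        }
        return result;
    }
}
```

UPDATE and DELETE changes are passed through untouched, since the initial crawl cannot be assumed to have seen them.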

Second, during the incremental crawl, a document can be modified twice at
some interval, with other documents being changed in between. If the change
log window currently being processed (the window is defined by two
parameters: Change Token and RowLimit) covers only the first change to the
document, and the second change is picked up while processing the next
change log window, the document is crawled twice. Both times the most
recent copy of the document is sent, which is unnecessary.
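
One possible way to avoid the duplicate send in this second case is to remember the last-modified time of the copy that was actually sent, and to skip a later change entry when the document has not changed since. The sketch below is an assumption about how that could be done, not the connector's implementation; `SentVersionTracker`, `shouldSend` and `recordSent` are hypothetical names, and it assumes a reliable last-modified timestamp is available for each document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of the connector code base.
class SentVersionTracker {
    // Document ID mapped to the last-modified time (millis) of the copy last sent.
    private final Map<String, Long> lastSentModified = new HashMap<>();

    /** True if the document was never sent, or has changed since it was last sent. */
    boolean shouldSend(String docId, long currentLastModifiedMillis) {
        Long sent = lastSentModified.get(docId);
        return sent == null || currentLastModifiedMillis > sent;
    }

    /** Records the last-modified time of the copy that was just sent. */
    void recordSent(String docId, long lastModifiedMillis) {
        lastSentModified.put(docId, lastModifiedMillis);
    }
}
```

With this in place, the first change log window sends the most recent copy and records its timestamp. When the second window reports the later change, `shouldSend` returns false because the copy already sent includes that change.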

Original issue reported on code.google.com by th.nitendra on 11 Jan 2010 at 11:44

GoogleCodeExporter commented 9 years ago
There is a third case of folder rename:

Suppose there is a folder hierarchy like this:

- Folder1
  - Folder11
    - Folder111
    - Folder112
  - Folder12
    - Folder121
    - Folder122

Say Folder1 is renamed and the connector starts crawling all the documents
under Folder1 and its subfolders (Folder11, Folder111, Folder112, Folder12,
Folder121 and Folder122). At some point the connector finishes traversing all
the documents in Folder11 and its subfolders and heads for Folder12. By this
time Folder12 has also been renamed. The current traversal (initiated by the
Folder1 rename) is sufficient to reflect the rename of Folder12 as well, but
the connector has no such intelligence. Instead, it completes the current
traversal and then re-traverses Folder12, which is not required at all.
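
The missing check described here could be sketched as follows. This is a hypothetical illustration only: `RenameTraversal`, `markVisited` and `covers` are invented names, and it assumes each folder can be identified by a stable, slash-separated key that does not change when the folder is renamed (for example, one built from folder IDs rather than display names).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not part of the connector code base.
class RenameTraversal {
    private final String rootKey;
    private final Set<String> visited = new HashSet<>();

    RenameTraversal(String rootKey) {
        this.rootKey = rootKey;
    }

    /** Called when the traversal starts processing a folder. */
    void markVisited(String folderKey) {
        visited.add(folderKey);
    }

    /**
     * True if a rename of the given folder will be picked up by this traversal:
     * the folder lies under the traversal root, and neither it nor any of its
     * descendants has been visited yet.
     */
    boolean covers(String renamedFolderKey) {
        boolean underRoot = renamedFolderKey.equals(rootKey)
            || renamedFolderKey.startsWith(rootKey + "/");
        if (!underRoot) {
            return false;
        }
        for (String v : visited) {
            if (v.equals(renamedFolderKey) || v.startsWith(renamedFolderKey + "/")) {
                return false;
            }
        }
        return true;
    }
}
```

In the scenario above, the traversal rooted at Folder1 has visited Folder11 and its subfolders but not Folder12, so `covers` would return true for Folder12 and the extra traversal could be dropped.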

Original comment by th.nitendra on 17 Nov 2010 at 10:21

GoogleCodeExporter commented 9 years ago

Original comment by shashank...@gmail.com on 18 Mar 2011 at 12:05

GoogleCodeExporter commented 9 years ago
This issue is filed as Google issue #6513810

Original comment by tdnguyen@google.com on 18 May 2012 at 12:11