AnantLabs / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

The discovery of site collections should be initiated in the first crawl cycle even though no content is crawled #85

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Install the Google SharePoint connector 2.0
2. Install the Google Services for SharePoint on the SharePoint server
3. Create a connector instance with the crawl URL of a root site collection that has no content under it
4. Save the configuration and let the connector crawl content

What is the expected output?
The connector should initiate the discovery of all site collections even though no content is crawled.

What do you see instead?
The connector does not create a state file, as no content is available to be crawled.
The discovery of site collections is not initiated, because it is set to run only after one complete traversal cycle.

What version of the product are you using? On what operating system?
SharePoint Connector 2.0

Please provide any additional information below.

Original issue reported on code.google.com by rakeshs101981@gmail.com on 27 Jul 2009 at 6:47

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 27 Jul 2009 at 6:48

GoogleCodeExporter commented 9 years ago
Method: SharepointClient.updateGlobalState(final GlobalState globalState)

if (globalState.isBFullReCrawl() && null != spType) {
    LOGGER.log(Level.INFO, "Discovering Extra webs");
    discoverExtraWebs(allSites, spType);
}

A flag needs to be added so that, when it is the first crawl cycle and
globalState.isBFullReCrawl() is false, the discovery of site collections still
happens.
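The flag-based fix could be sketched as follows. This is a minimal illustration, not the connector's actual code; the class name, the `firstCrawlCycle` field, and the method signature are all assumptions:

```java
// Hypothetical sketch: force discovery of extra webs on the very first
// crawl cycle, even when bFullReCrawl is false. Names are illustrative.
public class DiscoveryGate {
    private boolean firstCrawlCycle = true;  // assumed new flag
    private boolean bFullReCrawl = false;    // mirrors globalState.isBFullReCrawl()

    /** Returns true when discoverExtraWebs() should run this cycle. */
    boolean shouldDiscoverExtraWebs(boolean spTypeKnown) {
        if (!spTypeKnown) {
            return false;  // corresponds to the "null != spType" guard
        }
        // Discover on a full recrawl, or unconditionally on the first cycle.
        boolean discover = bFullReCrawl || firstCrawlCycle;
        firstCrawlCycle = false;
        return discover;
    }
}
```

On the first call the gate returns true regardless of the recrawl flag; on later calls it falls back to the existing `bFullReCrawl` behavior.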

One approach is to initiate this in SharepointTraversalManager.startTraversal()
if the DocumentList returned by doTraversal() is empty (size == 0).

Original comment by rakeshs101981@gmail.com on 27 Jul 2009 at 7:01

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 7 Aug 2009 at 2:19

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 19 Aug 2009 at 4:31

GoogleCodeExporter commented 9 years ago
The above code snippet can be changed as follows to ensure the discovery of
extra webs:
if (doCrawl && null != spType)

The next thing required is the traversal of the discovered sites. Without that,
the number of documents sent to the CM will still be zero, and hence the problem
of checkpoint() not being called and a repetitive call to startRecrawl() will
persist. There has to be a trigger that actually makes the traversal process
fetch docs. The value of nDocuments in SharepointClient could be made use of.

But it will still have one issue: if, in between any batch traversals, this
value is 0, the traversal process will initiate the discovery of new sites.

Owing to the above reasons, there is another alternative:
- Keep the traversal logic in SharepointClient.updateGlobalState() as-is
- Check the size of the SPDocumentList in startTraversal. If it is 0 and the
SharePoint type is 2007 (WSS 3.0 or MOSS):
   * Initiate the discovery of new sites,
   * Update the global state with the newly discovered sites,
   * Call SharepointClient.updateGlobalState() to initiate traversal of the
newly discovered sites

This approach can be less error-prone, as the existing flow of execution is not
directly hampered. But it addresses only one use case, i.e. when the crawl URL
specified is completely empty.
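The alternative above could be sketched roughly as follows. The class, the `SPType` enum, and the stand-in `doTraversal()`/`discoverNewSites()` methods are assumptions for illustration; only the control flow mirrors the proposal:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the startTraversal() fallback: if no documents
// were crawled and the server is SharePoint 2007 (WSS 3.0 or MOSS),
// discover new sites, update the global state, and re-traverse.
public class StartTraversalSketch {
    enum SPType { SP2003, SP2007 }

    private final SPType spType;
    private final List<String> globalStateSites = new ArrayList<>();
    boolean discoveryRan = false;

    StartTraversalSketch(SPType spType) { this.spType = spType; }

    List<String> doTraversal() {
        // Stand-in: returns one "document" per site in the global state.
        return new ArrayList<>(globalStateSites);
    }

    void discoverNewSites() {
        // Stand-in for GSS-based site collection discovery; the URL is fictitious.
        discoveryRan = true;
        globalStateSites.add("http://sharepoint.example.com/sites/newsite");
    }

    List<String> startTraversal() {
        List<String> docs = doTraversal();
        if (docs.size() == 0 && spType == SPType.SP2007) {
            discoverNewSites();       // initiate discovery of new sites
            docs = doTraversal();     // traverse the newly discovered sites
        }
        return docs;
    }
}
```

For an SP2003 server, or whenever the first traversal already returned documents, the existing flow runs untouched, which is what makes this variant comparatively low-risk.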

Another approach could be to discover the extra webs as soon as the traversal
cycle is completed and the number of discovered documents is less than the batch
hint. Currently, the connector waits for the next traversal request to crawl the
newly discovered webs. This will not only solve the current issue, but also
speed up the connector's traversal.

The following changes will be required:

boolean isGSupdated = updateGlobalState(globalState, allSites);
if (doCrawl && null != spType) {
    if (!isGSupdated) {
        discoverExtraWebs(allSites, spType);
        isGSupdated = updateGlobalState(globalState, allSites);
    }
    if (isGSupdated) {
        // <initiate crawling of the newly discovered webs>
    }
}

Original comment by th.nitendra on 1 Sep 2009 at 2:12

GoogleCodeExporter commented 9 years ago
Cases being handled here:
1. The batch hint number of documents has not been discovered, but there are
new sites which have been discovered. Crawl documents until you get the batch
hint number of docs.
2. The batch hint number of documents has not been discovered and no new sites
have been discovered. In such cases, get any new personal/mysites discovered by
GSS, add them to the global state, and crawl them until the batch hint number
of documents is reached.

if (doCrawl && null != spType) {
    // If the first check has passed, it might mean Case 1. If the
    // following if block is skipped, this is Case 1; otherwise it
    // is Case 2.
    if (!isGSupdated) {
        // If this check passed, it means Case 2.
        discoverExtraWebs(allSites, spType);
        isGSupdated = updateGlobalState(globalState, allSites);
    }

    // The following does not care whether the sites were discovered for
    // Case 1 or Case 2. It will simply go ahead and crawl the batch hint
    // number of docs from the new sites.
    if (isGSupdated) {
        // <initiate crawling of the newly discovered webs>
    }
}
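The two-case dispatch above can be exercised in isolation with a runnable sketch. The stub `updateGlobalState()`/`discoverExtraWebs()` methods and the boolean parameters are assumptions standing in for the connector's real classes:

```java
// Runnable sketch of the two-case dispatch. All names are illustrative;
// the stubs replace the connector's real global-state and discovery logic.
public class CrawlDispatch {
    boolean extraWebsDiscovered = false;

    boolean updateGlobalState(boolean newSitesKnown) {
        // Stub: returns true when the global state gained new sites.
        return newSitesKnown;
    }

    void discoverExtraWebs() {
        extraWebsDiscovered = true;
    }

    /** Returns true when crawling of newly discovered webs should start. */
    boolean dispatch(boolean doCrawl, boolean spTypeKnown, boolean newSitesKnown) {
        boolean isGSupdated = updateGlobalState(newSitesKnown);
        if (doCrawl && spTypeKnown) {
            if (!isGSupdated) {
                // Case 2: no new sites yet, so discover them first.
                discoverExtraWebs();
                isGSupdated = updateGlobalState(true);
            }
            // Case 1 or Case 2: crawl batch-hint docs from the new sites.
            return isGSupdated;
        }
        return false;
    }
}
```

Note that in Case 1 (new sites already known) the discovery step is skipped entirely, while in Case 2 discovery runs before crawling begins; in both cases the crawl of new webs is triggered in the same place.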

Original comment by rakeshs101981@gmail.com on 2 Sep 2009 at 3:10

GoogleCodeExporter commented 9 years ago
Fix details:

http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=384

Original comment by rakeshs101981@gmail.com on 5 Nov 2009 at 8:59

GoogleCodeExporter commented 9 years ago
Verified in 2.4 Release

Original comment by ashwinip...@gmail.com on 14 Dec 2009 at 6:36