AnantLabs / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

Connector fails traversing SharePoint document libraries with special characters in document metadata #50

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It seems that this character causes the Google SP connector to abort 
processing the whole document library content with no error.

I prepared a small sharepoint library template that can be used to 
reproduce the problem.

The list template was created with Windows SharePoint Services 3.0 
with Service Pack 1 (12.0.0.6219)

The template contains two dummy files by default:
gfs-sosp2003.pdf
google_problematic_file.pdf

The google_problematic_file.pdf Title property contains the problematic 
character.

The instructions to reproduce the problem are the following:

1. Create a new SharePoint Site Collection with a default website template 
2. Go to the list template gallery for the website that was just created 
and upload the attached template 
3. Create a new list in the website using the template 
4. Set up a Google SP connector (version 1.2 in our case) to crawl this 
location 
5. Force the connector to traverse by clearing the XML status file and 
restarting the service 
6. Check that, although the connector went through the library, no documents 
were reported as crawled in the crawl diagnostic section 
7. Delete the google_problematic_file.pdf file from the library 
8. Force the connector to traverse by clearing the XML status file and 
restarting the service 
9. Check that now the document library and all its content appear in the 
crawl diagnostic section

An easier way to reproduce the problem is to insert this character 
anywhere in a property of an existing document (no need to use the list 
template).

Original issue reported on code.google.com by cyrille....@gmail.com on 20 Aug 2008 at 7:57

GoogleCodeExporter commented 9 years ago

Original comment by mwarti...@gmail.com on 23 Sep 2008 at 3:40

GoogleCodeExporter commented 9 years ago
The issue is reproducible for the special character present in the PDF with the
attached template.
This failure occurs because the SharePoint web service itself fails when trying
to fetch the documents under the given document library.

We have tried to reproduce the same issue by putting different special
characters, including Arabic, Hebrew, Chinese and other Latin characters, in the
document's metadata. However, the connector behaves correctly in these cases and
does not report any exceptions or failures.

Please provide more details about the exact special character (like the
character set) that is causing this failure.

Original comment by amit.per...@gmail.com on 22 Oct 2008 at 2:23

GoogleCodeExporter commented 9 years ago
Hi Cyrille,
Could you please reply with more details about the exact special character 
(like the character set) that is causing this failure?

Regards,
Shashank

Original comment by shashank...@gmail.com on 18 Nov 2008 at 10:12

GoogleCodeExporter commented 9 years ago
Hi Shashank,

I investigated that specific character a little.
It appears to be a vertical tab (ASCII 0x0B), "\v".

It is an ASCII control character that is not valid in XML.

I found many articles online by developers who encountered the same kind of 
problem when parsing XML content with such characters in the data.

They often simply skip the character.
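
As a minimal sketch of that "skip the character" workaround (not the connector's code; the class and method names are hypothetical), the following Java snippet checks whether a fragment parses with the JDK's built-in XML parser and drops the vertical tab, either raw or written as a numeric reference, before parsing again:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
final class VerticalTabDemo {
    /** Returns true if the text parses as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an invalid character
        } catch (Exception e) {
            // I/O or parser configuration problem, unrelated to the content
            throw new IllegalStateException(e);
        }
    }

    /** Drops the raw vertical tab and its decimal/hex numeric references. */
    static String skipVerticalTab(String xml) {
        String rawVt = String.valueOf((char) 0x0B);
        return xml.replace(rawVt, "")
                  .replace("&#11;", "")
                  .replace("&#xB;", "");
    }

    public static void main(String[] args) {
        String title = "<Title>google&#11;problematic</Title>";
        System.out.println("as received: " + parses(title));
        System.out.println("after skip:  " + parses(skipVerticalTab(title)));
    }
}
```

The parser may also print its own "[Fatal Error]" line to standard error for the first input; that comes from the JDK's default error handler, not from this sketch.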

The best article I found is 
http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/ 
which makes things pretty clear.

For the connector, the main issue is that the mere presence of such a character 
in a single document of a SharePoint library prevents the connector from 
traversing the entire content of that document library, with no error reported 
in the logs.

Like you, I also noticed that the SharePoint web service did not like the 
character either. I found the source of the problem thanks to errors when 
trying to fetch the content of my library through the RSS view of SharePoint.

Hope this helps.

Best regards,
Cyril

Original comment by cyrille....@gmail.com on 21 Nov 2008 at 11:43

GoogleCodeExporter commented 9 years ago
I think this is a generic connector manager issue:

http://code.google.com/p/google-enterprise-connector-manager/issues/detail?id=128

Original comment by jl1615@gmail.com on 11 Apr 2009 at 8:21

GoogleCodeExporter commented 9 years ago
This is a typical SharePoint web service issue: the call fails when the 
response contains an invalid XML character.

The CM issue mentioned above is very similar to this one but does not have any 
direct relation to it, because the connector itself is not able to send 
anything to the CM. Hence, there is no point in the CM doing any parsing or 
validation of the data.

The SharePoint connector makes list-level calls to get the documents from a 
SharePoint list/library. If such a call fails, no documents are received, 
whatever the reason for the failure. Here, the cause is the invalid XML 
character in one of the documents' metadata.

Original comment by th.nitendra on 13 Apr 2009 at 2:56

GoogleCodeExporter commented 9 years ago
Since it is a problem with the SharePoint web service, this will not be fixed.

Original comment by rakeshs101981@gmail.com on 29 Oct 2009 at 2:29

GoogleCodeExporter commented 9 years ago
The error should be handled more gracefully and reported appropriately in the 
logs.

Original comment by darsh...@google.com on 29 Oct 2009 at 10:51

GoogleCodeExporter commented 9 years ago
We reproduced this scenario using the list template shared by Cyril.

Connector 2.0.x Behavior:

The web service gets a SAXException due to an invalid XML character in the web 
service response:
org.xml.sax.SAXParseException: Character reference "" is an invalid XML character

Due to the above exception, all the docs after the problematic list get skipped.

This also sets changeToken = null and the total docs to be sent from the list 
to 0. This further implies that all docs from the current list are done, and 
hence the last feed for the list itself is sent with the change token ‘null’. 
(Ideally it would be a value from which we can continue in future batch 
traversals.)

Oct 28, 2009 12:47:05 PM [Traverse sharepoint-connector] 
com.google.enterprise.connector.sharepoint.spiimpl.SPDocumentList nextDocument
INFO: Sending DocID [ {20001A30-5624-4089-B7C4-FB36D1431020} ], docURL [ http://mycompany.com:80/records/mylist/Forms/AllItems.aspx ] to CM for ADD.

• Since the list is marked as done and the change token is ‘null’, you will 
  find that the same docs are re-sent in future batch traversals. For the web 
  service, changeToken=null means you are starting fresh.
• The same problem should hold true for other folders as well.

To be fixed in 2.4 release.

Original comment by rakeshs101981@gmail.com on 31 Oct 2009 at 4:36

GoogleCodeExporter commented 9 years ago
Although the problem cannot be completely solved from the connector end, the
connector must recover from such an exception and progress with the crawl
without any loss of data. There can be three approaches for this.

Approach 1: Do not update the list's state until the web service call starts
succeeding. This means that the crawl will not proceed for the current list and
change detection will never be initiated.
Although this approach is not a complete solution to the problem, it is simple
to implement and has been done as a quick workaround. Refer to
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=415

Approach 2: Skip the current set of documents and update the list's state with
the next expected value of LastDoc, which is going to be (LastDoc + batchHint).
This way, we'll be able to progress with the crawl and change detection. But
the biggest drawback of this approach is that a certain set of documents will
be skipped forever. These documents will not be crawled even if, in future, the
problem at the SharePoint end is resolved and the web service call starts
succeeding.

Approach 3: Split the current batchHint into sub-hints, make multiple web
service calls with smaller batch hints, and locate the problematic document, if
any, using a binary search approach. This way we'll skip only the problematic
documents without any other loss of data. But this may increase the number of
web service calls heavily in worst-case scenarios. Also, this approach is
suitable only for the SAXParseException and should not be followed in other
cases. The implementation is a bit complicated.
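
Approach 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: `Fetcher` is a hypothetical stand-in for the list-level web service call (it is not a connector or SharePoint API), documents are addressed by a simple integer position, and a failed call is assumed to surface as a SAXException.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch of the batch-splitting idea; not the connector's code.
final class BatchBisect {
    /** Stand-in for a web service call returning documents [start, start + count). */
    interface Fetcher {
        List<String> fetch(int start, int count) throws SAXException;
    }

    /**
     * Fetches a batch; when parsing fails, halves the batch and retries each
     * half, so that only single documents that still fail are skipped.
     * Positions of skipped documents are recorded in {@code skipped}.
     */
    static List<String> fetchSkippingBad(Fetcher f, int start, int count,
                                         List<Integer> skipped) {
        List<String> docs = new ArrayList<>();
        if (count <= 0) {
            return docs;
        }
        try {
            docs.addAll(f.fetch(start, count));
        } catch (SAXException e) { // only parse failures trigger the split
            if (count == 1) {
                skipped.add(start); // located one problematic document
            } else {
                int half = count / 2;
                docs.addAll(fetchSkippingBad(f, start, half, skipped));
                docs.addAll(fetchSkippingBad(f, start + half, count - half, skipped));
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // Simulated library of 8 documents where position 5 breaks the response.
        Fetcher simulated = (start, count) -> {
            List<String> out = new ArrayList<>();
            for (int i = start; i < start + count; i++) {
                if (i == 5) {
                    throw new SAXParseException("invalid XML character", null);
                }
                out.add("doc" + i);
            }
            return out;
        };
        List<Integer> skipped = new ArrayList<>();
        List<String> docs = fetchSkippingBad(simulated, 0, 8, skipped);
        System.out.println(docs + " skipped=" + skipped);
    }
}
```

Each failing batch costs two further calls, which is the extra web service traffic the comment warns about; batches that parse cleanly cost a single call.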

Original comment by th.nitendra on 5 Nov 2009 at 8:17

GoogleCodeExporter commented 9 years ago
I vote for Approach #3

Original comment by darsh...@google.com on 6 Nov 2009 at 12:20

GoogleCodeExporter commented 9 years ago
If #3 is going to take a very long time, we can go for #2 and log clearly 
which part of the list was excluded. If even #2 does not fit in the 2.4 
timeline, then we need to have at least #1 in place.

Original comment by darsh...@google.com on 16 Nov 2009 at 10:15

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 9 Dec 2009 at 9:50

GoogleCodeExporter commented 9 years ago

Original comment by j.dars...@gmail.com on 15 Dec 2009 at 11:59

GoogleCodeExporter commented 9 years ago

Original comment by j.dars...@gmail.com on 16 Dec 2009 at 12:00

GoogleCodeExporter commented 9 years ago
A better solution could be to intercept the web service response before it 
reaches the connector. The interceptor will remove all invalid XML characters 
so that the connector can do its job smoothly. Message Handlers come to the 
rescue.

Message Handlers are pluggable components of the web service that can 
intercept the incoming and outgoing SOAP packets to add various 
quality-of-service features. Handlers are to web services what filters are to 
servlets and interceptors are to EJBs. The idea is to write a handler that 
intercepts the SOAP response and checks whether it contains any invalid XML 
characters; if any are found, it replaces all such characters with a fixed 
replacement value. This approach is far cleaner and simpler than the earlier 
approaches that we have discussed. No mathematics or complicated algorithms 
are required.

The solution may seem unfeasible at first because even the handlers work with 
XML. Fortunately, Axis provides its own handler framework (apart from the one 
that JAX-RPC recommends), and these handlers can play their role right up 
front, before anything else is done with the incoming/outgoing packets. This 
also means that these handlers do not necessarily work with XML; they can work 
with strings as well.

All the deployment-related configuration for the handler is done in Axis's 
client-config.wsdd file.

What to replace and, with what?
-------------------------------
The main question for the handler is what it should remove from the response 
and what to put in place of the filtered characters. Ideally, all invalid XML 
characters should be replaced. But doing that would require finding every 
character that can cause parsing to fail. Instead, it would be better if users 
decide at deployment time what should be replaced. Hence, the 
client-config.wsdd contains the following information:

1) patterns to be replaced

2) the replacement value. This will be used with all the replacements
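
How those two settings might be applied to the raw response text can be sketched as below. The thread does not give the actual parameter names or syntax used in client-config.wsdd, so this sketch simply assumes the patterns arrive as a list of regular expressions plus one replacement string; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: applies deployment-time patterns to raw response text.
final class ConfiguredReplacer {
    private final List<Pattern> patterns = new ArrayList<>();
    private final String replacement;

    ConfiguredReplacer(List<String> regexes, String replacement) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        // Quote the value so '$' and '\' in it are taken literally.
        this.replacement = Matcher.quoteReplacement(replacement);
    }

    /** Replaces every match of every configured pattern with the one value. */
    String apply(String rawResponse) {
        String result = rawResponse;
        for (Pattern p : patterns) {
            result = p.matcher(result).replaceAll(replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed configuration: vertical tab as a decimal or hex reference.
        ConfiguredReplacer replacer =
                new ConfiguredReplacer(Arrays.asList("&#11;", "&#x0*[bB];"), " ");
        System.out.println(replacer.apply("<Title>a&#11;b&#xB;c</Title>"));
    }
}
```

Because the replacement runs on a plain string, it works even though the text is not yet parseable as XML.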

With a little more effort, we can make the solution much better. From Issue 50, 
we are already aware of certain invalid XML characters that always cause 
parsing to fail. If we hard-code checks for at least these characters in the 
implementation, we can significantly reduce deployment overhead. As per the 
investigation, the characters that Issue 50 talks about are called character 
references. References with the following integer values are invalid XML and 
cause parsing to fail:

0 to 8

11 to 12

14 to 31

55296 to 57344

65534 and above

The above knowledge can be incorporated into the implementation eliminating the 
need to specify patterns for such invalid references in client-config.wsdd.
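
A hard-coded check of this kind could look roughly like the sketch below (hypothetical names, not the code committed for this issue). One assumption to flag: the boundaries used here are those of the `Char` production in the XML 1.0 specification, under which 57344 (0xE000) and the code points 65536 to 1114111 are valid, so they differ slightly from the upper bounds listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: replaces numeric character references that XML 1.0 forbids.
final class InvalidReferenceFilter {
    // Decimal (&#11;) or hexadecimal (&#xB;) references. The digit limits keep
    // the value within a long; longer references are left untouched.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,8})|([0-9]{1,10}));");

    /** XML 1.0 Char production: the code points a document may contain. */
    static boolean isValidXmlChar(long cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Keeps valid references as they are and swaps invalid ones for the replacement. */
    static String replaceInvalid(String raw, String replacement) {
        Matcher m = CHAR_REF.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long cp = m.group(1) != null
                    ? Long.parseLong(m.group(1), 16)
                    : Long.parseLong(m.group(2), 10);
            String kept = isValidXmlChar(cp) ? m.group() : replacement;
            m.appendReplacement(out, Matcher.quoteReplacement(kept));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceInvalid("a&#11;b&#x41;c&#0;d", "?"));
    }
}
```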

Some Optimization
-----------------
A Message Handler, once configured, intercepts every request and/or response 
of the web service. It would be better if the handler were invoked only when 
required, which helps in cases where the response does not contain invalid XML 
characters. The client can make a normal WS call without expecting the handler 
to intercept it. If the call fails while parsing the response, the caller can 
make a second attempt at the same WS call, this time requesting the handler to 
come into action. This can be achieved using a SOAP header called 
PRECONDITION_HEADER. The handler will be designed to work only if a 
PRECONDITION_HEADER is present in the request.
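
The two-attempt flow can be sketched as follows. This is an illustration only: `ServiceCall` is a hypothetical interface, the boolean flag stands in for adding PRECONDITION_HEADER to the request, and the parse failure is assumed to reach the caller as a SAXException (in an Axis client it may arrive wrapped in another exception type).

```java
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch of "call normally, retry with filtering on parse failure".
final class RetryWithFilter {
    /** Stand-in for a web service call; the flag plays the role of the header. */
    interface ServiceCall<T> {
        T invoke(boolean requestFiltering) throws SAXException;
    }

    static <T> T callWithFallback(ServiceCall<T> call) {
        try {
            return call.invoke(false); // normal call, handler stays idle
        } catch (SAXException first) {
            try {
                return call.invoke(true); // second attempt, handler filters
            } catch (SAXException second) {
                // Filtering did not help; report instead of failing silently.
                throw new IllegalStateException("response still unparseable", second);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated service whose unfiltered response cannot be parsed.
        String result = callWithFallback(filtering -> {
            if (!filtering) {
                throw new SAXParseException("invalid XML character", null);
            }
            return "response parsed after filtering";
        });
        System.out.println(result);
    }
}
```

With this arrangement, responses that are already clean pay for one call only, and the filtering cost is incurred just for the lists that need it.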

Original comment by th.nitendra on 7 Sep 2010 at 9:10

GoogleCodeExporter commented 9 years ago
Issue is fixed. Revisions:
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=878
to
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=882

Original comment by th.nitendra on 6 Oct 2010 at 7:16

GoogleCodeExporter commented 9 years ago

Original comment by Shweta.v...@gmail.com on 6 Oct 2010 at 8:04

GoogleCodeExporter commented 9 years ago

Original comment by deshpa...@google.com on 6 May 2011 at 12:32