Original comment by mwarti...@gmail.com
on 23 Sep 2008 at 3:40
The issue is reproducible for the special character present in the PDF with the
attached template.
This failure occurs because the SharePoint web service itself fails when trying
to fetch the documents under the given document library.
We have tried to reproduce the same issue by putting different special
characters, including Arabic, Hebrew, Chinese and other Latin characters, in
the document's metadata. But the connector's behavior is fine in these cases
and it does not report any exceptions or failures.
Please provide more details about the exact special character (like the
character set) that is causing this failure.
Original comment by amit.per...@gmail.com
on 22 Oct 2008 at 2:23
Hi Cyrille,
Could you please reply with more details about the exact special character
(like the character set) that is causing this failure?
Regards,
Shashank
Original comment by shashank...@gmail.com
on 18 Nov 2008 at 10:12
Hi Shashank,
I investigated that specific character a little.
It appears to be a vertical tab (ASCII 0x0B), "\v".
It is an ASCII control character that is not valid from the XML point of view.
I found a lot of articles online about developers who encounter the same kind
of problem when parsing XML content with such characters in the data.
They often simply skip the character.
The best article I found is
http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
in which they make things pretty clear.
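To illustrate the "simply skip the character" approach, here is a minimal Java
sketch. It assumes the text is available as a String before it is serialized to
XML, and it uses the XML 1.0 Char production to decide what is allowed. The
class and method names are hypothetical and are not part of the connector.

```java
final class XmlCharStripper {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, skipping every code point that XML 1.0 does not allow. */
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, a title containing a vertical tab between "Title" and "Draft"
comes back as "TitleDraft", while tabs and line feeds are left alone.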
For the problem with the connector, the main issue is that the mere presence of
such a character in a single document of a SharePoint library will prevent the
connector from traversing the entire content of that document library, with no
error reported in the logs.
Like you, I also noticed that the SharePoint web service did not like the
character either. I found the source of the problem thanks to errors when
trying to fetch the content of my library through the RSS view of SharePoint.
Hope this helps.
Best regards,
Cyril
Original comment by cyrille....@gmail.com
on 21 Nov 2008 at 11:43
I think this is a generic connector manager issue:
http://code.google.com/p/google-enterprise-connector-manager/issues/detail?id=128
Original comment by jl1615@gmail.com
on 11 Apr 2009 at 8:21
This is a typical SharePoint web service issue: the call fails when the
response contains an invalid XML character.
The CM issue mentioned above is very similar to this one but does not have any
direct relation to it. The reason is that the connector itself is not able to
send anything to CM, so there is no point in CM doing any parsing or validation
of the data.
The SharePoint connector makes list-level calls to get the documents from a
SharePoint list/library. If such a call fails, no documents are received,
whatever the reason for the failure. Here, it is because of the invalid XML
character in the metadata of one of the documents.
Original comment by th.nitendra
on 13 Apr 2009 at 2:56
Since it is a problem with the SharePoint web service, this will not be fixed.
Original comment by rakeshs101981@gmail.com
on 29 Oct 2009 at 2:29
The error should be handled more gracefully and reported appropriately in the logs.
Original comment by darsh...@google.com
on 29 Oct 2009 at 10:51
[deleted comment]
We reproduced this scenario using the list template shared by Cyril.
Connector 2.0.x Behavior:
The web service call gets a SAXException due to an invalid XML character in the
web service response:
org.xml.sax.SAXParseException: Character reference "" is an invalid XML character
Due to the above exception, all the docs after the problematic list get skipped.
This also sets changeToken = null and the total docs to be sent from the list
to 0. This further implies that all docs from the current list are done, and
hence the last feed for the list itself is sent and the change token is 'null'
(ideally it would be a value from where we can continue in future batch
traversals).
Oct 28, 2009 12:47:05 PM [Traverse sharepoint-connector] com.google.enterprise.connector.sharepoint.spiimpl.SPDocumentList nextDocument
INFO: Sending DocID [ {20001A30-5624-4089-B7C4-FB36D1431020} ], docURL [ http://mycompany.com:80/records/mylist/Forms/AllItems.aspx ] to CM for ADD.
• Since the list is marked as done and the change token is 'null', you will
find that the same docs are re-sent in future batch traversals. For the web
service, changeToken=null means you are starting fresh.
• The same problem should hold true for other folders as well.
To be fixed in 2.4 release.
Original comment by rakeshs101981@gmail.com
on 31 Oct 2009 at 4:36
Though the problem cannot be completely solved from the connector end, the
connector must recover from such an exception and progress with the crawl
without any loss of data. There can be three approaches for this.
Approach1: Do not update the list's state until the web service call starts
succeeding. This means that the crawl will not proceed for the current list and
change detection will never be initiated.
This approach, though not a complete solution for the problem, is simple to
implement and has been done as a quick workaround for the problem. Refer to
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=415
Approach2: Skip the current set of documents and update the list's state with
the next expected value of LastDoc, which is going to be (LastDoc + batchHint).
This way, we'll be able to progress with the crawl and change detection. But
the biggest drawback of this approach is that a certain set of documents will
be skipped forever. These documents will not be crawled even if, in future, the
problem at the SharePoint end is resolved and the web service call starts
succeeding.
Approach3: Split the current batchHint into sub-hints, make multiple web
service calls with smaller batch hints, and locate the problematic document, if
any, using a binary search approach. This way we'll skip only the problematic
documents without any other loss of data. But this may heavily increase the
number of web service calls in worst-case scenarios. Also, this approach is
suitable only for the SAXParseException and should not be followed in other
cases. Implementation is a bit complicated.
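A rough Java sketch of how Approach3 could work is given here, only to make the
splitting idea concrete. BatchFetcher is a hypothetical stand-in for the
list-level web service call, and documents are represented as plain strings;
none of these names come from the connector code.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSplitter {

    /** Hypothetical stand-in for the list-level web service call. */
    interface BatchFetcher {
        /** Fetches count documents starting at start; throws if the response cannot be parsed. */
        List<String> fetch(int start, int count) throws Exception;
    }

    /**
     * Fetches the range [start, start + count). When a call fails, the range is
     * halved and each half is retried, until the failing single documents are
     * isolated and recorded in skipped.
     */
    static void fetchSkippingBad(BatchFetcher fetcher, int start, int count,
                                 List<String> docs, List<Integer> skipped) {
        if (count <= 0) {
            return;
        }
        try {
            docs.addAll(fetcher.fetch(start, count));
        } catch (Exception e) {
            // Assumption: a real implementation would split only on a
            // SAXParseException and rethrow anything else.
            if (count == 1) {
                skipped.add(start);
                return;
            }
            int half = count / 2;
            fetchSkippingBad(fetcher, start, half, docs, skipped);
            fetchSkippingBad(fetcher, start + half, count - half, docs, skipped);
        }
    }
}
```

For a batch of n documents with one bad document, this makes on the order of
2 * log2(n) extra calls. In the worst case, where many documents are bad, the
number of calls grows towards 2n, which is the cost mentioned above.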
Original comment by th.nitendra
on 5 Nov 2009 at 8:17
I vote for Approach #3
Original comment by darsh...@google.com
on 6 Nov 2009 at 12:20
If #3 is going to take a very long time, we can go for #2 and log clearly which
part was excluded. If even #2 does not fit in the 2.4 timelines, then we need
to have at least #1 in place.
Original comment by darsh...@google.com
on 16 Nov 2009 at 10:15
Original comment by rakeshs101981@gmail.com
on 9 Dec 2009 at 9:50
Original comment by j.dars...@gmail.com
on 15 Dec 2009 at 11:59
Original comment by j.dars...@gmail.com
on 16 Dec 2009 at 12:00
[deleted comment]
[deleted comment]
A better solution could be to intercept the web service response before it
reaches the connector. The interceptor will remove all invalid XML characters
so that the connector can do its job smoothly. Message Handlers come to the
rescue.
Message Handlers are pluggable components of the web service stack which can
intercept the incoming and outgoing SOAP packets to add various Quality of
Service features. Handlers are to web services as filters are to servlets and
interceptors are to EJBs. The idea is to write a handler that intercepts the
SOAP response and checks whether it has any invalid XML characters; if any are
found, it replaces all such characters with a fixed replacement value. This
approach is far cleaner and simpler than the earlier approaches that we have
discussed. No mathematics or complicated algorithms are required.
The solution may seem infeasible at first because even the handlers work with
XML. Fortunately, Axis provides its own handler framework (apart from the one
that JAX-RPC recommends) and these handlers can play their role right up front,
before anything else is done with the incoming/outgoing packets. This also
means that these handlers do not necessarily work with XML; instead, they can
work with strings as well.
All the deployment-related configuration for the handler is done in Axis's
client-config.wsdd file.
What to replace, and with what?
-------------------------------
The main question for the handler is what it should remove from the response
and what to put in place of the filtered characters. Ideally, all invalid XML
characters should be replaced. But doing that would require finding all those
characters which can cause parsing to fail. Instead, it would be better if
users decide at deployment time what should be replaced. Hence, the
client-config.wsdd contains the following information:
1) the patterns to be replaced
2) the replacement value, which will be used for all the replacements
With a little more effort, we can make the solution much better. From Issue 50,
we are already aware of certain invalid XML characters that always cause
parsing to fail. If we hard-code checks for at least such characters in the
implementation, we can significantly reduce the deployment overhead. As per the
investigation, the characters that Issue 50 talks about are numeric character
references. References with the following integer values are invalid XML and
cause parsing to fail:
0 to 8
11 to 12
14 to 31
55296 to 57343
65534 to 65535, and anything above 1114111
The above knowledge can be incorporated into the implementation eliminating the
need to specify patterns for such invalid references in client-config.wsdd.
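As a sketch of what the hard-coded check could look like, the Java snippet
below scans the raw response string for numeric character references and
replaces the ones whose value is not a legal XML 1.0 character. The class and
method names are hypothetical, and this is not the code that went into the
connector; it only shows the string-level filtering idea.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InvalidReferenceFilter {

    // Matches decimal (&#11;) and hexadecimal (&#xB;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every invalid numeric character reference with the given replacement. */
    static String replaceInvalidReferences(String xml, String replacement) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large for an int, so it cannot be a valid code point
            }
            String result = isValidXmlChar(cp) ? m.group() : replacement;
            // quoteReplacement keeps '$' and '\' in the text from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the filter works on the string and not on a parsed document, it can run
before any XML parser sees the response, which is the property the handler
relies on.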
Some Optimization
-----------------
A Message Handler, once configured, intercepts every request and/or response
concerning the web service. It would be nice if the handler were invoked only
when required. This would help in cases where the response does not contain
invalid XML characters. The client can make a normal WS call without expecting
the handler to intercept the call. If the call fails while parsing the
response, the caller can make a second attempt at the same WS call, this time
requesting the handler to come into action. This can be achieved using a SOAP
header called PRECONDITION_HEADER. The handler will be designed to work only if
a PRECONDITION_HEADER is present in the request.
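The two-attempt flow could look roughly like the Java sketch below. WsCall is a
hypothetical wrapper around a single web service call, and its boolean argument
stands in for adding the PRECONDITION_HEADER to the request; how the header is
actually attached to the Axis call is not shown here.

```java
import org.xml.sax.SAXParseException;

final class FilteredRetry {

    /** Hypothetical wrapper for one web service call. */
    interface WsCall<T> {
        /** requestFiltering = true stands in for sending the PRECONDITION_HEADER. */
        T invoke(boolean requestFiltering) throws Exception;
    }

    /** Returns true if a SAXParseException appears anywhere in the cause chain. */
    static boolean causedByParseFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SAXParseException) {
                return true;
            }
        }
        return false;
    }

    /** Makes a normal call first and retries with filtering only after a parse failure. */
    static <T> T callWithFallback(WsCall<T> call) throws Exception {
        try {
            return call.invoke(false);
        } catch (Exception e) {
            if (!causedByParseFailure(e)) {
                throw e; // unrelated failure: do not hide it behind a retry
            }
            return call.invoke(true);
        }
    }
}
```

The common case, a clean response, costs one call with no handler work, and
only a response that fails to parse pays for the second call.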
Original comment by th.nitendra
on 7 Sep 2010 at 9:10
Issue is fixed. Revision:
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=878
to
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=882
Original comment by th.nitendra
on 6 Oct 2010 at 7:16
Original comment by Shweta.v...@gmail.com
on 6 Oct 2010 at 8:04
Original comment by deshpa...@google.com
on 6 May 2011 at 12:32
Original issue reported on code.google.com by
cyrille....@gmail.com
on 20 Aug 2008 at 7:57