DataONEorg / dataone-indexer

DataONE Indexer subsystem
Apache License 2.0

Some characters in schema.org documents cause them not to be indexed #4

Open taojing2002 opened 3 years ago

taojing2002 commented 3 years ago

When we indexed objects from BCODMO, we saw errors like:

[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:241) URL: http://localhost:8983/solr/search_core/update?commit=true
[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:242) Post: 
[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:245) <?xml version="1.0" encoding="utf-8"?>
<add><doc><field name="id">sha256:26a061c8f8177d417d5ed8b29d8c6cf62f0d9a96bbd08a6afa3e7bc309bc9624</field><field name="seriesId">http://lod.bco-dmo.org/id/dataset/3782</field><field name="fileName">tmpb4ps2wqk</field><field name="mediaType">application/ld+json</field><field name="formatId">science-on-schema.org/Dataset;ld+json</field><field name="formatType">METADATA</field><field name="size">37672</field><field name="checksum">b40b52ac2651f915f0a7b29da8b20bf2</field><field name="submitter">http://orcid.org/0000-0002-6513-4996</field><field name="checksumAlgorithm">MD5</field><field name="rightsHolder">urn:node:BCODMO</field><field name="replicationAllowed">true</field><field name="numberReplicas">3</field><field name="dateUploaded">2019-05-29T20:24:00.000Z</field><field name="dateModified">2021-07-17T20:50:37.000Z</field><field name="datasource">urn:node:BCODMO</field><field name="authoritativeMN">urn:node:BCODMO</field><field name="replicaMN">urn:node:BCODMO</field><field name="replicaMN">urn:node:CN</field><field name="replicationStatus">completed</field><field name="replicationStatus">completed</field><field name="replicaVerifiedDate">2021-07-18T00:17:18.899Z</field><field name="replicaVerifiedDate">2021-07-18T00:17:18.939Z</field><field name="readPermission">public</field><field name="isPublic">true</field><field name="dataUrl">https://cn.dataone.org/cn/v2/resolve/sha256%3A26a061c8f8177d417d5ed8b29d8c6cf62f0d9a96bbd08a6afa3e7bc309bc9624</field><field name="abstract">&lt;p&gt;CTD measurements at water sample depths and Niskin bottle water samples from the Bermuda Atlantic Time-series Study (BATS) and from Station S, located 25 km SE of Bermuda (32°10&#4;N, 64°30&#4;W)&amp;nbsp;Measurements have been collected since 1988 and include nutrients, biogeochemical concentration, bacterial enumeration, and cyanobacteria.&lt;/p&gt;
</field><field name="title">Niskin bottle water samples and CTD measurements at water sample depths collected at Bermuda Atlantic Time-Series sites in the Sargasso Sea ongoing from 1955-01-29 (BATS project)</field><field name="label">Niskin bottle samples</field><field name="awardNumber">http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0752366</field><field name="awardTitle">OCE-0752366</field><field name="author">Rodney Johnson</field><field name="pubDate">2019-05-29T00:00:00.000Z</field><field name="funderIdentifier">https://doi.org/10.13039/100000141</field><field name="funderName">NSF Division of Ocean Sciences</field><field name="origin">Rodney Johnson</field><field name="keywords">oceans</field><field name="southBoundCoord">19.225</field><field name="westBoundCoord">-74.6</field><field name="northBoundCoord">39.455</field><field name="eastBoundCoord">-59.649</field><field name="beginDate">1955-01-29T00:00:00.000Z</field><field name="endDate">2016-12-18T00:00:00.000Z</field><field name="parameter">Sigma-Theta</field><field name="parameter">Nitrite-1</field><field name="parameter">pig7 (19-Hexfu ng/kg)</field><field name="parameter">pig20 (a-Carotene ng/kg)</field><field name="parameter">Nitrite-1</field><field name="parameter">pig5 (19-Butfu ng/kg)</field><field name="parameter">cast number Cast number; 1-80=CTD casts; 81-99=Hydrocasts (i.e. 
83 = Data from Hydrocast number 3)</field><field name="parameter">Prochlorococcus</field><field name="parameter">Oxygen-1</field><field name="parameter">Cruise type; 1=BATS core; 2=BATS Bloom a; 3=BATS Bloom b; 5=BATS Validation cruise; 6=Hydrostation</field><field name="parameter">longitude with positive values East</field><field name="parameter">Bacteria enumeration</field><field name="parameter">pig12 (Zea+lut ng/kg)</field><field name="parameter">Salinity-1</field><field name="parameter">Nitrate+Nitrite-1</field><field name="parameter">pig19 (Zeax ng/kg)</field><field name="parameter">Nitrate+Nitrite-1</field><field name="parameter">pig4 (peri ng/kg)</field><field name="parameter">cruise number</field><field name="parameter">Particulate lithogenic silica</field><field name="parameter">date and time represented in ISO 8601 format</field><field name="parameter">Latitude with positive values North</field><field name="parameter">TN NOTE: Prior to BATS 121; DON is reported instead of TON</field><field name="parameter">pig11 (Diat ng/kg)</field><field name="parameter">CTD Salinity</field><field name="parameter">A unique bottle id which identifies cruise; cast; and Nisken number</field><field name="parameter">Nanoeukaryotes</field><field name="parameter">pig18 (Lutein ng/kg)</field><field name="parameter">Alkalinity</field><field name="parameter">pig3 (chl c1+c2 ng/kg)</field><field name="parameter">Particulate biogenic silica</field><field name="parameter">pig16 (Turn Chl a ug/kg)</field><field name="parameter">Pressure</field><field name="parameter">pig21 (b-Carotene ng/kg)</field><field name="parameter">pig17 (Turn Phaeo ug/kg)</field><field name="parameter">Phosphate-1</field><field name="parameter">dissolved inorganic carbon</field><field name="parameter">Synechococcus</field><field name="parameter">pig9 (Diad ng/kg)</field><field name="parameter">Phosphate-1</field><field name="parameter">PON</field><field name="parameter">pig1 (Chl3 c3 ng/kg)</field><field 
name="parameter">Temperature ITS-90</field><field name="parameter">pig14 (Chl a ng/kg)</field><field name="parameter">POC</field><field name="parameter">pig15 (a+b Carotene ng/kg)</field><field name="parameter">Oxy Anomaly-1</field><field name="parameter">pig8 (Pras ng/kg)</field><field name="parameter">textual description of the cruise type</field><field name="parameter">Total dissolved Phosphorus</field><field name="parameter">Niskin number</field><field name="parameter">Low-level phosphorus</field><field name="parameter">Picoeukaryotes</field><field name="parameter">depth</field><field name="parameter">name of the originators file</field><field name="parameter">pig13 (Chl b ng/kg)</field><field name="parameter">Silicate-1</field><field name="parameter">Decimal Year</field><field name="parameter">pig10 (Allox ng/kg)</field><field name="parameter">Oxygen Fix Temp</field><field name="parameter">Silicate-1</field><field name="parameter">POP</field><field name="parameter">TOC</field><field name="parameter">pig6 (fuco ng/kg)</field><field name="parameter">pig2 (chlidea ng/kg)</field><field name="edition">1</field><field name="serviceEndpoint">https://www.bco-dmo.org/dataset/3782</field></doc></add>
[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:246) 

Response: 

[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:249) <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">1</int></lst><lst name="error"><str name="msg">[com.ctc.wstx.exc.WstxLazyException] Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]</str><str name="trace">[com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:671)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3505)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:804)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:403)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:249)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:451)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2342)
    at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2288)
    at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1147)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4492)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:3964)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3543)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3503)
    ... 32 more
</str><int name="code">500</int></lst>
</response>

[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:238) Unable to write to stream
java.io.IOException: unable to update solr, non 200 response code.
    at org.dataone.cn.indexer.solrhttp.HTTPService.sendUpdate(HTTPService.java:139)
    at org.dataone.cn.indexer.solrhttp.HTTPService.sendUpdate(HTTPService.java:117)
    at org.dataone.cn.indexer.SolrIndexService.sendCommand(SolrIndexService.java:343)
    at org.dataone.cn.indexer.SolrIndexService.insertIntoIndex(SolrIndexService.java:307)
    at org.dataone.cn.index.processor.IndexTaskUpdateProcessor.process(IndexTaskUpdateProcessor.java:50)
    at org.dataone.cn.index.processor.IndexTaskProcessor.processTask(IndexTaskProcessor.java:288)
    at org.dataone.cn.index.processor.IndexTaskProcessor.access$000(IndexTaskProcessor.java:80)
    at org.dataone.cn.index.processor.IndexTaskProcessor$1.run(IndexTaskProcessor.java:265)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

It looks like the document contains special characters that we need to escape.

taojing2002 commented 1 year ago

It seems the document contains strings like (32\u00b010\u0004N, 64\u00b030\u0004W) in the description, and the parser can't handle the \u0004 character.

mbjones commented 1 year ago

Can you show those characters in the context of the schema.org document, please?

taojing2002 commented 1 year ago

The original string is:

(32\u00b010\u0004N, 64\u00b030\u0004W)

After expansion (adding context):

(32°10\u0004N, 64°30\u0004W)

In the Solr doc before sending it to the Solr server:

(32°10&#4;N, 64°30&#4;W)

taojing2002 commented 1 year ago

In another description, it has the value:

32\u00b0 10'N, 64\u00b0 30'W

After expansion and compaction:

32° 10'N, 64° 30'W

The solr doc is:

32° 10&apos;N, 64° 30&apos;W

It works well.

taojing2002 commented 1 year ago

It seems the author used \u0004 (EOT, End of Transmission) in place of the apostrophe, which is \u0027. After I replaced \u0004 with \u0027, everything worked. But I am not sure why Solr can't handle EOT (&#4;).
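
A likely explanation, offered as a sketch rather than a confirmed diagnosis: XML 1.0 only permits #x9, #xA, #xD, and characters from #x20 upward (excluding surrogates, #xFFFE, and #xFFFF), and a numeric character reference such as &#4; must also resolve to a permitted character. A conforming parser therefore rejects &#4; regardless of how it is written, which matches the Woodstox message in the trace above. The class and method names below are hypothetical, and the reproduction uses the JDK's built-in XML parser as a stand-in for the Woodstox parser that Solr uses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class XmlCharCheck {

    /** True if the code point matches the XML 1.0 "Char" production. */
    public static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the JDK's XML parser accepts the given document text. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Covers parser configuration, I/O, and well-formedness errors.
            return false;
        }
    }

    public static void main(String[] args) {
        // EOT (0x04) versus the apostrophe (0x27).
        System.out.println(isValidXml10Char(0x04));
        System.out.println(isValidXml10Char(0x27));
        // The same two characters written as XML references.
        System.out.println(parses("<field>10&#4;N</field>"));
        System.out.println(parses("<field>10&apos;N</field>"));
    }
}
```

If this reading is right, escaping alone cannot help: XML 1.0 has no valid representation for U+0004, so the character has to be removed or replaced before the update document is built.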

taojing2002 commented 1 year ago

We need to escape the special character in dataone-indexer.
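
One possible approach, sketched under assumptions: because XML 1.0 has no escape for characters like U+0004, the indexer could drop them from field values before serializing the Solr update document. `SolrFieldSanitizer` and `stripInvalidXmlChars` are hypothetical names, not existing dataone-indexer code, and whether to drop the character or substitute something else is a policy choice left open here.

```java
public class SolrFieldSanitizer {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    public static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the value with every code point that XML 1.0 forbids removed. */
    public static String stripInvalidXmlChars(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(value.length());
        // Walk by code point so that valid surrogate pairs are kept intact
        // and unpaired surrogates are dropped.
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (isValidXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The coordinate string from the BCODMO description, with EOT characters.
        String raw = "(32\u00b010\u0004N, 64\u00b030\u0004W)";
        System.out.println(stripInvalidXmlChars(raw));
    }
}
```

Stripping keeps the document indexable but loses the minute mark the author presumably intended; replacing U+0004 with an apostrophe would restore it only for this particular dataset.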