codelibs / fess-ds-atlassian

DataStore Crawler for JIRA/Confluence
Apache License 2.0

Crawler times out and stops #11

Closed siwyd closed 5 years ago

siwyd commented 5 years ago

I've set up Fess to crawl a Confluence site. It seems to work, but after a few pages the crawl halts because of a timeout. What steps can I take to keep the crawler going? I'm also unsure how to make the crawler run at a slower rate. A slow crawl rate is fine for us.

fess_1     | 2019-06-18 10:56:44,472 [20190618105018-1] INFO  Sent 5 docs (Doc:{send 75ms}, Mem:{used 156MB, heap 512MB, max 512MB})
fess_1     | 2019-06-18 10:56:45,879 [20190618105018-1] INFO  Sent 5 docs (Doc:{send 62ms}, Mem:{used 174MB, heap 512MB, max 512MB})
fess_1     | 2019-06-18 10:56:47,167 [20190618105018-1] INFO  Sent 5 docs (Doc:{send 49ms}, Mem:{used 170MB, heap 512MB, max 512MB})
fess_1     | 2019-06-18 10:56:48,700 [20190618105018-1] INFO  Sent 5 docs (Doc:{send 77ms}, Mem:{used 166MB, heap 512MB, max 512MB})
fess_1     | 2019-06-18 10:56:50,131 [20190618105018-1] INFO  Sent 5 docs (Doc:{send 96ms}, Mem:{used 168MB, heap 512MB, max 512MB})
fess_1     | 2019-06-18 10:57:10,153 [20190618105018-1] ERROR Failed to process a data crawling: Confluence
fess_1     | org.codelibs.fess.ds.atlassian.AtlassianDataStoreException: Failed to request: http://myconfluence/confluence/rest/api/latest/content?expand=space,version,body.view&start=450&limit=25
fess_1     |    at org.codelibs.fess.ds.atlassian.api.confluence.content.GetContentsRequest.execute(GetContentsRequest.java:66) ~[fess-ds-atlassian-13.1.0.jar:?]
fess_1     |    at org.codelibs.fess.ds.atlassian.ConfluenceDataStore.storeData(ConfluenceDataStore.java:120) ~[fess-ds-atlassian-13.1.0.jar:?]
fess_1     |    at org.codelibs.fess.ds.AbstractDataStore.store(AbstractDataStore.java:110) ~[classes/:?]
fess_1     |    at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:227) [classes/:?]
fess_1     |    at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:213) [classes/:?]
fess_1     | Caused by: java.net.SocketTimeoutException: Read timed out
fess_1     |    at java.net.SocketInputStream.socketRead0(Native Method) ~[?:?]
fess_1     |    at java.net.SocketInputStream.socketRead(SocketInputStream.java:115) ~[?:?]
fess_1     |    at java.net.SocketInputStream.read(SocketInputStream.java:168) ~[?:?]
fess_1     |    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
fess_1     |    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448) ~[?:?]
fess_1     |    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68) ~[?:?]
fess_1     |    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1104) ~[?:?]
fess_1     |    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:823) ~[?:?]
fess_1     |    at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
fess_1     |    at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
fess_1     |    at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
fess_1     |    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:746) ~[?:?]
fess_1     |    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:689) ~[?:?]
fess_1     |    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1604) ~[?:?]
fess_1     |    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1509) ~[?:?]
fess_1     |    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?]
fess_1     |    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:329) ~[?:?]
fess_1     |    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37) ~[google-http-client-1.25.0.jar:1.25.0]
fess_1     |    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:105) ~[google-http-client-1.25.0.jar:1.25.0]
fess_1     |    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) ~[google-http-client-1.25.0.jar:1.25.0]
fess_1     |    at org.codelibs.fess.ds.atlassian.api.confluence.content.GetContentsRequest.execute(GetContentsRequest.java:51) ~[fess-ds-atlassian-13.1.0.jar:?]
fess_1     |    ... 4 more
fess_1     | 2019-06-18 10:57:10,304 [20190618105018-1] INFO  Deleted 0 old docs.
fess_1     | 2019-06-18 10:57:10,308 [DataStoreCrawler] INFO  [EXEC TIME] crawling time: 402419ms
fess_1     | 2019-06-18 10:57:10,308 [main] INFO  Finished Crawler
fess_1     | 2019-06-18 10:57:10,403 [main] INFO  [CRAWL INFO] DataCrawlExecTime=402419,DataCrawlEndTime=2019-06-18T10:57:10.308+0000,CrawlerEndTime=2019-06-18T10:57:10.309+0000,DataIndexExecTime=15117,CrawlerStatus=true,CrawlerStartTime=2019-06-18T10:50:27.776+0000,WebFsCrawlEndTime=2019-06-18T10:50:27.907+0000,DataIndexSize=450,CrawlerExecTime=402533,DataCrawlStartTime=2019-06-18T10:50:27.839+0000,WebFsCrawlStartTime=2019-06-18T10:50:27.837+0000
fess_1     | 2019-06-18 10:57:10,410 [main] INFO  Destroyed LaContainer.

siwyd commented 5 years ago

Some more info: I think the Confluence API request in question simply takes a long time (it does complete when I run it manually), so I think I need a larger timeout. How could I go about increasing the timeout?
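
For illustration only, here is a minimal sketch of what a larger read timeout plus a simple retry looks like at the `java.net.HttpURLConnection` level, which is the layer where the `SocketTimeoutException` in the trace above is raised. In the log, the error appears about 20 seconds after the last "Sent 5 docs" line, which is consistent with a fixed read timeout expiring. This is not the plugin's own code or configuration: the class `TimeoutSketch`, the helpers `open` and `withRetries`, and the timeout values are all hypothetical, and whether fess-ds-atlassian 13.1.0 exposes such a setting is not confirmed here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.util.concurrent.Callable;

public class TimeoutSketch {

    // Hypothetical helper: prepares (but does not yet open) an HTTP connection
    // with explicit timeouts in milliseconds. A value of 0 means "wait forever".
    public static HttpURLConnection open(String url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMs);
        conn.setReadTimeout(readTimeoutMs);
        return conn;
    }

    // Hypothetical helper: runs a request and retries only when it times out,
    // sleeping pauseMs between attempts. Other exceptions propagate immediately.
    public static <T> T withRetries(Callable<T> request, int maxAttempts, long pauseMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Assumed values: 10 s to connect, 120 s to wait for a slow page of results.
        HttpURLConnection conn = open(
                "http://myconfluence/confluence/rest/api/latest/content?start=450&limit=25",
                10_000, 120_000);
        // No network traffic has happened yet; this only shows the configured value.
        System.out.println("read timeout ms: " + conn.getReadTimeout());
    }
}
```

An actual request would be wrapped as `withRetries(() -> open(url, 10_000, 120_000).getResponseCode(), 3, 5_000)`. The pause between attempts also slows the request rate after a timeout, which may help when the server is simply under load.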