codelibs / gitbucket-fess-plugin

GitBucket plugin for Fess

[question] what kind of http request method is used for file crawling? #20

Open sho-suzuki opened 6 years ago

sho-suzuki commented 6 years ago

plugin version

1.3.1

gitbucket version

4.20

what is the matter

Under a proxy environment, I can't get content from files, but I can get issues and wikis. fess-crawler.log is as follows:

# file crawling log
2018-02-13 18:12:32,511 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge
2018-02-13 18:12:35,028 [5DFNjmEBO7Desvq7XhyO-1] WARN  Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e
org.codelibs.fess.crawler.exception.CrawlingAccessException: Failed to parse http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:184) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
        at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.codelibs.fess.crawler.exception.MultipleCrawlingAccessException: 
Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:95) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
        ... 9 more

# issue crawl log
2018-02-13 18:43:02,794 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/17

On Linux, both requests seem to return the same result.

# file request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/README.md
{"message":"Requires authentication"}
# issue request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/21
{"message":"Requires authentication"}

I think it may be a problem with the proxy settings (the proxy may be discarding the file requests). I would like to know what HTTP request the file crawl API uses.

thanks.

keiichiw commented 6 years ago

I'm not sure whether your problem is caused by the proxy, but could you try the following command?

$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/contents/<file name>?ref=<commit hash>&large_file=true"

The value <token> is the one generated by GitBucket here.

The value <commit hash> is b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e in your case. It can be obtained by:

$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/git/refs/heads/master"

If you want to learn more about how Fess fetches files, see GitBucketDataStoreImpl.java.
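The curl command above can also be written as a small Python sketch, which makes the pieces of the request explicit: a plain GET (the debug stack trace later in this thread passes through `HcHttpClient.doGet`), an `Authorization: token ...` header, and the `ref` and `large_file` query parameters. The function name `build_contents_request` and its arguments are hypothetical and only for illustration; the function builds the request without sending it.

```python
import urllib.parse
import urllib.request


def build_contents_request(base_url, owner, repo, path, ref, token):
    """Build (but do not send) a GET request for one file via the contents API.

    base_url is the GitBucket root, e.g. "http://gitbucket:8080/gitbucket".
    All names here are illustrative; they are not taken from the plugin source.
    """
    # Same query parameters as in the crawler log: ref=<commit hash>&large_file=true
    query = urllib.parse.urlencode({"ref": ref, "large_file": "true"})
    url = (
        f"{base_url}/api/v3/repos/{urllib.parse.quote(owner)}/"
        f"{urllib.parse.quote(repo)}/contents/{urllib.parse.quote(path)}?{query}"
    )
    # Token authentication, as in the curl example
    headers = {"Authorization": f"token {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")
```

To actually send it from the machine that runs the crawler, something like `urllib.request.urlopen(req, timeout=10)` would do; note that `urlopen` honours the `http_proxy` environment variable by default, so the result may differ from what the crawler sees.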

sho-suzuki commented 6 years ago

Thanks @kw-udon. I got a response when I ran the command you suggested.

# curl -H "Authorization: token 284530a64e55176f9ed9*********" "http://gitbucket:8080/gitbucket/api/v3/repos/root/name/contents/hoge?ref=efcd9adbec49f73f762b7b2127153593024e4bea&large_file=true"

{"type":"file","name":"hoge","path":"hoge","sha":"efcd9adbec49f73f762b7b2127153593024e4bea","content":"IyBBcHAgYXJ0aWZhY3RzCi9fYnVpbGQKLLmV4cw==","encoding":"base64","download_url":"http://gitbucket:8080/gitbucket/api/v3/repos/root/name/raw/efcd9adbec49f73f762b7b2127153593024e4bea/hoge"}

So the proxy neither discarded nor refused the request.
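The JSON response above carries the file body in the `content` field, with `encoding` set to `base64`. A minimal sketch of how a client could turn such a response back into text is shown here; the helper name `decode_contents_response` is hypothetical, and the sample payload in the usage note is made up rather than copied from the response above.

```python
import base64
import json


def decode_contents_response(body):
    """Return the decoded text of a contents-API JSON response.

    Assumes the response has "content" and "encoding" fields, as in the
    response shown in this thread, and that the file is UTF-8 text.
    """
    doc = json.loads(body)
    if doc.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {doc.get('encoding')!r}")
    return base64.b64decode(doc["content"]).decode("utf-8")
```

For example, with a made-up body `{"type": "file", "name": "hoge", "content": "aGVsbG8K", "encoding": "base64"}`, the function returns the string `hello` followed by a newline.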

keiichiw commented 6 years ago

A MultipleCrawlingAccessException occurred in your log file, but I don't know what can raise this exception. Do you have any idea, @marevol?

marevol commented 6 years ago

Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):

The cause is the message above. It's a network problem. I think the problem is a proxy setting or something similar.
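One way to separate "connection refused" from "connection timed out" is to try a bare TCP connection from the host that runs the crawler, without any HTTP or proxy layer involved. The following is a rough sketch of such a probe; `probe_tcp` is a hypothetical helper, and the host and port to pass are whatever the crawler log shows (here `gitbucket` and `8080`).

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a plain TCP connect and report how it ended.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # Direct connection attempt; no HTTP request is sent
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target (or something in between) actively rejected the connection
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets silently dropped
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on
        return f"error: {exc}"
```

Running `probe_tcp("gitbucket", 8080)` on the Fess host and comparing it with the same call on the host where curl succeeds would show whether the two machines reach the same endpoint.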

sho-suzuki commented 6 years ago

@marevol @kw-udon There is only one crawler that crawls GitBucket. How can I get detailed logs of the requests that are executed when crawling starts?

marevol commented 6 years ago

https://github.com/codelibs/fess/issues/1073#issuecomment-304397187

sho-suzuki commented 6 years ago

@marevol thanks! I changed the crawler log level from info to debug. fess-crawler.log is as follows:

2018-02-15 14:15:37,744 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Accessing http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
2018-02-15 14:15:37,745 [5DFNjmEBO7Desvq7XhyO-1] DEBUG CookieSpec selected: default
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection request: [route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection leased: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 1 of 20; total allocated: 1 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Opening connection {}->http://gitbucket:8080
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connecting to gitbucket/IP:8080
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG http-outgoing-1: Shutdown connection
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection discarded
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection released: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Cancelling request execution
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
org.codelibs.fess.crawler.exception.CrawlingAccessException: Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
        at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:820) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.doHttpMethod(HcHttpClient.java:623) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:582) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
        at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_161]
        at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_161]
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.4.jar:4.5.4]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.executeHttpClient(HcHttpClient.java:852) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:660) ~[fess-crawler-2.0.1.jar:?]
        ... 13 more
...
2018-02-15 14:15:42,103 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2018-02-15 14:15:42,105 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS

From this log, the connection appears to fail because of a connection timeout or a refused connection. I also changed GitBucket's logback-setting.xml as follows, but no application log was found.

<configuration debug="true" scan="true" scanPeriod="60 seconds"> 
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- encoders are  by default assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder -->

        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>INFO</level>
        </filter>
        <encoder>
            <pattern> %date %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <!-- encoders are  by default assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- rollover daily and compress-->
            <fileNamePattern>/gitbucket/log/gitbucket-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <!-- compressed logs are remains 30 days and then deleted -->
            <maxHistory>30</maxHistory>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>25MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>

        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>INFO</level>
        </filter>
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="DEBUG">
        <appender-ref ref="STDOUT"/>
        <appender-ref ref="ROLLING"/>
    </root>
</configuration>

any ideas?

marevol commented 6 years ago

Did you configure proxy settings? See https://github.com/codelibs/fess/issues/1066

sho-suzuki commented 6 years ago

@marevol Yes. I configured the proxy settings in fess_config.properties:

http.proxy.host=proxy_IP
http.proxy.port=proxy_port
http.proxy.username=
http.proxy.password=
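The debug log earlier in this thread prints the connection route as `[route: {}->http://gitbucket:8080]`. Assuming that Apache HttpClient lists any proxy hop between `{}` and the target (this is an assumption about its log format, not something confirmed in this thread), a route with a single hop would suggest that the request was sent directly rather than through the proxy configured above. The sketch below extracts the hops from such a line; `route_hops` and `uses_proxy` are hypothetical helper names.

```python
import re

# Matches the "[route: {...}->hop->hop]" fragment of an HttpClient debug line.
ROUTE_RE = re.compile(r"\[route: \{[^}]*\}->([^\]]+)\]")


def route_hops(log_line):
    """Return the hops of the logged route, or [] if the line has no route."""
    match = ROUTE_RE.search(log_line)
    if match is None:
        return []
    return match.group(1).split("->")


def uses_proxy(log_line):
    """True if the logged route has more than one hop (assumed: proxy, then target)."""
    return len(route_hops(log_line)) > 1
```

Applied to the `Connection request` line from the debug log, `route_hops` returns only `http://gitbucket:8080`, so `uses_proxy` is false for it under the stated assumption.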