eclipse-researchlabs / scava


Delta creation getting out of hand on large projects #44

Open davidediruscio opened 4 years ago

davidediruscio commented 4 years ago

I recently pushed a commit (d3f855e) that details how much time it takes to compute the deltas for VCS / BTS / communication channels. Here's the result for https://github.com/elastic/elasticsearch (one of the projects used in Bitergia's use case):

INFO  [ProjectDelta (elasticsearch,20190101)] (14:47:12): Created Delta (vcs:4766ms, communications:1ms, bugs:1796820ms)

This is ~30min for the BTS. The delta creation seems to iterate over a lot of issues:

AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/36265/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/36263/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/29963/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/16654/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/36251/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/29970/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/27312/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/36258/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/36256/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/30957/comments?per_page=100&page=1 )
AbstractInterceptor.intercept( https://api.github.com/repos/elastic/elasticsearch/issues/17997/comments?per_page=100&page=1 )

I understand that the poor performance is probably due to the large amount of data in this project plus the rate limits on GitHub's APIs; but would there be any way to improve this?
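
As a side note, one quick way to check whether rate limiting is actually the bottleneck is GitHub's /rate_limit endpoint, which reports the remaining quota without consuming it (authenticated clients get 5,000 requests/hour). A minimal sketch in plain Java, independent of Restmule; the GITHUB_TOKEN environment variable is an assumption:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RateLimitCheck {
    public static void main(String[] args) throws Exception {
        // /rate_limit reports limit/remaining/reset without counting
        // against the quota itself.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/rate_limit"))
                .header("Authorization", "token " + System.getenv("GITHUB_TOKEN"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}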

Also, is there a way to not print these AbstractInterceptor.intercept() debug messages to the console? Thanks!

davidediruscio commented 4 years ago

Hello @tdegueul,

Concerning the time necessary to process the GitHub BTS, I guess there are some other reasons as well that might affect performance:

Since York's last update of the GitHub RESTMULE client (https://github.com/crossminer/scava/commit/7ad1e4730247a3899f5cb14c33dd3d1dd5a1a340), the number of calls that might be necessary to get the data has increased, because there was a pagination issue (see https://github.com/crossminer/scava/issues/314). The problem might come partly from GitHub itself when large projects are analyzed (it might not correctly return all the pages to analyze?) and partly from the fact that the GitHub RESTMULE client needs to apply some extra pagination strategies to get them all.
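
For reference, GitHub's documented pagination mechanism is the Link response header: each page advertises the URL of the next one until rel="next" disappears. A minimal sketch in plain java.net.http (not the Restmule client), just to illustrate the strategy:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PagedFetch {
    // GitHub's Link header looks like: <https://...&page=2>; rel="next", <...>; rel="last"
    private static final Pattern NEXT = Pattern.compile("<([^>]+)>;\\s*rel=\"next\"");

    public static List<String> fetchAllPages(String firstUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<String> pages = new ArrayList<>();
        String url = firstUrl;
        while (url != null) {
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            pages.add(resp.body());
            // Follow the rel="next" link if present; stop otherwise.
            url = resp.headers().firstValue("Link")
                    .map(h -> {
                        Matcher m = NEXT.matcher(h);
                        return m.find() ? m.group(1) : null;
                    })
                    .orElse(null);
        }
        return pages;
    }
}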

Another reason the BTS processing might be slow is that the GitHub RESTMULE client is too verbose. I don't know whether Java is affected the way, for example, Perl is, where printing too much too fast to the console can slow the process down completely, while keeping the verbosity as low as possible reduces the processing time. Here is the issue I have opened regarding the GitHub RESTMULE client's verbosity: https://github.com/crossminer/scava/issues/310.

It would be interesting to know whether analyses with other BTS clients, e.g. GitLab or BitBucket, are affected as well, and to the same magnitude. In any case, I think it would be productive to talk with @patrickneubauer and @kb634 to get their point of view on this issue.

Similarly, as far as I'm aware, the analysis of large Eclipse Forums can be slow due to limitations in the API.

davidediruscio commented 4 years ago

Thanks for the insights.

I do not think verbosity is at fault here: each API call takes long enough that the logs are printed rather slowly, which shouldn't impact performance at all. I do not think it is necessary to print a log message for every step, though: a single message stating that the platform is currently retrieving issues from GitHub should be enough.
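
As an illustration of that idea (the class and method names here are hypothetical, not the actual Restmule interceptor): announce the batch once at INFO and demote the per-request line to DEBUG, so a standard logging configuration can silence it.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch, not the actual Restmule code.
class QuietInterceptor {
    private static final Logger LOG = LoggerFactory.getLogger(QuietInterceptor.class);
    private boolean announced = false;

    void intercept(String requestUrl) {
        if (!announced) {
            LOG.info("Retrieving issues from GitHub...");
            announced = true;
        }
        // Per-request detail stays available at DEBUG for troubleshooting.
        LOG.debug("intercept({})", requestUrl);
    }
}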

We may have a talk with York about that, but I wouldn't be surprised if the relative slowness of the process cannot be avoided on large projects. Then we'll just have to deal with it ;)

davidediruscio commented 4 years ago

I wonder if so many requests are really needed to crawl the issues of a project like https://github.com/docdoku/docdoku-plm/. Over the analysis time span for docdoku, 2018-01-01 to 2019-09-13, these are the issues that have been "updated":

https://github.com/docdoku/docdoku-plm/issues?utf8=%E2%9C%93&q=is%3Aissue+updated%3A%3E2018-01-01

That is a total of 265 issues.

When I look at the MP logs and run docker-compose logs oss-app | grep 'https://api.github.com/repos/docdoku/docdoku-plm/issues/' | sort -nr | uniq -c | sort -nr | col -b, it produces this file:

api.github.hits.docdokuplm.txt

That is a total of 23166 requests matching the URL pattern above (roughly 87 requests per issue on average), and you can see that some issues take ~250 API requests each.

That sounds like a lot of requests to me.

davidediruscio commented 4 years ago

@mhow2 thanks for this info, we will look into it and see if any unnecessary calls are being made.

davidediruscio commented 4 years ago

A quick test using Restmule for that project makes 3 calls to get the 271 issues (as expected, 100 per page) and then a further 271 calls to get the comments of each issue (as no issue has over 100 comments), i.e. 274 requests in total. The entire process finishes in less than a minute.

The following code is used for this baseline test:

private void search() {
    try {
        // List the repo's issues since 2018-01-01; Restmule pages through
        // the results internally (here: 3 calls at 100 issues per page).
        IDataSet<Issues> ret = GitHubUtils.getOAuthClient().getReposIssues("docdoku", "docdoku-plm", "all", "all",
                "", "created", "asc", "2018-01-01");
        List<Issues> repoIssues = ret.observe().toList().blockingGet();
        System.out.println(repoIssues.size());
        // One further call per issue to retrieve its comments.
        int comments = 0;
        for (Issues i : repoIssues) {
            IDataSet<IssuesComments> ic = GitHubUtils.getOAuthClient().getReposIssuesComments("docdoku",
                    "docdoku-plm", i.getNumber());
            List<IssuesComments> repoComments = ic.observe().toList().blockingGet();
            comments = comments + repoComments.size();
        }
        System.out.println(comments);
        System.exit(0);
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
}

@mhow2 @creat89 @tdegueul assuming this test is correct, it would mean that the way the Restmule code is used in the Scava platform is at fault. I am happy to have a call with whoever may know more about this, to help with updating it. Cheers.

davidediruscio commented 4 years ago

Yes, I have tested with a small project where you can manually count the number of calls, and the GitHub reader works fine, in the sense that it makes exactly the number of calls it is expected to make given the limitations of the GitHub API. Related to this, and valid AFAIK for GitHub and possibly for Eclipse Forums: the larger the project and the older the delta, the more calls will be necessary to find the data, sometimes growing in an explosive way.
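
To make that growth concrete, here is a back-of-the-envelope lower bound following the baseline test above (the helper is mine, purely illustrative, not Scava code):

// Hypothetical helper: lower bound on GitHub calls for one delta, assuming
// one "list issues" call per 100 issues plus one "comments" call per issue,
// as in the baseline test above.
static long estimateCalls(long issuesInDelta) {
    long listCalls = (issuesInDelta + 99) / 100;  // ceil(n / 100)
    return listCalls + issuesInDelta;
}

estimateCalls(271) gives 274, matching the baseline test; the ~23166 calls observed for docdoku-plm are roughly 85 times this lower bound.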

As I discussed with @mhow2, there could be techniques to keep a cache-like solution, where fetching some days of data in advance could reduce the number of calls for large projects. But this could also have a RAM footprint for storing that data, especially if the project is large. How big the footprint would be, I can't tell; we would need to do some testing.
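
As a rough illustration of that trade-off (entirely hypothetical, not Scava or Restmule code), a bounded LRU cache of raw responses keyed by request URL is one way to keep the footprint predictable:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: cache raw API responses by URL, evicting the least
// recently used entry once maxEntries is reached, so the RAM cost is capped.
class ResponseCache extends LinkedHashMap<String, String> {
    private final int maxEntries;

    ResponseCache(int maxEntries) {
        super(16, 0.75f, true); // access-order iteration gives LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
    }
}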

davidediruscio commented 4 years ago

@creat89 @mhow2 Indeed caching may help if the server can support the extra resources needed. In this regard note that both Restmule and Crossflow support caching, the former for identical HTTP requests and the latter for data processed/output by Tasks. If either of these may end up being useful and you are unsure how to enable them, let us know.

davidediruscio commented 4 years ago

Analyzing NNTP channels will also take a long time if the channel is old and we are analyzing a recent date (i.e. if the gap between the first message and the delta date is large). With commit https://github.com/crossminer/scava/commit/e7477e25ce2cddccbbf9305a0817e5d725a2eeac I have improved the reader, and now only the first delta should take a large amount of time to analyze. After that, it will store (this time for sure) the last article analyzed and continue from there.
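
The resume strategy described above boils down to persisting a high-water mark. A minimal sketch of the idea (names are hypothetical, not the actual reader code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: persist the number of the last article analyzed so
// that only the first delta pays the full catch-up cost.
class ArticleCursor {
    private final Path store;

    ArticleCursor(Path store) {
        this.store = store;
    }

    long lastAnalyzed() throws IOException {
        // 0 means "never analyzed": start from the beginning of the channel.
        return Files.exists(store) ? Long.parseLong(Files.readString(store).trim()) : 0L;
    }

    void advanceTo(long articleNumber) throws IOException {
        Files.writeString(store, Long.toString(articleNumber));
    }
}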

Similarly, analyzing Eclipse Forums will take a while, not only because of the limit on the number of calls, but also because getting data from the API requires multiple calls.