fossology / fossology

FOSSology is an open source license compliance software system and toolkit. As a toolkit you can run license, copyright and export control scans from the command line. As a system, a database and web ui are provided to give you a compliance workflow. License, copyright and export scanners are tools used in the workflow.
https://fossology.github.io/
GNU General Public License v2.0

Jobs processed slowly on continuously receiving the "Software Heritage X-RateLimit-Limit reached" message #1983

Closed todd101 closed 3 years ago

todd101 commented 3 years ago

Description

Hi,

I am trying to analyze the OSS compliance of a software package using FOSSology, built from the GitHub source on Ubuntu 18.04, with the following steps:
  1. Compress the software package into a *.tar.gz file and upload it.
  2. Check the status in the "Job" menu.

    Everything went well when I uploaded a small tarball of 8 files. However, the job was stuck on the Softwareheritage step for more than 2 hours when the tarball contained more than 200 files. Please check the picture in the Screenshots area.

    According to the logs under Geeky Scan Details, the job continuously received "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in xx:xx:xx" messages. The "Average items/sec" column showed that the handler processed 0.045432 items/sec. I also tried a bigger tarball containing more than 1000 files, and the job failed in the Softwareheritage step after scanning 374 items.

    What is the Software Heritage X-RateLimit-Limit? Why was the limit reached? Was this issue caused by the CPU's computation power, or does it have something to do with the OS database? How can I "enlarge" the X-RateLimit-Limit to speed up the process?

    Thanks.

        Todd

Screenshots

[Image: job]

Versions

Logs

Job logs

2021-05-07 11:33:58 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:49"
2021-05-07 11:34:16 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:31"
2021-05-07 11:34:48 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:57:59"

todd101 commented 3 years ago

After running for more than 4 hours, the job terminated with a failure.

According to the log under Geeky Scan Details, the failure seems to have been caused by losing the connection to the host. Does this mean the scheduler lost its connection to the FOSSology service? What could be the reason for this failure?

Thanks.

      Todd

2021-05-07 16:59:15 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:36:42"
2021-05-07 16:59:48 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:36:09"
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "Sorry, something went wrong. check if the host is accessible!"
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "GET /api/1/content/sha256:1098cfeae496c63b5b3c51710b77261c2de27a9abf2ff7e786d0ee0e55a10b06/license HTTP/1.1 "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "User-Agent: fossology/3.10.0.4-rc2 "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "Host: archive.softwareheritage.org "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: " "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: agent failed with error code 1
2021-05-07 17:00:54 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: agent failed, code: 0

GMishx commented 3 years ago

Hello @todd101 ,

Software Heritage applies rate limiting to its REST API. Please see their website.

So the log message INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:49 means that the FOSSology server has hit the rate limit and will wait for the next 59 minutes (depending on the X-RateLimit-Reset value sent) before sending the next request. Unfortunately, their existing endpoints do not allow us to query in batches, so FOSSology has to ping them for every single file.
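The waiting behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the agent's actual PHP code. It assumes that X-RateLimit-Reset carries a Unix timestamp, which the value quoted elsewhere in this thread suggests, and the function names seconds_until_reset and format_hms are hypothetical.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many whole seconds to wait before the next request.

    `headers` is a dict of response headers. Assumption: X-RateLimit-Reset
    is a Unix timestamp marking when the quota is refilled.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        # Quota is not exhausted yet, so no waiting is needed.
        return 0
    reset_at = int(headers["X-RateLimit-Reset"])
    return max(0, int(reset_at - now))


def format_hms(seconds):
    """Format a number of seconds as HH:MM:SS, like the log message does."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

With an exhausted quota and a reset time 3529 seconds in the future, format_hms(seconds_until_reset(...)) gives the same kind of "Next slot unlocks in" value that appears in the job log.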

However, you can ask them to raise the limit for you and set the authentication token from "Admin > Customize > Auth token".

The other log messages mean that FOSSology is not able to reach archive.softwareheritage.org for the URL shown in the logs:

Sorry, something went wrong. check if the host is accessible!
GET /api/1/content/sha256:1098cfeae496c63b5b3c51710b77261c2de27a9abf2ff7e786d0ee0e55a10b06/license HTTP/1.1
Obilivion commented 3 years ago

Interesting... Just yesterday I had a call with Roberto from Software Heritage. He told me that I should contact the FOSSology team and ask you to use the bulk API instead of single-file requests. They would like to reduce traffic by using bulk requests.

Maybe you can contact them if it does not work for you?

todd101 commented 3 years ago

Hello GMishx,

Thanks for pointing out that this issue is caused by reaching the rate limit for REST API requests sent to the Software Heritage server. I think that's why the Software Heritage analysis works for smaller tarballs. I will see whether I can get a higher rate limit by setting the authentication token.

Thanks.

  Todd
todd101 commented 3 years ago

Hi Obilivion,

Thanks for sharing the BULK API information.

As you mentioned, it's reasonable to reduce the traffic by using the bulk API instead of sending a REST API request for each single file.

According to the release notes, the Software Heritage Analysis agent was added in 3.8.0 release candidate 1.

In the 3.8.0 release, I saw the note on Software Heritage rate limiting that GMishx shared with me in his comment. Olasd from Software Heritage also mentioned this issue in https://github.com/fossology/fossology/issues/1836. It seems that a later release added code to the function processEachPfileForSWH() that makes the SWH agent wait until the reset time if X-RateLimit-Limit is reached.

Regarding the bulk API, I couldn't find any item in the UI to configure the SWH agent to use the bulk API when sending requests to the back end. Checking the source code, I found that the function processEachPfileForSWH() is defined in src/softwareHeritage/agent/softwareHeritageAgent.php. I am wondering whether the SWH agent in the current release only supports single-file requests.

Do you know how many files can be queried in one bulk API request? If we use the following command to check the rate limit of the back end,

curl -si https://archive.softwareheritage.org/api/1/stat/counters/ | grep -i '^X-RateLimit'

the response is

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1620480485

It seems that the number of files in one job should not exceed (120 x files_per_bulk_api) when using the bulk API.
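The arithmetic above can be sketched in Python. This is an illustration only: it parses the three header values quoted in this comment, treats X-RateLimit-Reset as a Unix timestamp (an assumption), and uses a hypothetical files_per_request parameter for the still-unknown bulk size.

```python
from datetime import datetime, timezone

# Header values copied from the response quoted above.
RESPONSE = """\
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1620480485
"""


def parse_rate_limit(text):
    """Collect the X-RateLimit-* headers into a dict of lowercase name -> int."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key.startswith("x-ratelimit-"):
            values[key] = int(value.strip())
    return values


def max_files_per_window(limit, files_per_request):
    """Upper bound on files covered in one rate-limit window.

    `files_per_request` is hypothetical: 1 for single-file requests,
    larger if a bulk endpoint accepts several files per request.
    """
    return limit * files_per_request


info = parse_rate_limit(RESPONSE)
# Assumption: the reset value is seconds since the Unix epoch (UTC).
reset_time = datetime.fromtimestamp(info["x-ratelimit-reset"], tz=timezone.utc)
print(info, reset_time.isoformat())
```

With single-file requests (files_per_request = 1) the bound is simply the limit itself, which is consistent with larger uploads stalling while small ones finish.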

Thanks.

       Todd
zacchiro commented 3 years ago

Heya, Software Heritage co-founder here. I think there are two intertwined matters here. Let me see if I can clarify them and then we can discuss what's the best way forward.

1) rate-limit: yes, there is a rate limit quota. By default it is quite low, as observed. Registered users get 10x the default quota (since last week). And people can get in touch with us to request rate limit lifting for specific users via our contact form. To benefit from rate-limit lifting one needs to query our API via an authentication token. It is my understanding that FOSSology allows specifying such a token. Is that correct? If so, nothing needs changing on this front.

2) bulk v. non-bulk API. It seems to me that FOSSology could benefit from two different pieces of information that the Software Heritage archive can provide: (a) whether a file is known to the archive or not, (b) what its detected license is. (Note that (a) alone is already quite important because if a file that FOSSology users expect to be private is found in Software Heritage, it wasn't that private after all.) For (a) we have a bulk API endpoint called /known, which allows querying up to 1'000 files (by their SWHID identifiers) while consuming a single unit of rate-limit quota. For (b) we currently do not have a bulk API, so licenses have to be queried one by one. But note that currently we only detect licenses using the FOSSology Nomos license scanner, so I'm not sure if there is much value in FOSSology querying that information in the first place. (This can change in the future, when we add other license scanners though.)

Based on this, I think the Software Heritage agent in FOSSology should be changed to benefit only from (a) (known/unknown information) and made to use the bulk API endpoint /known. We can certainly consider adding a bulk API endpoint for (b), if it's needed and useful, but I maintain that right now it won't contribute much license information beyond what FOSSology can already detect itself.
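To illustrate what batching for /known could involve, here is a rough Python sketch. It is not taken from FOSSology or Software Heritage code. It assumes that a content SWHID has the form swh:1:cnt:<hash>, where the hash is computed the same way Git hashes a blob; the helper names content_swhid and batches are hypothetical, and no network request is made.

```python
import hashlib

# Limit quoted in this thread for the /known endpoint.
MAX_SWHIDS_PER_REQUEST = 1000


def content_swhid(data: bytes) -> str:
    """Compute a content SWHID, assuming Git-style blob hashing.

    The digest is SHA-1 over the header "blob <length>\\0" followed by
    the raw file bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def batches(items, size=MAX_SWHIDS_PER_REQUEST):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example: 2500 in-memory "files" need only three bulk requests.
swhids = [content_swhid(str(i).encode()) for i in range(2500)]
request_bodies = list(batches(swhids))
print(len(request_bodies), [len(body) for body in request_bodies])
```

Under these assumptions, an upload of a few thousand files would consume a handful of quota units instead of one per file.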

What do you think?

todd101 commented 3 years ago

Hello Zacchiro,

Thanks for providing your comments on this issue.

Since I am a newbie to Fossology and Software Heritage, sorry if I ask any stupid questions.

Instead of digging into the rate-limit or bulk API design philosophy, I would like to know what a user can do to leverage the Software Heritage archive to check for OSS compliance. Using the FOSSology toolkit, we can run license, copyright and export control scans on the software. The tool can scan for keywords (Copyright, GPL, ...) in the files and generate an analysis report. The Software Heritage archive collects all publicly available software in source code form together with its development history, and shares it with everyone who needs it.

How does the FOSSology SWH agent work with the back end to scan for OSS compliance? Does the SWH agent issue a REST API request for each file to see whether the file can be found in the database of the Software Heritage archive? Or does the SWH agent send the file to the back end to be scanned for special keywords or tags? What data is delivered in the API response? Does it say whether the file in question is OSS or non-OSS?

You mentioned two different pieces of information that the Software Heritage archive can provide: (a) whether a file is known to the archive or not, (b) what its detected license is. If my understanding is correct, whether a file is known to the archive can be queried using the bulk API, while the detected licenses can only be queried with the single-file API. Is that correct?

Could you please let me know which licenses are scanned for differently by the FOSSology tool and the Software Heritage archive? If a file is known to be OSS compliant by the FOSSology tool, I am wondering whether FOSSology could add it to a "white list" first. After all of the files have been scanned by FOSSology, only those files not in the white list would be sent to Software Heritage for further analysis. This could reduce the traffic sent to the archive.

Thanks.

  Todd
zacchiro commented 3 years ago

Dear @todd101, to be clear, it is called the "Software Heritage agent" within FOSSology, but it has not been developed by us at Software Heritage. Rather, it is my understanding that it has been developed by FOSSology developers in the past in the context of a Google Summer of Code project. I don't know much more than that, so I think FOSSology people will be more qualified to answer your questions about that angle.

I'll be happy to discuss how Software Heritage can help in general with compliance analysis, but this is probably not the right place to do so (we would be hijacking a legitimate bug/issue discussion :-)). Feel free to reach out to me separately about this topic.

todd101 commented 3 years ago

Dear All:

Thanks GMishx, Obilivion, and Zacchiro for your help on this issue.
dineshr93 commented 3 years ago

@todd101 Here is a description of the wiring of the Software Heritage agent, written by its contributor: https://www.sandipbhuyan.com/updates/4

It seems that the sha256 is calculated for every file and only that hash is sent via the REST API to get back the license and origin details. So I believe the actual file is not sent to the SH servers. Hi @sandipbhuyan, please confirm this.

Thanks..
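As a rough illustration of the hash-only lookup described in this comment, the request path seen in the job logs (GET /api/1/content/sha256:<hash>/license) could be built like this in Python. This is a sketch, not the agent's PHP code; the https scheme and the helper name license_url are assumptions.

```python
import hashlib

# Host taken from the job logs in this thread; the scheme is an assumption.
API_BASE = "https://archive.softwareheritage.org"


def license_url(data: bytes) -> str:
    """Build the per-file license lookup URL from the file's SHA-256.

    Only the hex digest ends up in the URL, so the file content itself
    is never part of the request.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"{API_BASE}/api/1/content/sha256:{digest}/license"


print(license_url(b"example file content"))
```

Since each file yields its own URL, every file costs one request, which is why the rate limit is reached quickly on larger uploads.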