todd101 closed this issue 3 years ago
After running for more than 4 hours, the job terminated with a failure.
According to the log generated under Geeky Scan Details, the failure seems to have been caused by losing the connection to the host. Does that mean the scheduler lost its connection to the FOSSology service? What could be the reason for this failure?
Thanks.
Todd
2021-05-07 16:59:15 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:36:42"
2021-05-07 16:59:48 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:36:09"
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "Sorry, something went wrong. check if the host is accessible!"
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "GET /api/1/content/sha256:1098cfeae496c63b5b3c51710b77261c2de27a9abf2ff7e786d0ee0e55a10b06/license HTTP/1.1 "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "User-Agent: fossology/3.10.0.4-rc2 "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: "Host: archive.softwareheritage.org "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: " "
2021-05-07 17:00:53 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: agent failed with error code 1
2021-05-07 17:00:54 softwareHeritage [0] :: JOB[187].softwareHeritage[4842.localhost]: agent failed, code: 0
Hello @todd101 ,
Software Heritage applies rate limiting to its REST API. Please see their website.
So log lines such as INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:49
mean that the FOSSology server has hit the rate limit and will wait for the next 59 minutes (depending on the X-RateLimit-Reset value sent) before sending the next request.
Unfortunately, their existing endpoints do not allow us to query in batches, so FOSSology has to ping them for every single file.
However, you can ask them to raise the limit for you and set the authentication token from "Admin > Customize > Auth token".
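As a minimal sketch of how the "Next slot unlocks in" value relates to X-RateLimit-Reset: assuming the header carries a Unix timestamp in seconds (as the value quoted later in this thread suggests), the wait is simply the difference between that timestamp and the current time. The function names here are hypothetical and are not taken from the FOSSology agent.

```python
def seconds_until_reset(reset_epoch: int, now_epoch: int) -> int:
    """Seconds to wait until the rate-limit window resets (never negative)."""
    return max(0, reset_epoch - now_epoch)


def format_wait(seconds: int) -> str:
    """Format a wait time as HH:MM:SS, like the agent's log message."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Illustrative values only: a reset timestamp 3529 seconds in the future.
reset = 1620480485
now = reset - 3529
print(format_wait(seconds_until_reset(reset, now)))  # 00:58:49
```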
And the other log lines you get mean that FOSSology is not able to reach archive.softwareheritage.org for the URL in the logs.
Sorry, something went wrong. check if the host is accessible!
GET /api/1/content/sha256:1098cfeae496c63b5b3c51710b77261c2de27a9abf2ff7e786d0ee0e55a10b06/license HTTP/1.1
Interesting... Just yesterday I had a call with Roberto from Software Heritage. He told me that I should contact you at FOSSology and ask you to use the bulk API instead of single-file requests. They would like to reduce the traffic by using bulk requests.
Maybe you can contact them if it does not work for you?
Hello GMishx,
Thanks for pointing out that this issue was caused by reaching the rate limit of the Software Heritage REST API. I think that's the reason why the Software Heritage analysis works for smaller uploads. I will see if I can get a higher rate limit by setting the authentication token.
Thanks.
Todd
Hi Obilivion,
Thanks for sharing the BULK API information.
As you mentioned, it's reasonable to reduce the traffic by using the bulk API instead of sending a REST request for each single file.
According to the release notes, the Software Heritage Analysis agent was added in 3.8.0 release candidate 1.
In the 3.8.0 release, I saw the note about Software Heritage rate limiting that GMishx shared with me in his comment. Olasd from Software Heritage also mentioned this issue in the post https://github.com/fossology/fossology/issues/1836. It seems that the later release implements code in the function processEachPfileForSWH() that waits until the reset time when X-RateLimit-Limit is reached for the SWH agent.
Regarding the bulk API, I couldn't find any item in the UI to configure the SWH agent to use the bulk API when sending requests to the back end. By checking the source code, I found that the function processEachPfileForSWH() is defined in src/softwareHeritage/agent/softwareHeritageAgent.php. I am wondering whether the SWH agent in the current release only supports single-file requests.
Do you know how many files can be queried in one bulk API request? If we use the following command to check the rate limit of the back end,
curl -i https://archive.softwareheritage.org/api/1/stat/counters/ | grep ^X-RateLimit
the response is
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1620480485
It seems that the number of files in one job should not exceed (120 x files_per_bulk_api) when using the bulk API.
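For reference, here is a small sketch of how the three headers in that response can be read programmatically. It assumes X-RateLimit-Reset is a Unix timestamp in seconds; the helper name parse_rate_limit is hypothetical, and the sample text is the response quoted above.

```python
from datetime import datetime, timezone


def parse_rate_limit(header_text: str) -> dict:
    """Collect the X-RateLimit-* headers from raw header text as integers."""
    info = {}
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key.startswith("x-ratelimit-"):
            info[key] = int(value.strip())
    return info


sample = """X-RateLimit-Limit: 120
X-RateLimit-Remaining: 119
X-RateLimit-Reset: 1620480485"""

info = parse_rate_limit(sample)
# Interpret the reset value as seconds since the Unix epoch (assumption).
reset_at = datetime.fromtimestamp(info["x-ratelimit-reset"], tz=timezone.utc)
print(info["x-ratelimit-limit"], info["x-ratelimit-remaining"], reset_at.isoformat())
# 120 119 2021-05-08T13:28:05+00:00
```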
Thanks.
Todd
Heya, Software Heritage co-founder here. I think there are two intertwined matters here. Let me see if I can clarify them and then we can discuss what's the best way forward.
1) rate-limit: yes, there is a rate limit quota. By default it is quite low, as observed. Registered users get 10x the default quota (since last week). And people can get in touch with us to request rate limit lifting for specific users via our contact form. To benefit from rate-limit lifting one needs to query our API via an authentication token. It is my understanding that FOSSology allows specifying such a token. Is that correct? If so, nothing needs changing on this front.
2) bulk v. non-bulk API. It seems to me that FOSSology could benefit from two different pieces of information that the Software Heritage archive can provide: (a) whether a file is known to the archive or not, (b) what its detected license is. (Note that (a) alone is already quite important because if a file that FOSSology users expect to be private is found in Software Heritage, it wasn't that private after all.) For (a) we have a bulk API endpoint called /known, which allows querying up to 1'000 files (by their SWHID identifiers) while consuming a single unit of rate-limit quota. For (b) we currently do not have a bulk API, so licenses have to be queried one by one. But note that currently we only detect licenses using the FOSSology Nomos license scanner, so I'm not sure if there is much value in FOSSology querying that information in the first place. (This can change in the future, when we add other license scanners though.)
Based on this, I think the Software Heritage agent in FOSSology should be changed to benefit only from (a) (known/unknown information) and made to use the bulk API endpoint /known. We can certainly consider adding a bulk API endpoint for (b), if it's needed and useful, but I maintain that right now that won't contribute much additional license information beyond what FOSSology can already detect itself.
What do you think?
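To illustrate point (a), here is a rough sketch of how file contents could be turned into SWHIDs and grouped into batches of at most 1'000 for /known. It assumes content SWHIDs have the form swh:1:cnt: followed by the Git-style blob SHA-1 of the file. The helper names are hypothetical, the HTTP request itself is deliberately left out, and none of this is taken from the FOSSology agent.

```python
import hashlib
from typing import Iterator, List

KNOWN_BATCH_SIZE = 1000  # upper bound per /known request, as stated in this thread


def content_swhid(data: bytes) -> str:
    """SWHID of a file's content, assuming the Git blob hashing scheme."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def batches(items: List[str], size: int = KNOWN_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` identifiers."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 2500 distinct dummy contents need three requests instead of 2500.
swhids = [content_swhid(str(i).encode()) for i in range(2500)]
print([len(batch) for batch in batches(swhids)])  # [1000, 1000, 500]
```

With a quota of 120 requests per window, batching like this would cover up to 120 x 1000 = 120000 files per window for the known/unknown check.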
Hello Zacchiro,
Thanks for providing your comments on this issue.
Since I am a newbie to Fossology and Software Heritage, sorry if I ask any stupid questions.
Instead of digging into the rate-limit or bulk API design philosophy, I would like to know what a user can do to leverage the Software Heritage archive to check for OSS compliance. By using the FOSSology toolkit, we can run license, copyright and export control scans on the software. The tool can scan for keywords (Copyright, GPL,......) in the files and generate the analysis report. The Software Heritage archive collects all publicly available software in source code form together with its development history, and shares it with everyone who needs it.
How does the FOSSology SWH agent work with the back end to scan for OSS compliance? Does the SWH agent issue a REST request for each file to see whether that file can be found in the database of the Software Heritage archive? Or does the SWH agent send the file to the back end to be scanned for special keywords or tags? What data is delivered in the API response? Does it say whether the file in question is OSS or non-OSS?
You mentioned two different pieces of information that the Software Heritage archive can provide: (a) whether a file is known to the archive or not, (b) what its detected license is. If my understanding is correct, whether a file is known to the archive can be queried using the bulk API, while the detected licenses can only be queried with the single-file API. Is that correct?
Could you please let me know which licenses are scanned differently by the FOSSology tool and the Software Heritage archive? If a file is known to be OSS compliant according to the FOSSology tool, I am wondering whether FOSSology could add it to a "white list" first. After all of the files have been scanned by the FOSSology tool, only those files not in the white list would be sent to Software Heritage for further analysis. This could possibly reduce the traffic sent to the archive.
Thanks.
Todd
Dear @todd101, to be clear, it is called the "Software Heritage agent" within FOSSology, but it has not been developed by us at Software Heritage. Rather, it is my understanding that it has been developed by FOSSology developers in the past in the context of a Google Summer of Code project. I don't know much more than that, so I think FOSSology people will be more qualified to answer your questions about that angle.
I'll be happy to discuss how Software Heritage can help in general with compliance analysis, but this is probably not the right place to do so (we would be hijacking a legitimate bug/issue discussion :-)). Feel free to reach out to me separately about this topic.
Dear All:
Thanks GMishx, Obilivion, and Zacchiro for your help on this issue.
@todd101 Here is a write-up of how the Software Heritage agent works, by the contributor himself: https://www.sandipbhuyan.com/updates/4
It seems the sha256 is calculated for every file and only that is sent via the REST API to get back the license and origin details. So I believe the actual file is not sent to the SH servers. Hi @sandipbhuyan, please confirm this.
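A minimal sketch of that lookup, based only on the request path visible in the job log above (GET /api/1/content/sha256:<hash>/license): the file is hashed locally and only the digest appears in the URL. The function name is hypothetical and this is not the agent's actual code.

```python
import hashlib


def license_lookup_path(data: bytes) -> str:
    """Build the request path for a file's content, using only its SHA-256."""
    digest = hashlib.sha256(data).hexdigest()
    return f"/api/1/content/sha256:{digest}/license"


# Only this path (not the file itself) would be sent to archive.softwareheritage.org.
print(license_lookup_path(b"example file content"))
```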
Thanks..
Description
Hi,
Check the status in the "Job" menu.
Everything went well when uploading a small tarball composed of 8 files. However, the job was blocked on the Softwareheritage step for more than 2 hours when the tarball contained more than 200 files. Please check the picture in the Screenshots area.
According to the logs generated under Geeky Scan Details, the job continuously received the "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in xx:xx:xx" messages. The "Average items/sec" column showed that the handler processed 0.045432 items/sec. I also tried a bigger tarball containing more than 1000 files, and the job failed in the Softwareheritage step after scanning 374 items.
What is the Software Heritage X-RateLimit-Limit? Why was the limit reached? Was this issue caused by the CPU's computing power? Or did it have something to do with the OS database? How can I "enlarge" the "X-RateLimit-Limit" to speed up the process?
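As a back-of-the-envelope check of what the reported throughput implies, the sketch below estimates how long a job would take if the average of 0.045432 items/sec stayed constant. That is an assumption, since the real rate depends on when the rate-limit window resets, and the function name is hypothetical.

```python
def estimated_seconds(file_count: int, items_per_sec: float) -> float:
    """Estimated job duration in seconds at a constant processing rate."""
    return file_count / items_per_sec


rate = 0.045432  # "Average items/sec" reported for the job
for files in (200, 374, 1000):
    hours = estimated_seconds(files, rate) / 3600
    print(f"{files} files: about {hours:.1f} hours")
```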
Thanks.
Screenshots
Versions
Logs
Job logs
2021-05-07 11:33:58 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:49"
2021-05-07 11:34:16 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:58:31"
2021-05-07 11:34:48 softwareHeritage [0] :: JOB[169].softwareHeritage[23827.localhost]: "INFO :Software Heritage X-RateLimit-Limit reached. Next slot unlocks in 00:57:59"