Open capfei opened 2 years ago
The raw data for https://clearlydefined.io/definitions/pypi/pypi/-/dnspython/1.10.0 shows files as [].
"clearlydefined": {
  "1.3.1": {
    "_metadata": {
    },
    "summaryInfo": {
    },
    "files": [],
    "registryData": {
    ...
The issue for pypi/pypi/-/dnspython/1.10.0 is a bug in the crawler: pypiFetch failed to find a tar.gz file and aborted the file download without reporting an error.
For github.com/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c and git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4, the reason for "no files" is that only the "licensee" tool was run. On the definition page, the "Tools" section shows only licensee and curation. In the raw data section, only the licensee portion of the JSON is available. For the files to be listed properly, the "clearlydefined" tool needs to run and its corresponding JSON result needs to be available.
In my local environment, files are available in both cases after "source" typed harvests (clearlydefined + licensee + scancode) are completed. A harvest was also initiated on the dev server, and the result is available at: https://dev.clearlydefined.io/definitions/git/github/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c/5fae55c1ad15b3cefe6890eba7311af163e9133c. Files are available and displayed once the harvest completes.
These two look like cases of incomplete harvest.
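To tell the two situations apart when looking at raw data, a small check along these lines may help. It is only a sketch: it assumes the raw data is a dict keyed by tool name, then by tool version, as in the excerpt at the top of this issue, and `diagnose_harvest` is a hypothetical helper, not part of the ClearlyDefined service or crawler.

```python
from typing import Any


def diagnose_harvest(raw: dict[str, Any]) -> str:
    """Classify raw harvest data by why its file list might be empty.

    Assumes the layout {tool name: {tool version: result}}, as in the
    excerpt quoted in this issue. Hypothetical helper for illustration.
    """
    cd_versions = raw.get("clearlydefined")
    if not cd_versions:
        # Only other tools (e.g. licensee) produced output, so the
        # harvest is incomplete and no file list can be shown.
        return "incomplete: clearlydefined tool output missing"
    # Any tool version with a non-empty "files" list counts as having files.
    if any(result.get("files") for result in cd_versions.values()):
        return "ok"
    # The tool ran but recorded nothing, as with dnspython 1.10.0.
    return "empty: clearlydefined ran but recorded no files"
```

An "incomplete" result corresponds to the licensee-only cases above, where a re-harvest is the fix. An "empty" result corresponds to the dnspython case, where the download itself failed.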
definitions/pypi/pypi/-/dnspython/1.10.0: there is no download URL in the PyPI registry for dnspython 1.10.0, so the download failed. See the commit message in https://github.com/clearlydefined/crawler/pull/470
For the remaining partial harvest cases, a re-harvest needs to be triggered to resolve them:
- https://clearlydefined.io/definitions/git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4: files now available.
- Retriggered harvest for git/github/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c/5fae55c1ad15b3cefe6890eba7311af163e9133c
@qtomlinson Thanks for looking into this.
The other two components both look to have been harvested successfully.
There's still an issue with the one below. I can confirm there is no download package on PyPI for this component.
https://clearlydefined.io/definitions/pypi/pypi/-/dnspython/1.10.0
Question: Is CD supposed to be showing "harvested" if the system can't find the package like in this example?
@bduranc Those harvest requests will be marked missing in the crawler (See commit message at https://github.com/clearlydefined/crawler/pull/470) and will not be marked as successful in the future.
Thanks @qtomlinson. This is a fairly important issue, since it involved scans that were "harvested" but had no files to scan (or just a LICENSE file, in a few other examples I had observed previously but have since re-harvested). But it sounds like there is a solution in place to address at least the cases like dnspython, where the package download/source cannot be found.
For the other two, where package/source is indeed available, is the best solution just to reharvest them when encountered or is there something else we can do?
@bduranc If the clearlydefined tool has not been completed, then re-harvesting the package is the best solution.
@qtomlinson should I go ahead and create a separate issue for this then?
@bduranc Typically, clearlydefined, reuse, licensee and scancode tools are dispatched for source components. It is possible that all four tools were dispatched, but only one tool was processed and the other three runs were somehow not successful. Retriggering harvests can verify whether a potential issue exists. The re-harvested data are now available and seem ok.
Alternatively, a user has the option to run a harvest with a specific tool (e.g. licensee or scancode) via the REST API. In that case, only the result for the user-specified tool is available (as expected). The two components listed here might have been cases of harvesting with a specific tool (licensee). To get the complete definition, retriggering the harvest with all tools is the solution for that scenario.
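For anyone who wants to retrigger harvests for several components at once, the request body could be put together roughly like this. This is a sketch under assumptions: the entry shape (`tool` plus `coordinates`) and the use of "source" to dispatch the full tool set are my reading of this thread and of the public harvest API, so please check the API docs before relying on it. `build_harvest_request` is a hypothetical helper, and it only builds the JSON; it does not send anything.

```python
import json


def build_harvest_request(coordinates: list[str], tool: str = "source") -> str:
    """Build a JSON body asking for a harvest of each coordinate.

    Assumption: each entry is {"tool": ..., "coordinates": ...}, and a
    type such as "source" dispatches all tools rather than a single one.
    """
    entries = [{"tool": tool, "coordinates": c} for c in coordinates]
    return json.dumps(entries)


# Body for re-harvesting one of the components discussed in this issue.
body = build_harvest_request(
    ["git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4"]
)
```

Passing a single tool name such as "licensee" instead of "source" would reproduce the partial result described above, which is why the default here is the full type.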
I'm seeing this frequently where a component is harvested but no files are shown.
Some recent examples: