clearlydefined / website

Website for clearlydefined.io
https://clearlydefined.io
MIT License

No files showing on components details page #964

Open capfei opened 2 years ago

capfei commented 2 years ago

I'm seeing this frequently where a component is harvested but no files are shown.

Some recent examples:

[image]

qtomlinson commented 2 years ago

The raw data for https://clearlydefined.io/definitions/pypi/pypi/-/dnspython/1.10.0 shows files as [].

  "clearlydefined": {
    "1.3.1": {
      "_metadata": {
      },
      "summaryInfo": {
      },
      "files": [],
      "registryData": {
      ...

The issue for pypi/pypi/-/dnspython/1.10.0 is a bug in the crawler: pypiFetch failed to find a tar.gz file and aborted the file download without reporting an error.
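
As a rough illustration of the failure mode described above, here is a minimal Python sketch of picking a download URL from registry data and failing loudly when nothing is listed. It is not the crawler's pypiFetch code (the crawler is a separate JavaScript project); `pick_download_url` and `MissingDownloadError` are hypothetical names, and the `releases` / `packagetype` / `url` layout is assumed to follow the PyPI JSON API.

```python
class MissingDownloadError(Exception):
    """Raised when the registry lists no downloadable file for a release."""


def pick_download_url(registry_data, version):
    """Return a download URL for `version`, preferring the source distribution.

    Assumes `registry_data` looks like the PyPI JSON API response:
    {"releases": {"<version>": [{"packagetype": "sdist", "url": "..."}, ...]}}
    """
    files = registry_data.get("releases", {}).get(version) or []
    if not files:
        # Report the problem instead of silently producing an empty harvest.
        raise MissingDownloadError(f"no downloadable files listed for {version}")
    for entry in files:
        if entry.get("packagetype") == "sdist":
            return entry["url"]
    # No sdist listed: fall back to whatever file is listed first.
    return files[0]["url"]
```

With an empty file list for a version, as appears to be the case for dnspython 1.10.0, this raises an error rather than returning nothing.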

qtomlinson commented 2 years ago

For github.com/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c and git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4, the reason for "no files" is that only the "licensee" tool was run. On the definition page, the "Tools" section shows only licensee and curation, and the raw data section contains only the licensee portion of the JSON. For the files to be listed properly, the "clearlydefined" tool needs to run and its JSON result must be available.

In my local environment, files are available in both cases after "source" typed harvests (clearlydefined + licensee + scancode) complete. A harvest was also initiated on the dev server, and the result is available at: https://dev.clearlydefined.io/definitions/git/github/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c/5fae55c1ad15b3cefe6890eba7311af163e9133c. Files are available and displayed once the harvest completes.

These two look like cases of incomplete harvest.

qtomlinson commented 1 year ago

definitions/pypi/pypi/-/dnspython/1.10.0: there is no download URL in the PyPI registry for dnspython 1.10.0, so the download failed. See the commit message in https://github.com/clearlydefined/crawler/pull/470.

For the remaining partial harvest cases, a re-harvest needs to be triggered to resolve them:
- https://clearlydefined.io/definitions/git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4: files now available.
- Retriggered harvest for git/github/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c/5fae55c1ad15b3cefe6890eba7311af163e9133c.

bduranc commented 1 year ago

@qtomlinson Thanks for looking into this.

https://clearlydefined.io/definitions/git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4

and

https://clearlydefined.io/definitions/git/github/linux-audit/audit-userspace/5fae55c1ad15b3cefe6890eba7311af163e9133c/5fae55c1ad15b3cefe6890eba7311af163e9133c

both look to have successfully harvested.

There's still an issue with the one below. I can confirm there is no download package on PyPI for this component.

https://clearlydefined.io/definitions/pypi/pypi/-/dnspython/1.10.0

Question: Is CD supposed to be showing "harvested" if the system can't find the package like in this example?

qtomlinson commented 1 year ago

@bduranc Those harvest requests will be marked missing in the crawler (See commit message at https://github.com/clearlydefined/crawler/pull/470) and will not be marked as successful in the future.

bduranc commented 1 year ago

Thanks @qtomlinson. This is a fairly important issue since it involved scans that were "harvested" but had no files to scan (or just a LICENSE file, in a few other examples I had observed previously but have since re-harvested). But it sounds like there is a solution in place to address at least the cases like dnspython, where the package download/source cannot be found.

For the other two, where package/source is indeed available, is the best solution just to reharvest them when encountered or is there something else we can do?

qtomlinson commented 1 year ago

@bduranc If clearlydefined tool has not been completed, then re-harvesting the package is the best solution.

bduranc commented 1 year ago

> @bduranc If clearlydefined tool has not been completed, then re-harvesting the package is the best solution.

@qtomlinson should I go ahead and create a separate issue for this then?

qtomlinson commented 1 year ago

@bduranc Typically, clearlydefined, reuse, licensee and scancode tools are dispatched for source components. It is possible that all four tools were dispatched, but only one tool was processed and the other three runs were somehow not successful. Retriggering harvests can verify whether a potential issue exists. The re-harvested data are now available and seem ok.

Alternatively, a user has the option to run a harvest with a specific tool (e.g. licensee or scancode) via the REST API. In that case, only the result for the user-specified tool is available (as expected). The two components listed here might have been harvested with a specific tool (licensee). To get the complete definition in that scenario, the solution is to retrigger the harvest with all tools.
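
To make the two scenarios concrete, here is a minimal Python sketch that only builds the JSON body for a harvest request and sends nothing. The body shape (a list of objects with `tool` and `coordinates`) and the use of "source" as the value that dispatches all source tools are assumptions based on this thread and should be checked against the ClearlyDefined API documentation; `harvest_request` is a hypothetical helper.

```python
import json


def harvest_request(coordinates, tool="source"):
    """Build the JSON body for a harvest request (assumed shape, see above).

    `tool` is either a harvest type such as "source" or one specific tool
    such as "licensee" or "scancode".
    """
    return json.dumps([{"tool": tool, "coordinates": coordinates}])


coords = "git/github/golang/crypto/c084706c2272f3d44b722e988e70d4a58e60e7f4"

# Request for a single tool: only that tool's result would be produced.
single_tool_body = harvest_request(coords, tool="licensee")

# Request for the full set of source tools (assumed to be selected by "source").
all_tools_body = harvest_request(coords)
```

The resulting string would be sent as the body of a POST to the service's harvest endpoint with a JSON content type.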