clearlydefined / crawler

A service that crawls projects and packages for information relevant to ClearlyDefined
MIT License

Harvesting not working #561

Open glogowski-wojciech-MSFT opened 2 months ago

glogowski-wojciech-MSFT commented 2 months ago

I submitted harvesting requests for the following Python packages using the clearlydefined website, first on 2024-02-28 in a single query, and then on 2024-02-29, each one in a separate query, in a separate browser tab. As of today, 2024-03-04, none of these packages were harvested:

package name 2024-02-28 query 2024-02-29 query
pypi/pypi/-/nvidia-cublas-cu12/12.1.3.1 1 1
pypi/pypi/-/nvidia-cuda-cupti-cu12/12.1.105 1 1
pypi/pypi/-/nvidia-cuda-nvrtc-cu12/12.1.105 1
pypi/pypi/-/nvidia-cuda-runtime-cu12/12.1.105 1 1
pypi/pypi/-/nvidia-cudnn-cu12/8.9.2.26 1 1
pypi/pypi/-/nvidia-cufft-cu12/11.0.2.54 1 1
pypi/pypi/-/nvidia-curand-cu12/10.3.2.106 1
pypi/pypi/-/nvidia-cusolver-cu12/11.4.5.107 1 1
pypi/pypi/-/nvidia-cusparse-cu12/12.1.0.106 1 1
pypi/pypi/-/nvidia-nccl-cu12/2.19.3 1 1
pypi/pypi/-/nvidia-nvjitlink-cu12/12.3.101 1 1
pypi/pypi/-/nvidia-nvtx-cu12/12.1.105 1 1
pypi/pypi/-/onnxruntime/1.17.1 1
pypi/pypi/-/tensorboard-data-server/0.7.2 1
pypi/pypi/-/tensorboard/2.16.2 1
pypi/pypi/-/thop/0.1.1.post2207130030 1 1
pypi/pypi/-/torch/2.2.1 1 1
pypi/pypi/-/torchvision/0.17.1 1 1
pypi/pypi/-/triton/2.2.0 1

Harvesting either does not work reliably or takes a very long time (5 days and counting). Either way, I believe this requires a fix or at least extra documentation. I would also appreciate help with harvesting these specific packages.

glogowski-wojciech-MSFT commented 2 months ago

As of today, 2024-03-11, none of these packages has been harvested. I requested harvesting again programmatically on 2024-03-06 and received HTTP 201 responses. Given that it has been 12 days since the original harvesting requests, I am changing the issue title from "Harvesting not working or taking very long" to "Harvesting not working".
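For readers who want to reproduce the programmatic request, here is a minimal sketch. It assumes the public ClearlyDefined API accepts harvest requests as a POST to `https://api.clearlydefined.io/harvest` with a JSON list of `tool`/`coordinates` entries; the URL, the payload shape, and the helper names `build_harvest_payload` and `request_harvest` are assumptions for illustration and are not taken from this thread.

```python
import json
import urllib.request

# Assumption: the public ClearlyDefined API accepts harvest requests at this URL.
HARVEST_URL = "https://api.clearlydefined.io/harvest"


def build_harvest_payload(coordinates):
    """Return the JSON-serializable request body for a list of coordinates."""
    # Assumption: each entry names a tool ("package") and one coordinate string.
    return [{"tool": "package", "coordinates": c} for c in coordinates]


def request_harvest(coordinates, url=HARVEST_URL):
    """POST the harvest request and return the HTTP status code."""
    body = json.dumps(build_harvest_payload(coordinates)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Example (performs a network call, so it is left commented out):
# status = request_harvest(["pypi/pypi/-/torch/2.2.1"])
```

As the comment above shows, a 201 response on its own does not mean the package was harvested, so the definition still has to be checked afterwards.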

qtomlinson commented 1 month ago

@glogowski-wojciech-MSFT Thanks for reporting the issue! In ClearlyDefined, we typically download source distributions (*.tar.gz or *.zip) for Python packages. However, when I checked the first three packages on PyPI, I found that they do not have source distributions available. You can find the package information here:
https://pypi.org/project/nvidia-cublas-cu12/12.1.3.1/#files
https://pypi.org/project/nvidia-cuda-cupti-cu12/12.1.105/#files
https://pypi.org/project/nvidia-cuda-runtime-cu12/12.1.105/#files

The absence of source distributions may be the reason why the harvesting process failed for the listed packages.
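The same check can be scripted instead of opening each PyPI page by hand. The sketch below uses the PyPI JSON API (`https://pypi.org/pypi/<name>/<version>/json`), whose `urls` list carries a `packagetype` field for each uploaded file. The helper names `pypi_json_url`, `has_sdist`, and `check` are hypothetical, and this is not how the crawler itself decides; it is only a quick way to see which coordinates lack a source distribution.

```python
import json
import urllib.request


def pypi_json_url(coordinates):
    """Map 'pypi/pypi/-/<name>/<version>' to the PyPI JSON API URL."""
    type_, provider, namespace, name, version = coordinates.split("/")
    if type_ != "pypi":
        raise ValueError(f"not a PyPI coordinate: {coordinates}")
    return f"https://pypi.org/pypi/{name}/{version}/json"


def has_sdist(release):
    """Return True if the release metadata lists at least one sdist file."""
    return any(f.get("packagetype") == "sdist" for f in release.get("urls", []))


def check(coordinates):
    """Fetch the release metadata from PyPI and report whether an sdist exists."""
    with urllib.request.urlopen(pypi_json_url(coordinates)) as resp:
        return has_sdist(json.load(resp))


# Example (performs a network call, so it is left commented out):
# print(check("pypi/pypi/-/nvidia-cublas-cu12/12.1.3.1"))
```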

qtomlinson commented 2 weeks ago

During the harvesting process, we download a source distribution from PyPI to perform further analysis, such as running the licensee, reuse, and ScanCode tools. If a source package is not available, the package is currently marked as missing. This behavior was introduced in this PR to address this issue.

When a package is marked as missing during the harvest, there is no information stored regarding the downloaded registry information for that PyPI package. In addition, curation can only be created through a pull request against https://github.com/clearlydefined/curated-data rather than through the user interface.

Due to recent questions about harvesting PyPI packages without source distributions, it may be worthwhile to discuss the matter further on the original issue. Should we allow the harvest to succeed even if the source PyPI package cannot be downloaded? Could it be considered the intended behavior for those PyPI packages where no files are displayed on the components details page due to the unavailability of the source package?

@capfei @bduranc @jeffwilcox @elrayle Any thoughts?

elrayle commented 2 weeks ago

@jeffrey-luszcz ☝ See comment responding to issue raised in the community meeting today.

bduranc commented 1 week ago

> During the harvesting process, we download a source distribution from PyPI to perform further analysis, such as running the licensee, reuse, and ScanCode tools. If a source package is not available, the package is currently marked as missing. This behavior was introduced in this PR to address this issue.
>
> When a package is marked as missing during the harvest, there is no information stored regarding the downloaded registry information for that PyPI package. In addition, curation can only be created through a pull request against https://github.com/clearlydefined/curated-data rather than through the user interface.
>
> Due to recent questions about harvesting PyPI packages without source distributions, it may be worthwhile to discuss the matter further on the original issue. Should we allow the harvest to succeed even if the source PyPI package cannot be downloaded? Could it be considered the intended behavior for those PyPI packages where no files are displayed on the components details page due to the unavailability of the source package?
>
> @capfei @bduranc @jeffwilcox @elrayle Any thoughts?

Can I assume, in this context, that the "normal" package files (i.e. binary/deployable code) are still being retrieved and scanned?
Or is this what you are referring to by "source distributions"? I ask because for other types such as Maven and Debian, we harvest the source separately from the binary artifact, as its own (source archive) definition, and neither holds the other up.

In either case, I think what's important is that we have a clear way of telling end users why they can't see the files. And if it's due to a tool error (as was discussed in https://github.com/clearlydefined/website/issues/964), then I consider that different from the files simply "not being available". We of course cannot consider the harvest "succeeded" in such cases.