clearlydefined / crawler

A service that crawls projects and packages for information relevant to ClearlyDefined
MIT License

Inconsistent file count for crawler image based on node:16 vs node:18-bullseye #529

Closed: qtomlinson closed this issue 1 month ago

qtomlinson commented 6 months ago

Expected: The file count for a package should be consistent, regardless of the node image version on which the crawler is based.

Observed: The file count for package pod/cocoapods/-/SoftButton/0.1.0 differs between the result harvested by the node 16 based crawler and the result harvested by the node 18 based crawler.

The difference is due to two files in the .git directory: .git/hooks/pre-merge-commit.sample and .git/hooks/push-to-checkout.sample.

When files are processed during harvest, files in the .git directory are excluded. This is reflected in the clearlydefined.versionx.files section of the harvested data and in the "Files" section of the UI.

Similarly, files under the .git directory should be excluded when calculating the file count for the package.
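A minimal sketch of the proposed behavior, written in Python for illustration only. It is not the crawler's actual code (the crawler runs on node), and the function name `count_files` and the `excluded_dirs` parameter are hypothetical. It counts the files in a checked-out package while skipping everything under .git, so the result no longer depends on which sample hooks the installed git version creates.

```python
import os
import tempfile


def count_files(root, excluded_dirs=(".git",)):
    """Count non-directory entries under root, skipping excluded directories.

    Hypothetical helper for illustration; not part of the crawler.
    """
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        total += len(filenames)
    return total


if __name__ == "__main__":
    # Build a throwaway tree that mimics a cloned package with extra git hooks.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, ".git", "hooks"))
        for name in ("pre-merge-commit.sample", "push-to-checkout.sample"):
            open(os.path.join(root, ".git", "hooks", name), "w").close()
        open(os.path.join(root, "README.md"), "w").close()
        print(count_files(root))
```

With this approach the two hook files named above never contribute to the count, whichever base image the crawler uses.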

qtomlinson commented 2 months ago

For the detailed error message, see the log.