SAP / project-portal-for-innersource

Lists all InnerSource projects of a company in an interactive and easy to use way. Can be used as a template for implementing the "InnerSource portal" pattern by the InnerSource Commons community.
https://sap.github.io/project-portal-for-innersource/
Apache License 2.0

Mention existing crawler implementations #21

Closed · dellagustin-sap closed this issue 3 years ago

dellagustin-sap commented 3 years ago

I found this crawler implementation: https://github.com/zkoppert/innersource-crawler. It would be nice to mention it, and perhaps add a note to the contributing guidelines asking people to send a PR with a link if they implement a crawler.

zkoppert commented 3 years ago

see work happening at https://github.com/SAP/project-portal-for-innersource/pull/18

spier commented 3 years ago

More motivation to continue work on that PR :)

spier commented 3 years ago

@JustinGOSSES @zkoppert in the meantime we got around to creating dedicated documentation about the crawling process. I am sure it would greatly benefit from your review, as you have already looked into the crawling topic. See: https://github.com/SAP/project-portal-for-innersource/blob/main/docs/CRAWLING.md

Cheers :)
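For context, here is a minimal sketch of what a single crawler result entry might look like. It assumes the portal consumes a repos.json whose entries mirror GitHub API repository objects plus an _InnerSourceMetadata block, as described in the CRAWLING.md linked above; the values are made up and the exact field set is an assumption here, not an authoritative schema.

```python
# Hypothetical single entry of repos.json. Field names follow the GitHub REST
# API repository object; the "_InnerSourceMetadata" block is the portal's
# enrichment as described in docs/CRAWLING.md (treat the exact fields as an
# assumption for illustration).
example_entry = {
    "id": 2342,
    "name": "earth",
    "full_name": "Sol/earth",
    "html_url": "https://github.example.com/Sol/earth",
    "description": "A blue planet",
    "created_at": "2017-01-31T09:39:12Z",
    "pushed_at": "2020-10-08T12:18:22Z",
    "stargazers_count": 136,
    "forks_count": 331,
    "language": "Python",
    "_InnerSourceMetadata": {
        "topics": ["habitable", "water"],
        "participation": [3, 8, 2, 0, 1],  # weekly commit counts
        "logo": "./images/earth.png",
        "score": 2800,
    },
}
```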

Michadelic commented 3 years ago

fixed with #18

JustinGOSSES commented 3 years ago

I read through it and it seemed very understandable.

I can only think of a few small additions that might be useful, and they are more future additions than blockers: (1) a reference way to combine results from multiple crawlers running against different code platforms, (2) what happens if a code repository exists on multiple of the scanned code platforms, (3) a GitLab to GitHub mapping of the key:value pairs that come out of the crawlers, and (4) a link to a suitable GitLab crawler once one exists.

Most of these points are for a large distributed organization that might have many internal GitHub and GitLab instances.

spier commented 3 years ago

Thanks for the review @JustinGOSSES.

Can you say more about this point?

(3) a GitLab to GitHub mapping of the key:value pairs that come out of crawlers.

What would this be used for? Is it related to writing a dedicated GitLab crawler (point 4)?

JustinGOSSES commented 3 years ago

Yes. Point 3 should really have two parts, I guess: (a) a mapping used to convert the keys from GitLab names to GitHub names, and (b) an actual script to do the conversion.

This would then feed into point 1, a reference way to combine results from multiple crawlers running against different code platform instances.
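A minimal sketch of points 3a and 3b, assuming GitLab REST API project fields as the source and GitHub-style keys (as used by the portal) as the target. The mapping table is illustrative rather than exhaustive, and the defaults for missing fields are assumptions.

```python
# Point 3a: assumed mapping from GitLab project fields to GitHub-style keys.
GITLAB_TO_GITHUB_KEYS = {
    "id": "id",
    "name": "name",
    "path_with_namespace": "full_name",
    "web_url": "html_url",
    "description": "description",
    "created_at": "created_at",
    "last_activity_at": "pushed_at",
    "star_count": "stargazers_count",
    "forks_count": "forks_count",
    "open_issues_count": "open_issues_count",
}

def gitlab_to_github(project: dict) -> dict:
    """Point 3b: convert one GitLab project object into the GitHub-style shape."""
    converted = {
        github_key: project.get(gitlab_key)
        for gitlab_key, github_key in GITLAB_TO_GITHUB_KEYS.items()
    }
    # Fields with no direct GitLab equivalent need defaults or recalculation.
    converted.setdefault("language", None)
    converted["_InnerSourceMetadata"] = project.get("_InnerSourceMetadata", {})
    return converted
```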

Michadelic commented 3 years ago

We query GitHub and Git/Gerrit instances using the same crawler script in our environment. The fields from Gerrit need to be partially mapped or recalculated, as some stats and concepts do not exist there, but it is pretty straightforward. I would guess it is similar for GitLab or other stacks. We could add documentation for such mappings to the crawler documentation as we get more implementations for other stacks.
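Along these lines, a minimal sketch of point 1: merging the output of several crawler runs (one file per platform instance, already converted to the GitHub-style shape) into a single repos.json. Deduplicating by html_url is an assumption for point 2; a real setup might prefer full_name or an explicit canonical-source rule when the same repository is mirrored on several platforms.

```python
import json

def merge_crawler_results(result_files, output_file="repos.json"):
    """Combine the output of several crawler runs into one repos.json file.

    Each input file is assumed to contain a JSON list of GitHub-style
    repository objects. The first occurrence of a given html_url wins.
    """
    merged = {}
    for path in result_files:
        with open(path, encoding="utf-8") as f:
            for repo in json.load(f):
                merged.setdefault(repo["html_url"], repo)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(list(merged.values()), f, indent=2)

# Example usage with hypothetical per-platform result files:
# merge_crawler_results(["github-internal.json", "gitlab-internal.json", "gerrit.json"])
```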