apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0

[Bug][Lake] CodeReview (pull requests) data pulled by lake app does not honour the sync policy time range in config-ui #7079

Open sayeedhussain opened 4 months ago

sayeedhussain commented 4 months ago


What happened

CodeReview (pull requests) data pulled by the lake app does not honour the sync policy time range set in config-ui. Pull requests older than the configured range are also pulled from GitHub. See the screenshots.

[Screenshots: 2024-03-01 at 11:33:39 AM and 2024-03-01 at 11:33:29 AM]

What do you expect to happen

The sync policy time range in config-ui should be honoured by lake app while pulling data.

How to reproduce

  1. Create a project with a sync policy time range of 3 months.
  2. Create a GitHub data source connection with a scope config for CodeReview. Ensure the repository has pull requests that are more than 3 months old.
  3. Collect data.
  4. View the PR created dates for the project in MySQL.
  5. PRs created more than 3 months ago are also present.
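
Steps 4 and 5 can be checked with a query that counts PRs on either side of the cutoff. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for MySQL, the table name `pull_requests` and column `created_date` are assumptions about the schema, and "3 months" is approximated as 90 days.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def count_prs_around_cutoff(conn, cutoff):
    """Return (older, within): PR counts before and on/after the cutoff."""
    cutoff_str = cutoff.strftime(FMT)
    row = conn.execute(
        "SELECT "
        "  SUM(CASE WHEN created_date < ? THEN 1 ELSE 0 END), "
        "  SUM(CASE WHEN created_date >= ? THEN 1 ELSE 0 END) "
        "FROM pull_requests",  # assumed table and column names
        (cutoff_str, cutoff_str),
    ).fetchone()
    # SUM returns NULL on an empty table, so fall back to 0.
    return (row[0] or 0, row[1] or 0)


# In-memory stand-in for the real database, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pull_requests (id TEXT, created_date TEXT)")
now = datetime(2024, 3, 1, 12, 0, 0)
cutoff = now - timedelta(days=90)  # assumption: 3 months is about 90 days
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?)",
    [
        ("pr-1", (now - timedelta(days=10)).strftime(FMT)),
        ("pr-2", (now - timedelta(days=200)).strftime(FMT)),
        ("pr-3", (now - timedelta(days=89)).strftime(FMT)),
    ],
)
older, within = count_prs_around_cutoff(conn, cutoff)
print(f"older than cutoff: {older}, within range: {within}")
```

A non-zero `older` count is the symptom described in step 5.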

Anything else

No response

Version

0.21.0-beta5


d4x1 commented 4 months ago

That's because DevLake fetches GitHub pull requests via GitHub's GraphQL API, and this API doesn't support a createdAt filter. So DevLake collects all pull requests.


Maybe we can collect pull requests via the search API, as in this discussion: https://github.com/orgs/community/discussions/24611 . We can vote on this matter.
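
For reference, a rough sketch of what a search-based request could look like. It relies on GitHub's `created:>=DATE` search qualifier and the GraphQL `search` field with `type: ISSUE`; the helper names are hypothetical and this is not how DevLake currently collects data. Note that GitHub search returns at most 1,000 results per query, which may matter for large repositories.

```python
from datetime import date

# GraphQL document using GitHub's `search` field; ISSUE covers pull requests too.
SEARCH_QUERY = """
query($q: String!, $cursor: String) {
  search(query: $q, type: ISSUE, first: 100, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes { ... on PullRequest { number title createdAt } }
  }
}
"""


def build_pr_search_string(owner, repo, since):
    """Build a search string for PRs created on or after `since`."""
    return f"repo:{owner}/{repo} is:pr created:>={since.isoformat()}"


def build_request_body(owner, repo, since, cursor=None):
    """Build the JSON body for a POST to the GraphQL endpoint (hypothetical helper)."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "q": build_pr_search_string(owner, repo, since),
            "cursor": cursor,  # None for the first page
        },
    }


body = build_request_body("apache", "incubator-devlake", date(2023, 12, 1))
print(body["variables"]["q"])
```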

sayeedhussain commented 4 months ago

Thanks for the analysis @d4x1. For now, we are able to work around this issue by adjusting our MySQL queries.

But in general, I think it would be good to fix this so that there are no special conditions about honouring the sync policy time range.

d4x1 commented 3 months ago

@sayeedhussain IMO, using the search API is not the right way. We could ask GitHub to update its GraphQL API, but that would take too long. Maybe we can filter out records that fall outside the time range. @abeizn Would that lead to other problems?
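
A minimal sketch of the filtering idea, assuming each collected record carries a `createdAt` timestamp in GitHub's ISO 8601 UTC format. The function names and sample records are made up for illustration and do not reflect DevLake's actual code.

```python
from datetime import datetime, timezone


def parse_github_time(value):
    """Parse a timestamp such as '2024-03-01T11:33:39Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def keep_in_range(pull_requests, cutoff):
    """Return only the PRs created at or after `cutoff`, preserving order."""
    return [
        pr for pr in pull_requests
        if parse_github_time(pr["createdAt"]) >= cutoff
    ]


# Made-up records standing in for one page of API results.
collected = [
    {"number": 3, "createdAt": "2024-02-20T09:00:00Z"},
    {"number": 2, "createdAt": "2024-01-05T15:30:00Z"},
    {"number": 1, "createdAt": "2023-06-11T08:45:00Z"},
]
cutoff = datetime(2023, 12, 1, tzinfo=timezone.utc)
kept = keep_in_range(collected, cutoff)
print([pr["number"] for pr in kept])
```

Filtering this way still downloads every pull request and only discards the older ones afterwards, so it fixes the stored data but not the extra API traffic.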

sayeedhussain commented 3 months ago

@d4x1 Sure. My suggestion was only that it would be good to fix the issue. How to fix it is best decided by you and the team. Thanks!

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.