Open leonid-deriv opened 10 months ago
Maybe this is a coincidence, but data stopped being imported on the date I ran the group update, which, as you know, failed.
One more comment: I am not sure why the crawler is trying to fetch xxxx/yyyy. I ran both REST and GraphQL requests, and GitHub does not return this repo. It is not shown in the Monocle web interface either, and we have never had this repo.
Had to reindex :(
Looks like we have a similar case again. After some "event" it stops importing data :(. The symptoms are similar to what I described before. As far as I remember, the repository the crawler complained about did not exist, and this time I also cannot find this repo. Last time the only solution was to completely rebuild the index, but I am afraid this is not a good option. Any idea how we can troubleshoot it? Here is another error message I see regularly in the log:
2024-03-07 19:16:09 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"xxxx","crawler":"xxx-monocle-xxxx","stream":"Changes","err":["2024-03-07T19:16:09.507202765Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 130, remaining = 4870, resetAt = DateTime \"2024-03-07T19:32:37Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}
To me, given that it refers to a non-existent repo, it looks like some internal cache may be corrupted. If so, maybe it is possible to clean it, and then I can reset the date to re-scan the data? I really do not want to reindex again, and since this has now happened twice, it will probably happen again :(
Another question, about the last date: Monocle crawlers keep track of the last date (commit date) when a successful document fetch happened.
Where does the crawler store this data?
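For troubleshooting, the WARNING line quoted above can be split into its fields. The following is a minimal sketch, not part of Monocle: `parse_fatal_error` is a hypothetical helper, and it assumes every such line keeps the shape of the one shown here (a fixed prefix followed by a single JSON object).

```python
import json

# The WARNING line from this thread, copied verbatim (raw string keeps the backslashes).
LOG_LINE = r'''2024-03-07 19:16:09 WARNING Macroscope.Worker:167: Stream produced a fatal error {"index":"xxxx","crawler":"xxx-monocle-xxxx","stream":"Changes","err":["2024-03-07T19:16:09.507202765Z",{"contents":["Unknown GetProjectPullRequests response: GetProjectPullRequests {rateLimit = Just (GetProjectPullRequestsRateLimit {used = 130, remaining = 4870, resetAt = DateTime \"2024-03-07T19:32:37Z\"}), repository = Nothing}"],"tag":"DecodeError"}]}'''


def parse_fatal_error(line):
    """Return the fields of a 'Stream produced a fatal error' line, or None."""
    marker = "Stream produced a fatal error "
    _, found, payload = line.partition(marker)
    if not found:
        return None
    data = json.loads(payload)
    timestamp, error = data["err"]
    message = error["contents"][0]
    return {
        "index": data["index"],
        "crawler": data["crawler"],
        "stream": data["stream"],
        "timestamp": timestamp,
        "tag": error["tag"],
        "message": message,
        # The GraphQL response was decoded, but it carried no repository object.
        "repository_missing": "repository = Nothing" in message,
    }
```

Note that the payload names the index, the crawler and the stream, but not the repository that could not be found, so the log line alone does not tell you which entry is stale.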
Yes, there is a cache. The CLI does not provide a way to clear such entries for repositories that no longer exist. Perhaps you could try removing the related state object in the Elasticsearch DB: https://github.com/change-metrics/monocle/blob/master/src/Monocle/Backend/Index.hs#L204
I am trying to find the index in Elastic where you store this metadata and cannot find it. It is not visible in Kibana, or I am doing something wrong.
I think that's because it is an object without the usual date field, so you need to select the right parameter when creating the Kibana index pattern.
But what is the index name? Is it not the index where all the workspace data is stored?
same index
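Since the state objects live in the same index as the workspace data, one way to list them without Kibana is a plain `_search` request. The sketch below only builds the request body. The field names `crawler_metadata.crawler_name` and the helper `crawler_state_search` are assumptions made for illustration, not confirmed anywhere in this thread; check them against the mapping in the `Index.hs` file linked above before relying on them.

```python
def crawler_state_search(crawler_name=None, size=50):
    """Build an Elasticsearch _search body that lists crawler state objects.

    ASSUMPTION: state objects carry a `crawler_metadata.crawler_name` keyword
    field. This name is hypothetical and must be checked against the mapping.
    """
    # Keep only documents that have the (assumed) crawler metadata field.
    filters = [{"exists": {"field": "crawler_metadata.crawler_name"}}]
    if crawler_name is not None:
        # Narrow the listing to a single crawler.
        filters.append({"term": {"crawler_metadata.crawler_name": crawler_name}})
    return {"size": size, "query": {"bool": {"filter": filters}}}
```

The returned dictionary can be sent as the JSON body of a `POST <index>/_search` request, for example `crawler_state_search("xxx-monocle-xxxx")` for the crawler named in the log above.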
Fabien, sorry for the trouble, but I cannot find the required information. What is the document type used to cache crawler information?
I was not attentive enough when looking at the index. Removing the repo from Elastic looks like it solved the problem. Should I register a bug for it?
Leonid
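For anyone hitting the same problem, the manual removal could be scripted along these lines. This is a sketch under stated assumptions, not a documented Monocle procedure: the field names `crawler_metadata.crawler_name` and `crawler_metadata.crawler_type_value` are hypothetical, as is the helper `delete_state_request`. The function only builds the request and does not send it.

```python
import json
import urllib.request


def delete_state_request(es_url, index, crawler_name, entity_value):
    """Build (but do not send) a _delete_by_query request for one state object.

    ASSUMPTION: both field names below are hypothetical; verify them against
    the real index mapping before sending anything.
    """
    body = {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"crawler_metadata.crawler_name": crawler_name}},
                    {"term": {"crawler_metadata.crawler_type_value": entity_value}},
                ]
            }
        }
    }
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}/_delete_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Before passing the request to `urllib.request.urlopen`, it is worth running the same query through `_search` first to confirm that it matches only the stale entry.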
Hi, thank you for confirming this. We can keep this issue open so we can investigate the fact that the crawler stops when a repository that no longer exists is still in the "cache". Such objects can stay in the cache, but they should not prevent the crawler from processing the rest of the repos.
Looking at it, it looks like:
- the error comes from (when the repository is empty): https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Lentille/GitHub/PullRequests.hs#L83
- and this stops the PR crawler because postResult considers this a fatal error: https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Macroscope/Worker.hs#L208-L217
I think we could:
- add a new EntityRemoved error to https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Lentille.hs#L113-L114
- clean up the crawler metadata when it happens in the Worker module
- ignore it to keep the crawler running in https://github.com/change-metrics/monocle/blob/d2232bffee1a381c991854dea7abc6784926137c/src/Macroscope/Worker.hs#L113-L119
Note that the comment above is not correct; it should say "This is likely an error we *can* recover".
Thank you. The most important thing is to make this error "non-fatal" so that the crawler continues running. And all 3 of your points make sense.
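The three proposed points can be illustrated with a small control-flow sketch. It is written in Python purely for illustration and is not Monocle's Haskell code: `Crawler`, `FatalStreamError` and the in-memory `metadata` dictionary are hypothetical stand-ins, and only the name `EntityRemoved` comes from the proposal itself.

```python
from dataclasses import dataclass, field


class EntityRemoved(Exception):
    """The remote entity (for example a repository) no longer exists."""


class FatalStreamError(Exception):
    """Hypothetical stand-in for errors that should still stop the crawler."""


@dataclass
class Crawler:
    # Entity name -> last successful commit date (stand-in for the stored state).
    metadata: dict
    processed: list = field(default_factory=list)

    def run(self, fetch):
        # Iterate over a copy of the keys, since stale entries are deleted on the way.
        for entity in list(self.metadata):
            try:
                fetch(entity)
            except EntityRemoved:
                # Clean up the stale metadata, then ignore the error and move on.
                del self.metadata[entity]
                continue
            self.processed.append(entity)
```

Any other exception, such as `FatalStreamError`, still propagates and stops the run, so only the "entity was removed" case becomes recoverable.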
Any chance of fixing this error? The problem is that we are dropping "old" repos, and I have to manually remove them from the cache every single time :(
What's the easiest way to clear the cache?
I have noticed that a crawler stopped importing data. I see the following errors in the log
Actually, the repository that cannot be found does not exist. I thought it could be cached, so I restarted the services, but it looks like I still have the problem. Any suggestions?