Closed lukaszgryglicki closed 4 years ago
OK, starting work on this.
@lukaszgryglicki are you still working on this issue? If needed, I can take it over. Let me know, thanks.
I need to work on something else for at least one day, so you can take it.
Hi @lukaszgryglicki,
I'm not sure sleep for rate can be useful in this case, since the remaining time before another request can be issued to the API isn't returned in the response (or at least, I didn't find it). In token-based backends (e.g., github and gitlab), that time is included in the response, so it's pretty easy to stop and resume the fetching process. It would be great if that info were included in the API response. In the meantime, you can pull the data from groups.io every 2-3 days (we are following this approach ATM).
I'm not sure about the usefulness of start_msg_num; as a final user it would make more sense to get the messages after a given date. In this sense, I dropped a message here: https://beta.groups.io/g/api/topic/feature_request_add/69933215?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,69933215
Best, Valerio
@lukaszgryglicki the discussion at the groups.io link above is ongoing, feel free to jump in if you find it useful.
I've seen the discussion. Is there any problem with implementing resuming using the last message ID instead of a date? And "sleep for rate" needs some work on the groups.io side, right? Like returning the number of seconds before the rate limit expires?
> I've seen the discussion. Is there any problem with implementing resuming using the last message ID instead of a date?
It requires more effort, since the current backend isn't offset-based. The new backend could look like the NNTP one: https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/nntp.py.
> And "sleep for rate" needs some work on the groups.io side, right? Like returning the number of seconds before the rate limit expires?
Yes, right!
So, any decisions? Do we wait for groups.io to implement the "timestamp" approach in their API?
-- or --
Do we implement the last message ID approach?
How about sleep for rate?
Do we want groups.io to implement returning a "time to reset" value?
-- or --
Do we skip support for sleep for rate?
-- or --
Do we implement a --sleep-for-rate flag that specifies a maximum wait time (for example, 600 seconds), and during those 600 seconds check every 10 or 30 seconds whether the rate limit has expired?
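To make the last option concrete, here is a minimal Python sketch of "wait up to a maximum time, re-checking at a fixed interval". It is not Perceval code: `RateLimitError`, `fetch_with_polling`, and the `fetch` callable are hypothetical names, and it assumes a rate-limited call can be detected and surfaced as an exception, since groups.io does not report the time left until the limit resets.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error: the server rejected the call due to rate limiting."""


def fetch_with_polling(fetch, max_wait=600, poll_interval=30,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch(); on RateLimitError, retry every poll_interval seconds
    until it succeeds or max_wait seconds have elapsed in total."""
    deadline = clock() + max_wait
    while True:
        try:
            return fetch()
        except RateLimitError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # waited as long as allowed; give up
            # Never sleep past the deadline.
            sleep(min(poll_interval, remaining))
```

The drawback compared to the GitHub backend is that every check is itself a request, so a short interval adds load on a server that is already rate limiting.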
Hi @lukaszgryglicki
> Do we wait for groups.io to implement the "timestamp" approach in their API?
Yes
> Do we want groups.io to implement returning a "time to reset" value?
Yes
In the meantime, the data from groups.io can be downloaded every X days to avoid stressing their servers.
WDYT?
And it works that way currently: we just get a rate limit error, but after some time we're OK. The biggest issue is that we download all the data again and again instead of resuming from the last message.
Did the groups.io team implement the timestamp approach? If not, do they plan to, or should I attempt to implement the start_msg_num approach? Does anybody know which field saved by grimoire's groups.io backend is the actual start_msg_num, so I can search for it in the ES index, find the highest value, and pass it to the backend to resume from?
Hi @lukaszgryglicki,
According to the last message at https://beta.groups.io/g/api/topic/feature_request_add/69933215?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,69933215, they plan to implement it.
The field is the header X-Groupsio-MsgNum and it is supposedly included in all messages; however, when I tried with the group onap+onap-zoom-hosts, I didn't find it.
Same for me, and I didn't see any field that is similar to msg_num. So ideally groups.io will support start_msg_date, so I should ping them, right?
It would be great if you could ping them too, thanks!
I've just got an email:
> I've added a start_time parameter that takes a date. Cheers, Mark
So you can proceed with the date approach.
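For the date approach, the resume point would be the date of the newest message already downloaded. A rough Python sketch of computing it from raw message text follows. `latest_message_date` is a hypothetical helper, not part of Perceval, and the date format that start_time expects is not stated in this thread, so the result would still need to be formatted accordingly.

```python
import email
from datetime import timezone
from email.utils import parsedate_to_datetime


def latest_message_date(raw_messages):
    """Return the most recent Date header as an aware datetime,
    or None if no message has a parseable date."""
    latest = None
    for raw in raw_messages:
        date_header = email.message_from_string(raw).get("Date")
        if not date_header:
            continue
        try:
            dt = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            continue  # malformed date; skip this message
        if dt.tzinfo is None:
            # Assumption: treat dates without a usable zone as UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        if latest is None or dt > latest:
            latest = dt
    return latest
```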
Great! I'll have a look at it, thanks for the info
Thanks!
@lukaszgryglicki I'm working on the fix to use the start_time param.
Great, thanks!
Feature request: groups.io is now rate limiting calls to the /downloadarchives API, so we need something like --sleep-for-rate, similar to the GitHub perceval backend.
Additionally, groups.io now allows "resuming" data processing. This is the information we got from the groups.io developers:
> I've added a start_msg_num parameter to the /downloadarchives API. Also, in the message headers of the archives downloaded, I've added an X-Groupsio-MsgNum header with the message number. Download all the archives initially, then scan for the highest MsgNum in what you've downloaded. Use that number for start_msg_num next time you call /downloadarchives and you'll only get newer messages. Repeat the process each time.
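The "scan for the highest MsgNum" step described above could look roughly like this in Python. This is only a sketch: `highest_msg_num` is a hypothetical helper, and it assumes the downloaded archive has already been split into raw message strings. It returns None when no message carries the header, so the caller can fall back to a full download.

```python
import email


def highest_msg_num(raw_messages):
    """Return the highest X-Groupsio-MsgNum found in the messages,
    or None if no message carries a numeric value for that header."""
    highest = None
    for raw in raw_messages:
        value = email.message_from_string(raw).get("X-Groupsio-MsgNum")
        if value is None:
            continue
        try:
            num = int(value.strip())
        except ValueError:
            continue  # header present but not a number; skip it
        if highest is None or num > highest:
            highest = num
    return highest
```

The returned value would then be passed as start_msg_num on the next call. Whether the message with that exact number is returned again is not stated above, so the caller should be prepared to drop one duplicate.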