chaoss / grimoirelab-perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.
http://perceval.readthedocs.io/
GNU General Public License v3.0
[groups.io] support rate limiting and start_msg_num parameter for /downloadarchives API #607

Closed lukaszgryglicki closed 4 years ago

lukaszgryglicki commented 4 years ago

Feature request: groups.io is now rate limiting calls to /downloadarchives API

Additionally, groups.io now allows "resuming" data processing. This is the information we got from the groups.io developers: "I've added a start_msg_num parameter to the /downloadarchives API. Also, in the message headers of the archives downloaded, I've added a X-Groupsio-MsgNum header with the message number. Download all the archives initially, then scan for the highest MsgNum in what you've downloaded. Use that number for start_msg_num next time you call /downloadarchives and you'll only get newer messages. Repeat the process each time."
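The "scan for the highest MsgNum" step described above could be sketched as follows. This is a minimal illustration, not Perceval code: it assumes the downloaded archive has already been unpacked to an mbox file, and the function names are hypothetical. Only the X-Groupsio-MsgNum header name comes from the groups.io message.

```python
import mailbox
from email.message import Message
from typing import Iterable, Optional

# Header name as quoted from the groups.io developers.
MSGNUM_HEADER = "X-Groupsio-MsgNum"


def highest_msg_num(messages: Iterable[Message]) -> Optional[int]:
    """Return the largest message number found in the given messages.

    Messages without the header (or with a non-numeric value) are skipped.
    Returns None when no message carries a usable header.
    """
    highest = None
    for msg in messages:
        raw = msg.get(MSGNUM_HEADER)
        if raw is None:
            continue
        try:
            num = int(str(raw).strip())
        except ValueError:
            continue
        if highest is None or num > highest:
            highest = num
    return highest


def highest_msg_num_in_mbox(path: str) -> Optional[int]:
    """Scan an mbox file (hypothetical local path) for the highest number."""
    box = mailbox.mbox(path, create=False)
    try:
        return highest_msg_num(box)
    finally:
        box.close()
```

The returned value would then be passed as start_msg_num on the next /downloadarchives call. A None result means the header was absent, in which case a full download is the only safe fallback.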

lukaszgryglicki commented 4 years ago

OK, starting work on this.

valeriocos commented 4 years ago

@lukaszgryglicki are you still working on this issue? If needed, I can take it over. Let me know, thanks

lukaszgryglicki commented 4 years ago

I need to work on something else for at least one day, so you can take it.

valeriocos commented 4 years ago

Hi @lukaszgryglicki,

I'm not sure the sleep for rate can be useful in this case, since the remaining time before issuing another request to the API isn't returned in the response (or at least, I didn't find it). In token-based backends (e.g., github and gitlab), that time is included in the response, so it's pretty easy to stop and resume the fetching process. It would be great if that info were included in the API response. In the meantime, you can pull the data from groups.io every 2-3 days (we are following this approach ATM).

I'm not sure about the usefulness of start_msg_num; as an end user, it would make more sense to get the messages after a given date. To that end, I posted a message here: https://beta.groups.io/g/api/topic/feature_request_add/69933215?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,69933215

Best, Valerio

valeriocos commented 4 years ago

@lukaszgryglicki the discussion on the groups.io link above is ongoing, feel free to jump in if you find it useful.

lukaszgryglicki commented 4 years ago

I've seen the discussion, is there any problem to implement resuming using last message ID instead of date? And "sleep for rate" needs some work on groups.io side, right? Like returning the number of seconds before rate limit expires, right?

valeriocos commented 4 years ago

I've seen the discussion, is there any problem to implement resuming using last message ID instead of date?

It requires more effort, since the current backend isn't offset-based. The new backend could look like the NNTP one: https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/nntp.py.

And "sleep for rate" needs some work on groups.io side, right? Like returning the number of seconds before rate limit expires, right?

Yes, right!

lukaszgryglicki commented 4 years ago

So, any decisions? Like do we wait for groups.io to implement "timestamp" approach in their API? -- or -- Do we implement the last message ID approach?

How about sleep for rate? Do we want groups.io to implement returning "time to reset" value? -- or -- Do we skip support for sleep for rate? -- or -- Do we implement a --sleep-for-rate flag that specifies a maximum wait time (for example, 600 seconds), and during those 600 seconds check every 10 or 30 seconds whether the rate limit has expired?
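The third option (a maximum wait time with periodic checks) could look roughly like the sketch below. All names are hypothetical and nothing here is part of Perceval: `is_limited` stands for whatever probe tells us the API is still rejecting requests, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_rate_limit(
    is_limited: Callable[[], bool],
    max_wait: float = 600.0,
    poll_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the rate limit clears or max_wait seconds have elapsed.

    Returns True if the limit cleared in time, False if we gave up.
    """
    deadline = clock() + max_wait
    while is_limited():
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
    return True
```

Without a "time to reset" value from the API, each probe is itself a request, so a short poll interval may add to the load that triggered the limit in the first place.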

valeriocos commented 4 years ago

Hi @lukaszgryglicki

Like do we wait for groups.io to implement "timestamp" approach in their API?

Yes

Do we want groups.io to implement returning "time to reset" value?

Yes

In the meantime, the data from groups.io can be downloaded every X days to avoid stressing their servers.

WDYT?

lukaszgryglicki commented 4 years ago

It works that way currently: we just get a rate limit error, but after some time we're OK. The biggest issue is that we download all the data again and again instead of resuming from the last message.

lukaszgryglicki commented 4 years ago

Did the groups.io developers implement the timestamp approach? If not, do they plan to, or should I attempt to implement the start_msg_num approach? Does anybody know which field saved by grimoire's groups.io backend holds the actual message number, so I can search for it in the ES index, find the highest value, and pass it to the backend as start_msg_num to resume from?

valeriocos commented 4 years ago

Hi @lukaszgryglicki,

According to the last message at https://beta.groups.io/g/api/topic/feature_request_add/69933215?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,69933215, they plan to implement it.

The field is the header X-Groupsio-MsgNum, and it is supposedly included in all messages; however, when I tried with the group onap+onap-zoom-hosts, I didn't find it.

lukaszgryglicki commented 4 years ago

Same here, and I didn't see any field that is similar to msg_num. So ideally groups.io will support start_msg_date, so I should ping them, right?

valeriocos commented 4 years ago

it would be great if you can ping them too, thanks!

lukaszgryglicki commented 4 years ago

I've just got an email: "I've added a start_time parameter that takes a date. Cheers, Mark" So you can proceed with the date approach.
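A rough sketch of how a request using the new parameter might be built. Only the /downloadarchives endpoint and the start_time name come from this thread. The base URL, the group_id parameter, and the ISO 8601 date format are assumptions that would need checking against the groups.io API documentation, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed base URL; only the /downloadarchives path is named in the thread.
API_URL = "https://groups.io/api/v1/downloadarchives"


def build_archive_url(group_id: int, last_fetch: datetime) -> str:
    """Build a download URL that asks only for messages after last_fetch."""
    # Treat naive datetimes as UTC so the output is unambiguous.
    if last_fetch.tzinfo is None:
        last_fetch = last_fetch.replace(tzinfo=timezone.utc)
    # Assumed date format (ISO 8601, UTC); the email only says "a date".
    start_time = last_fetch.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {"group_id": group_id, "start_time": start_time}
    return API_URL + "?" + urlencode(params)
```

Passing the date of the newest message already stored would avoid downloading the whole archive on every run.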

valeriocos commented 4 years ago

Great! I'll have a look at it, thanks for the info

lukaszgryglicki commented 4 years ago

Thanks!

valeriocos commented 4 years ago

@lukaszgryglicki I'm working on the fix to use the start_time param

lukaszgryglicki commented 4 years ago

Great, thanks!