TeMU-BSC / iberifier


better query builder, WIP #14

Closed Oliph closed 2 years ago

Oliph commented 2 years ago

Trying to build one query from all the bigrams, rather than one query per bigram. It should also split the query into pieces of the right size to respect the limit of 1024. Not tested.
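A minimal sketch of the approach described above, not the code from this PR. It assumes the limit of 1024 counts characters, that the search syntax accepts parenthesised groups joined with OR, and that a space between two words means both must match. The names `bigram_clause` and `build_queries` are hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_QUERY_LEN = 1024  # limit mentioned in the thread; assumed to be in characters
SEPARATOR = " OR "


def bigram_clause(bigram: Tuple[str, str]) -> str:
    """Render one bigram as a parenthesised clause, e.g. ('foo', 'bar') -> '(foo bar)'."""
    return "(" + " ".join(bigram) + ")"


def build_queries(
    bigrams: Iterable[Tuple[str, str]], max_len: int = MAX_QUERY_LEN
) -> List[str]:
    """Pack bigram clauses into as few OR-joined queries as fit under max_len."""
    queries: List[str] = []
    current: List[str] = []  # clauses of the query being built
    current_len = 0          # length of SEPARATOR.join(current)

    for bigram in bigrams:
        clause = bigram_clause(bigram)
        if len(clause) > max_len:
            raise ValueError(f"single clause exceeds the limit: {clause!r}")

        # Cost of appending this clause to the current query.
        extra = len(clause) if not current else len(SEPARATOR) + len(clause)

        if current and current_len + extra > max_len:
            # Current query is full: close it and start a new one.
            queries.append(SEPARATOR.join(current))
            current = [clause]
            current_len = len(clause)
        else:
            current.append(clause)
            current_len += extra

    if current:
        queries.append(SEPARATOR.join(current))
    return queries
```

Each returned string stays within `max_len`, and every bigram ends up in exactly one query, so the number of requests grows with the total text length rather than with the number of bigrams.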

cuquiwi commented 2 years ago

1) Avoid while True whenever possible (which is almost always).
2) I think the code is overcomplicated and can be simplified. I can do it.
3) We should decide whether to make one request per bigram or to join the bigrams so that we make one request per claim. Each option has pros and cons. With one request per bigram, we can associate each tweet with a search, and if a search is too general and generates a lot of noise we can quickly dismiss it. But we make more requests per minute and will run into rate limits, and claims with more bigrams will potentially have more tweets. With one request per claim, we balance the number of tweets per claim and reduce the problems with the request rate limit, but we lose the ability to associate a tweet with a bigram.

We need to decide what we want. Personally, I prefer one request per bigram.

Oliph commented 2 years ago

[1, 2] Well, this while True is controlled by the StopIteration, so I have the feeling this case falls under the !(almost always). But I am more than happy to have a simplified version, as you suggested. I am not even sure it works anyway, as I haven't tested it.
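For readers following the while True point, here is an illustrative comparison, not the code from this PR. Both functions are hypothetical and batch items into fixed-size groups: the first drives the iterator by hand and exits on StopIteration, the second lets a for loop handle exhaustion.

```python
from typing import Iterable, List


def batches_with_while(items: Iterable[str], size: int) -> List[List[str]]:
    """Batch items using an explicit iterator, while True and StopIteration."""
    iterator = iter(items)
    out: List[List[str]] = []
    batch: List[str] = []
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break  # the only exit from the loop
        batch.append(item)
        if len(batch) == size:
            out.append(batch)
            batch = []
    if batch:
        out.append(batch)  # keep the final, partially filled batch
    return out


def batches_with_for(items: Iterable[str], size: int) -> List[List[str]]:
    """Same behaviour; the for loop handles iterator exhaustion itself."""
    out: List[List[str]] = []
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            out.append(batch)
            batch = []
    if batch:
        out.append(batch)
    return out
```

The two versions return the same result; the second has no exit condition that a later edit could accidentally break.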

[3] No, the main problem with one query per bigram is the potentially huge increase in duplicated tweets in the data collection: we would recollect the same tweets over and over, potentially hitting the per-day (or per-month) tweet limit much faster. So not one request per bigram; the objective is to minimise the number of queries while keeping some control over which queries generate too much noise.
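A rough sketch of how that control could be kept when several bigrams share a query, under assumptions that are not in the PR: each result is a dict with an "id" key, and `merge_results` is a hypothetical helper. It records which queries returned each tweet and counts hits per query, so a noisy query can still be spotted. It does not recover quota already spent on duplicates.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_results(
    results: Iterable[Tuple[str, List[dict]]]
) -> Tuple[Dict[int, dict], Counter]:
    """Merge (query, tweets) pairs into unique tweets plus per-query hit counts.

    Assumes every tweet dict carries a unique "id" key (hypothetical shape).
    """
    unique: Dict[int, dict] = {}
    hits: Counter = Counter()

    for query, tweets in results:
        hits[query] += len(tweets)  # raw volume, used to flag noisy queries
        for tweet in tweets:
            entry = unique.setdefault(tweet["id"], {"tweet": tweet, "queries": []})
            if query not in entry["queries"]:
                entry["queries"].append(query)  # remember every query that matched

    return unique, hits
```

Comparing `sum(hits.values())` with `len(unique)` gives a quick measure of how many collected tweets were duplicates across queries.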

cuquiwi commented 2 years ago

I simplified the code. It still needs to be tested.