TeMU-BSC / iberifier


Data collection per day #55

Open Oliph opened 1 year ago

Oliph commented 1 year ago

To download data from Twitter and MyNews, the pipeline works as follows:

  1. Check whether the keywords collection contains a record without the key search_twitter_key (or search_mynews_key)
  2. If there are results, check the date key and whether date < today - days_after. Since the data collection needs to cover x days before the fact-check date and x days after it, the pipeline has to wait long enough for the x days after the fact-check to have elapsed.
  3. If true, run one query per day, from x days before the claim to x days after the claim
  4. Record the data
  5. Set the search_twitter_key in the keywords collection to the date of data collection
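The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the record layout (a dict with `date` and `keywords`), the constants `DAYS_BEFORE`/`DAYS_AFTER` (the x above), and the `search_fn` callback standing in for the Twitter/MyNews query are all assumptions.

```python
from datetime import date, timedelta

DAYS_BEFORE = 7  # hypothetical x: days searched before the fact-check date
DAYS_AFTER = 7   # hypothetical x: days searched after the fact-check date

def is_ready(record: dict, today: date) -> bool:
    """Steps 1-2: a record qualifies if it has no search_twitter_key yet
    and enough days have passed to cover the days_after window."""
    if "search_twitter_key" in record:
        return False
    return record["date"] < today - timedelta(days=DAYS_AFTER)

def query_dates(claim_date: date) -> list[date]:
    """Step 3: one query per day, from x days before to x days after the claim."""
    start = claim_date - timedelta(days=DAYS_BEFORE)
    end = claim_date + timedelta(days=DAYS_AFTER)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

def process(record: dict, today: date, search_fn) -> dict:
    """Steps 3-5: run the daily queries, store results, mark the record done."""
    if not is_ready(record, today):
        return record
    for day in query_dates(record["date"]):
        search_fn(record["keywords"], day)   # step 4: record the data
    record["search_twitter_key"] = today     # step 5: mark as collected
    return record
```

Because `process` only marks the record at the very end, a crash mid-run leaves `search_twitter_key` unset and the whole record is retried on the next run, which is exactly the rerun-safety property described below.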

The advantage is that the pipeline can rerun after a crash without missing any days, since the two conditions (no search_key present, and date + days_after < today) can be rechecked at any time. The drawback of this approach is that all data collection happens in the past. While that is not a problem for Twitter, it may cause issues with MyNews (see #47). Ideally, while retaining the advantage of the current method, the pipeline should be able to start collecting as soon as a claim is recorded in the keywords collection (or perhaps a day later) and continue collecting until the days_after date is reached.

clairefurtick commented 1 year ago

One thing we could do is add a field called day_count that records how many days after the claim we have already searched. Alternatively, we could repurpose search_twitter_key to serve that role and change the initial check in step 1 above to check whether the value of search_twitter_key <= x days. Then we have a while loop: while day_count <= x && date + day_count <= today, perform a search on date + day_count and increment day_count by 1.
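A rough sketch of that idea, under the same assumptions as before (dict records, a hypothetical `DAYS_AFTER` constant for x, and a `search_fn` callback standing in for the actual query):

```python
from datetime import date, timedelta

DAYS_AFTER = 7  # hypothetical x: days to keep searching after the claim

def incremental_search(record: dict, today: date, search_fn) -> dict:
    """Search day by day, resuming from day_count, so collection can start
    as soon as the claim is recorded and continue until days_after is reached."""
    claim_date = record["date"]
    day_count = record.get("day_count", 0)
    while (day_count <= DAYS_AFTER
           and claim_date + timedelta(days=day_count) <= today):
        search_fn(record["keywords"], claim_date + timedelta(days=day_count))
        day_count += 1
    record["day_count"] = day_count  # persist so a rerun picks up from here
    return record
```

This keeps the rerun-safety property: if the process crashes, day_count reflects the last completed day, so the next run resumes without re-searching or skipping days, while still allowing collection to begin the day the claim appears.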