Closed milo-trujillo closed 5 years ago
We may not need checkpoints in data collection. Right now the acquiring script says:
for layer in layers:
for user in usersForLayer(layer):
if( we don't have tweets for user yet ):
get tweets and analyze
This assumes that if we already have the tweets for user foo then foo was part of a previous layer, and so we should skip them and not include them now.
Instead, let's make the list of users explicit:
for layer in layers:
for user in usersForLayer(layer):
if( we don't have tweets for user yet ):
get tweets
if( user not in a previous layer ):
analyze tweets
What this means is if we are interrupted and the script restarts it will see "We already have tweets saved for a bunch of users, let's use the cached data instead of re-downloading everything".
From here it should be much easier to add functionality for "restart from layer 2" and quickly recover to the state we were interrupted at.
The behavior above has been in master for some time. We now recover from mid-layer when collecting tweets, and at the layer boundary when analyzing tweets. Considering the issue closed.
Since this is a very long running program it is not unlikely that execution will be halted partway through. Where are good points to checkpoint progress, and how can we detect checkpoints and recovery from them automatically when the process is resumed?