eleurent / twitter-graph

Fetch and visualize the graph of your Twitter friends and followers.

Interruption can corrupt the cache #14

Closed FlorentLefebvre closed 2 years ago

FlorentLefebvre commented 3 years ago

When the process is interrupted (for one reason or another), the cache can be corrupted and part of it lost. The frequency of the corruption seems to be correlated with the size of the cache.

I reproduced it three times in a row with a "friendships.json" of 40 MB, and it also happened (maybe 30% of the time?) with a 20 MB file. When the bug hits, I lose about half the cache.

eleurent commented 3 years ago

Yes, the caching mechanism is reaaaaaaally not suited for large files. Basically, for each new user, it reads the full JSON file, adds a single entry, and writes the whole file again. At some point, the write is going to take time, and there will be a significant probability that an interruption lands in the middle of a write operation.

A slightly better system would be to i) write the cache to another (temporary) file, ii) remove the old cache, and iii) rename the new cache. This way, there is always one valid version of the cache on disk.
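A minimal sketch of that write-then-rename idea, assuming a plain JSON cache (the function name and paths are illustrative, not the repo's actual API):

```python
import json
import os

def write_cache(path, data):
    # Write the new cache to a temporary file first, so an interruption
    # during json.dump never touches the existing cache file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f)
    # os.replace swaps the new file in place of the old one atomically
    # (on both POSIX and Windows), so one valid cache always exists on disk.
    os.replace(tmp_path, path)
```

Here os.replace covers steps ii) and iii) in a single call; the explicit remove-then-rename described above would also work, at the cost of a brief window with no cache file on disk.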

And for even bigger graphs, databases should be used.

nadesai commented 2 years ago

+1, I have also experienced this issue.

nadesai commented 2 years ago

One reason fixing this may be valuable: we may not know ahead of time that some of our users have very large "following" lists, or are "limited" as noted in #15. If we discover such users, we need to interrupt the script, add the problematic user to excluded.json, and restart.

nadesai commented 2 years ago

If there is interest in switching to a database cache (presumably something like sqlite), https://github.com/dogsheep/twitter-to-sqlite may be useful.
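In case it helps the discussion, a rough sketch of what a sqlite-backed cache could look like (the table schema and function name are made up for illustration, not taken from the repo or from twitter-to-sqlite):

```python
import json
import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS friendships (user_id TEXT PRIMARY KEY, data TEXT)"
)

def cache_friendship(user_id, friend_ids):
    # One row per user: adding an entry is a single INSERT instead of
    # rewriting the whole JSON file.
    conn.execute(
        "INSERT OR REPLACE INTO friendships VALUES (?, ?)",
        (user_id, json.dumps(friend_ids)),
    )
    # Each commit is durable, so an interruption loses at most the entry
    # currently being written.
    conn.commit()
```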

FlorentLefebvre commented 2 years ago

Personally, I simply added a counter to update the cache less often. Just a quick fix so I can move on to another problem.

if i % 3000 == 0: get_or_set(out / target / friendships_file, friendships.copy(), force=True)
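For context, a rough sketch of where such a check might sit; the loop structure and the fetch call are assumptions for illustration, not the repo's exact code:

```python
# Assumed fetch_friendships loop: flush the cache to disk only every
# 3000 users instead of after every single entry.
for i, user in enumerate(users):
    friendships[user["id"]] = fetch_friendship(user)  # hypothetical fetch call
    if i % 3000 == 0:
        get_or_set(out / target / friendships_file, friendships.copy(), force=True)
```

The tradeoff is that an interruption can lose up to 3000 uncached entries instead of one.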

nadesai commented 2 years ago

I notice that there are two mechanisms for caching. One is the get_or_set logic which is used for writing to friendships.json as noted in https://github.com/eleurent/twitter-graph/issues/14#issuecomment-1028292601. After doing some logging of its usage, I am not convinced there is much latency here.

The other is set_paged_results, which is used by fetch_users_paged to write to followers.json and friends.json. This logic seems more brittle, especially when the script is rerun, and there may be a bug in it. I have needed to rerun my script several times after discovering new accounts that are rate-limited or otherwise break fetches; each time, it seems like a fresh dump of all accounts is appended to the JSON file. Here is the result of some basic analysis on the file (fetching for the account @nikhilarundesai):

$ cat out/nikhilarundesai/cache/friends.json | jq '. | length'
182036
$ cat out/nikhilarundesai/cache/friends.json | jq '.[] | .screen_name' | sort -u | wc -l
5002

Note that across all of these runs I haven't actually fetched data for all of the 5002 accounts I follow yet, only ~4700, so it is curious that the cache already contains 5002 unique names.

$ cat out/nikhilarundesai/cache/friends.json | jq '.[] | select(.screen_name == "WHCOVIDResponse")' > whcovidresponse_copies.json
$ cat whcovidresponse_copies.json | jq -s '. | length'
37

Visual inspection of the file shows what looks to be 37 identical copies of the same JSON object representing the account @WHCOVIDResponse (https://twitter.com/WHCOVIDResponse).
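As a possible cleanup for a file that has already been duplicated this way, a small sketch that keeps one object per screen name (the path is taken from the commands above; this assumes screen_name is a usable unique key):

```python
import json

with open("out/nikhilarundesai/cache/friends.json") as f:
    users = json.load(f)

# Keep the last object seen for each screen_name, dropping the duplicates.
deduped = list({u["screen_name"]: u for u in users}.values())

with open("out/nikhilarundesai/cache/friends.json", "w") as f:
    json.dump(deduped, f)
```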

FlorentLefebvre commented 2 years ago

I'm working on a new update that removes the caching system for retrieving followers and friends during the process. This mid-process cache was kind of useless:
- not much time is gained when retrieving all friends if the process is stopped during fetch_users (compared to the time spent on fetch_friendships),
- we can't know whether it is a partial or final result when restarting.
So no more set_paged_results, just one get_or_set at the end of fetch_users, with the full data.

This update will also contain a new way to build edges and a parameter to trigger the caching system at a chosen rate in fetch_friendships (two big performance and corruption issues when you build a "big graph").

I'll propose an MR this weekend.

eleurent commented 2 years ago

@nadesai this fetch_user_paged mechanism was introduced in this PR: https://github.com/eleurent/twitter-graph/pull/12. I didn't see any issue while reviewing or testing it, but I haven't tried interrupting the process in the middle. Given your results, there's probably something wrong somewhere, so thanks a lot for raising this.

@FlorentLefebvre if you already have a patch for that (reverting to the previous system, IIUC?), then I think we can wait for it. However, please do make several (smaller) PRs if you have several unrelated additions (like faster edge generation); that will make reviewing and future debugging easier :)