disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Discrepancies between site stats and snapshot backup record counts on a few dates #116

Closed (pm5 closed 4 years ago)

pm5 commented 4 years ago

Numbers on the following days do not match:

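| Date | Backup snapshot count | Site stats snapshot count |
|------|----------------------:|--------------------------:|
| 4/23 | 16,578 | 23,423 |
| 4/24 | 3,872 | 99,991 |
| 4/29 | 9,349 | 67,465 |
| 5/1 | 37,117 | 52,707 |
| 5/2 | 10,041 | 38,590 |
| 5/6 | 2,532 | 72,534 |
| 5/7 | 5,382 | 52,314 |
| 5/8 | 27,532 | 48,463 |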

Tried running the S3 backup script for 4/23 a few times, but the number stays the same, so it is reproducible.
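For anyone repeating this check, here is a minimal sketch of comparing the two sets of counts per day. The function name `find_mismatches` and the `(month, day)` keys are assumptions made for illustration, not part of ZeroScraper; the sample numbers are copied from the table above.

```python
def find_mismatches(backup_counts, site_counts):
    """Return (date, backup, site, missing) for every date whose counts differ.

    Both arguments map a (month, day) tuple to a record count. `missing` is
    site minus backup, or None when a date is absent from one of the mappings.
    """
    mismatches = []
    for date in sorted(set(backup_counts) | set(site_counts)):
        backup = backup_counts.get(date)
        site = site_counts.get(date)
        if backup != site:
            missing = None if backup is None or site is None else site - backup
            mismatches.append((date, backup, site, missing))
    return mismatches


# Three rows taken from the table in this issue.
BACKUP = {(4, 23): 16578, (4, 24): 3872, (5, 8): 27532}
SITE = {(4, 23): 23423, (4, 24): 99991, (5, 8): 48463}

for (month, day), backup, site, missing in find_mismatches(BACKUP, SITE):
    print(f"{month}/{day}: backup={backup} site={site} missing={missing}")
```

Days on which both counts agree are left out of the result, so an empty list means the backup matches the site stats.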

pm5 commented 4 years ago

Umm, I couldn't figure out what was wrong, so I ran the upload script a few more times. It turns out the numbers do change, so it looks like it has something to do with network timeouts and file generation speed. Since our server hard disk has a few GB to spare, I am going to change the upload script to first generate the backup file on disk and then upload it. Umm, no. It has something to do with the dump process being killed 🤔 Now it looks more like we have used up the server memory.
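A rough sketch of the "generate on disk first, then upload" idea, which also surfaces a killed dump process instead of silently uploading a truncated file. Everything here is hypothetical: `dump_to_file`, `backup`, `DumpFailed`, and the `upload` callable are illustrative names, and the real dump command and S3 upload code in the project are not shown in this issue.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class DumpFailed(RuntimeError):
    """Raised when the dump command does not exit cleanly."""


def dump_to_file(cmd, out_path):
    """Run cmd with stdout redirected to out_path; raise if it exits non-zero."""
    with open(out_path, "wb") as out:
        proc = subprocess.run(cmd, stdout=out)
    if proc.returncode != 0:
        # On POSIX a negative return code -N means the child was killed by
        # signal N, e.g. -9 for SIGKILL, which the Linux OOM killer sends.
        Path(out_path).unlink(missing_ok=True)
        raise DumpFailed(f"dump exited with status {proc.returncode}")
    return Path(out_path)


def backup(cmd, upload, workdir=None):
    """Dump to a temporary file, then hand the finished file to upload()."""
    with tempfile.TemporaryDirectory(dir=workdir) as tmp:
        path = dump_to_file(cmd, Path(tmp) / "snapshot.dump")
        upload(path)  # only reached when the dump completed successfully


if __name__ == "__main__":
    # Stand-in dump command and uploader, for demonstration only.
    backup([sys.executable, "-c", "print('row')"],
           lambda p: print("would upload", p.stat().st_size, "bytes"))
```

Checking the exit status before uploading means a dump that was killed part-way raises an error rather than producing a backup with too few records.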

pm5 commented 4 years ago

Yep, I am fairly sure that it was because the memory was used up, after monitoring the uploading process in top. There does not seem to be enough memory to begin with. I will try to re-upload the above snapshots, but we cannot resolve this before we fix the memory problem.

pm5 commented 4 years ago

Re-uploaded all snapshot backups with mismatching record numbers. Also adjusted to a smaller default batch size when exporting data, which helped a little with the server memory problem.

But there are still some (much smaller) differences between backup and stats record numbers. There might be some other problems involved.
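To illustrate the smaller export batch size mentioned above, here is a minimal sketch that streams query results in fixed-size batches with the DB-API `fetchmany` call. It uses an in-memory SQLite database so that it is self-contained; the table name `snapshots`, the function `export_in_batches`, and the JSON-lines output format are assumptions for illustration, not the project's actual export code.

```python
import io
import json
import sqlite3


def export_in_batches(conn, query, out, batch_size=1000):
    """Write query results to `out` as JSON lines, batch_size rows at a time.

    Returns the number of rows written so it can be checked against the
    count reported by the site stats.
    """
    cur = conn.cursor()
    cur.execute(query)
    columns = [d[0] for d in cur.description]
    written = 0
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows in memory
        if not rows:
            break
        for row in rows:
            out.write(json.dumps(dict(zip(columns, row))) + "\n")
        written += len(rows)
    return written


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshots (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO snapshots (body) VALUES (?)",
                     [(f"page {i}",) for i in range(25)])
    buf = io.StringIO()
    print(export_in_batches(conn, "SELECT id, body FROM snapshots", buf, batch_size=10))
```

Whether a smaller batch actually bounds memory depends on the database driver: some client libraries buffer the whole result set on the client by default unless an unbuffered or server-side cursor is used.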

pm5 commented 4 years ago

No discrepancies in numbers were found for almost a month. I think we can close this now.