disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Recover lost PTT data #105

Closed: pm5 closed this issue 4 years ago

pm5 commented 4 years ago

In #94 we calculated the amount of data lost in February that was not auto-recoverable by the system. The results are in the SnapshotLoss table.

PTT, however, has a lot of third-party archives. We lost the following numbers of snapshots:

site_id  COUNT(*)
98       512
99       3328

From their URLs we should be able to recover these snapshots from places like https://www.pttweb.cc/bbs/HatePolitics/M.1574072804.A.2DE

andreawwenyi commented 4 years ago

Update: the HTML from pttweb is too large to fit in the raw_data field of the ArticleSnapshot table, so we opted for pttread instead to recover the missing snapshots. For example, https://www.ptt.cc/bbs/Gossiping/M.1582100863.A.FFF.html can be recovered from https://pttread.com/gossiping/m-1582100863-a-FFF

the code for recovering snapshots is here and is currently executing.
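For readers without access to the linked code, the URL mapping in the example above could be sketched roughly as follows. This is not the project's recovery script. The function name `ptt_to_pttread` and the regular expression are hypothetical, and the rule (lowercase the board name, turn `M.` and `.A.` into `m-` and `-a-`, keep the trailing ID as-is, drop `.html`) is inferred from the single example pair in this comment, so it may not hold for every article.

```python
import re

# Pattern for a PTT article URL such as
# https://www.ptt.cc/bbs/Gossiping/M.1582100863.A.FFF.html
# (hypothetical; written only to match the example in this thread)
PTT_URL = re.compile(
    r"^https?://www\.ptt\.cc/bbs/(?P<board>[^/]+)/"
    r"M\.(?P<ts>\d+)\.A\.(?P<suffix>[0-9A-Za-z]+)\.html$"
)


def ptt_to_pttread(url: str) -> str:
    """Map a ptt.cc article URL to the pttread URL assumed to mirror it."""
    m = PTT_URL.match(url)
    if m is None:
        raise ValueError(f"not a recognised PTT article URL: {url}")
    # Assumption from the one example: board is lowercased, suffix case is kept.
    return f"https://pttread.com/{m['board'].lower()}/m-{m['ts']}-a-{m['suffix']}"


if __name__ == "__main__":
    print(ptt_to_pttread("https://www.ptt.cc/bbs/Gossiping/M.1582100863.A.FFF.html"))
```

Raising on an unrecognised URL, instead of returning a guess, keeps non-article URLs from silently producing a wrong pttread address.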

andreawwenyi commented 4 years ago

Finished collecting one snapshot for each lost PTT article, checked with the following SQL:

select article_id, count(*) from ArticleSnapshot
where article_id in 
(select article_id from SnapshotLoss
where url like "%ptt.cc%")
group by article_id;
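One caveat with the query above: it only lists articles that already have at least one snapshot, so a lost article with no snapshot at all would simply be absent from the result. A complementary check could look for exactly those rows with a LEFT JOIN. The sketch below is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for the real database, the two-column table definitions are a hypothetical minimal schema (only `article_id`, `url`, and `raw_data` are named in this thread), and the sample rows are made up for the demo.

```python
import sqlite3

# Lost PTT articles that still have zero rows in ArticleSnapshot.
MISSING_SQL = """
SELECT l.article_id
FROM SnapshotLoss AS l
LEFT JOIN ArticleSnapshot AS s ON s.article_id = l.article_id
WHERE l.url LIKE '%ptt.cc%'
GROUP BY l.article_id
HAVING COUNT(s.article_id) = 0
ORDER BY l.article_id
"""


def articles_without_snapshot(conn: sqlite3.Connection) -> list:
    """Return article_ids of lost PTT articles that have no snapshot yet."""
    return [row[0] for row in conn.execute(MISSING_SQL)]


def demo() -> list:
    # Hypothetical minimal schema and made-up rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE SnapshotLoss (article_id INTEGER, url TEXT);
        CREATE TABLE ArticleSnapshot (article_id INTEGER, raw_data TEXT);
        INSERT INTO SnapshotLoss VALUES
            (1, 'https://www.ptt.cc/bbs/Gossiping/M.1582100863.A.FFF.html'),
            (2, 'https://www.ptt.cc/bbs/HatePolitics/M.1574072804.A.2DE.html'),
            (3, 'https://example.com/article');
        INSERT INTO ArticleSnapshot VALUES (1, '<html>recovered</html>');
        """
    )
    return articles_without_snapshot(conn)


if __name__ == "__main__":
    print(demo())
```

An empty result from `articles_without_snapshot` would mean every lost PTT article has at least one snapshot, which is the condition this issue needs before closing.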

Closing this issue.

pm5 commented 4 years ago

I'm dropping the SnapshotLoss table and the Snapshot table (which was used to calculate SnapshotLoss) from the db since this is resolved.

andreawwenyi commented 4 years ago

@pm5, I think we still need the SnapshotLoss table to know which PTT articles were recovered from the pttread website?

pm5 commented 4 years ago

Oh, okay. Let's keep it until we work that out (we will eventually need to archive this information in some way).