disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

feat: webapi for publication #65

Closed andreawwenyi closed 4 years ago

andreawwenyi commented 4 years ago

@pm5 @chihaoyo I need some advice! Reading publications from the database takes a long time, so right now I save all the matched publications the first time a user sends a query. When the user then requests other pages, they are loaded from that saved list. I'm not sure how memory-intensive this will become as the publication db grows, so I wonder if there's a better way to do it. In addition, I think I need a mechanism to clear the saved dictionary (self.pubs) so that users can get updated results for the same search string, but I don't know how to do that yet.
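
For illustration, the approach described in this comment might look roughly like the sketch below. Only `self.pubs` comes from the comment; the class name, `fetch_all`, `get_page`, and the page size are hypothetical, and the slow database read is replaced by a stand-in function.

```python
from typing import Callable, Dict, List


class PublicationCache:
    """Hypothetical sketch: keep every match per search string, then page from memory."""

    def __init__(self, fetch_all: Callable[[str], List[dict]], page_size: int = 20):
        # fetch_all(search_string) stands in for the slow database read.
        self.fetch_all = fetch_all
        self.page_size = page_size
        self.pubs: Dict[str, List[dict]] = {}

    def get_page(self, search_string: str, page: int) -> List[dict]:
        # Only the first request for a search string reads from the database.
        if search_string not in self.pubs:
            self.pubs[search_string] = self.fetch_all(search_string)
        start = (page - 1) * self.page_size
        return self.pubs[search_string][start:start + self.page_size]


calls = []


def fake_fetch_all(search_string: str) -> List[dict]:
    # Assumed stand-in data: 45 fake publications per search string.
    calls.append(search_string)
    return [{"id": i, "title": f"{search_string} #{i}"} for i in range(45)]


cache = PublicationCache(fake_fetch_all, page_size=20)
first_page = cache.get_page("台灣", 1)
last_page = cache.get_page("台灣", 3)
```

Memory use in a design like this grows with the number of distinct search strings as well as with the size of each result list, which is the concern raised above.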

Some stats:
"高圓圓": 223 publications, takes 52.87s to load everything
"蔡英文": 43,310 publications, takes 59.01s
"台灣": 203,882 publications, takes 2m 13s

I have tried using LIMIT and OFFSET in SQL, but it doesn't really reduce the read time: each page takes roughly the same amount of time, so I figure it's not ideal.
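
A minimal sketch of LIMIT/OFFSET paging, for reference. The thread does not show the actual query, so this assumes a substring filter of the form `LIKE '%term%'`; the table, column names, and data are made up, and an in-memory SQLite database stands in for the project's MySQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publication (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO publication (title) VALUES (?)",
    [(f"news about {'Taiwan' if i % 2 == 0 else 'weather'} {i}",) for i in range(100)],
)


def query_page(term, page, page_size=10):
    # With a leading wildcard, an ordinary index on title cannot narrow the search,
    # so the database still has to test rows one by one to find the matches.
    rows = conn.execute(
        "SELECT id, title FROM publication "
        "WHERE title LIKE ? ORDER BY id LIMIT ? OFFSET ?",
        (f"%{term}%", page_size, (page - 1) * page_size),
    )
    return rows.fetchall()


page_1 = query_page("Taiwan", 1)
page_5 = query_page("Taiwan", 5)
```

If the real query has this shape, LIMIT and OFFSET reduce how many rows are returned but not necessarily how many are examined, which could be one reason the per-page time barely changed. That is an assumption, not something the thread confirms.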

pm5 commented 4 years ago

Let me think about this a bit.

pm5 commented 4 years ago

Let me see whether the full-text index support for Chinese that MySQL added in 5.7.6 could speed things up. If it works, we won't have to make our code more complicated.

andreawwenyi commented 4 years ago

@pm5 okay! I have tried with MATCH...AGAINST syntax, but it missed a lot of matches...

pm5 commented 4 years ago

Could you check whether 88998209dc9fc395d82fb20cd4b8c6d6f565ecdf works? You must run this migration on ArticleParser before trying it.

If it works on our dev machines, we then have to persuade the production db to build the search index.

pm5 commented 4 years ago

After some experiments, I don't think the MySQL full-text index is a good way to go. Unless we add another service to the stack, say Redis or ElasticSearch, your solution is by far the better approach.

andreawwenyi commented 4 years ago

Added a checkpoint that refreshes the in-memory publications after 1 hour in commit 9094b19. Ready to merge into master.
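
One way such a refresh could work is sketched below. This is not the code from commit 9094b19; the class name, `fetch_all`, and the injectable `clock` parameter are assumptions made for illustration, and the clock is injected only so the expiry can be exercised without waiting an hour.

```python
import time


class ExpiringPublicationCache:
    """Hypothetical sketch: reload a search string's results once they are an hour old."""

    def __init__(self, fetch_all, ttl_seconds=3600, clock=time.monotonic):
        self.fetch_all = fetch_all      # stands in for the slow database read
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.pubs = {}                  # search string -> (loaded_at, publications)

    def get(self, search_string):
        now = self.clock()
        entry = self.pubs.get(search_string)
        # Reload when nothing is cached yet or the cached entry has expired.
        if entry is None or now - entry[0] >= self.ttl_seconds:
            entry = (now, self.fetch_all(search_string))
            self.pubs[search_string] = entry
        return entry[1]


fake_now = [0.0]
loads = []


def fake_fetch_all(search_string):
    loads.append(search_string)
    return [f"{search_string} result {len(loads)}"]


cache = ExpiringPublicationCache(fake_fetch_all, ttl_seconds=3600, clock=lambda: fake_now[0])
fresh = cache.get("蔡英文")
fake_now[0] = 1800.0           # 30 minutes later: served from memory
still_cached = cache.get("蔡英文")
fake_now[0] = 3600.0           # 1 hour later: reloaded
refreshed = cache.get("蔡英文")
```

Checking the age on read keeps the sketch free of background threads; an entry that is never requested again simply stays in memory until something else evicts it.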