disinfoRG ZeroScraper issues

disinfoRG / ZeroScraper

Web scraper made by 0archive.

https://0archive.tw

MIT License

10 stars, 2 forks

site.py include arguments to configure start_url, article, following regex for discover

#98 andreawwenyi closed 4 years ago
0
A customized url normalizer for our article

#97 andreawwenyi opened 4 years ago
0
The Initium login issue

#96 andreawwenyi opened 4 years ago
5
create new spiders for appledaily

#95 andreawwenyi closed 4 years ago
1
recover from data loss

#94 pm5 closed 4 years ago
5
publish full-text datasets for selected sites

#93 pm5 closed 4 years ago
1
Add stats to api

#92 andreawwenyi closed 4 years ago
0
accelerate process to remove duplicate articles

#91 andreawwenyi closed 4 years ago
5
Need to sign in to appledaily for complete news content

#90 andreawwenyi closed 4 years ago
0
calculate daily number of updates and discovers by sites and add to new db table

#89 andreawwenyi closed 4 years ago
0
seems not to be crawling enough articles from news websites

#88 andreawwenyi closed 4 years ago
1
move airtable base id into env variable and update instruction

#87 andreawwenyi closed 4 years ago
0
`ns update` seems to be killed often

#86 pm5 closed 4 years ago
2
Re-enable parallel crawling in site discover/update

#85 pm5 opened 4 years ago
2
dcard update: slow to prepare posts_to_update

#84 andreawwenyi closed 4 years ago
1
Error doing update on Dcard

#83 pm5 opened 4 years ago
3
Do we really need selenium to read kknews and read01?

#82 pm5 opened 4 years ago
2
ArticleSnapshot table rotation

#81 pm5 closed 4 years ago
3
Remove batch_discover.py and execute_spiders.py

#80 pm5 closed 4 years ago
0
In-memory filter to exclude duplicate articles

#79 pm5 closed 4 years ago
3
Handle 404 for article snapshot update

#78 pm5 closed 4 years ago
1
Crawled https://www.medicalnewstoday.com/ from https://anntw.com/

#77 pm5 closed 4 years ago
1
Several oddities to the MySQL db

#76 pm5 opened 4 years ago
1
Stop updating snapshot for articles with too many 404

#75 pm5 closed 4 years ago
1
update parallelism issue: many connection errors

#74 andreawwenyi opened 4 years ago
2
update readme for airtable spider

#73 andreawwenyi closed 4 years ago
0
update airtable spider

#72 andreawwenyi closed 4 years ago
0
Run sites update in parallel

#71 pm5 closed 4 years ago
0
ns.py update cannot handle 404

#70 andreawwenyi closed 4 years ago
2
404 handling

#69 andreawwenyi closed 4 years ago
0
command line interface: for issue #57

#68 andreawwenyi closed 4 years ago
0
sql connection error while using ns.py update

#67 andreawwenyi closed 4 years ago
1
News websites

#66 andreawwenyi closed 4 years ago
2
feat: webapi for publication

#65 andreawwenyi closed 4 years ago
6
add site_id to crawler stats of discover spider

#64 andreawwenyi closed 4 years ago
0
update spider - only use selenium when necessary for the particular site

#63 andreawwenyi closed 4 years ago
1
dcard 404 handling

#62 andreawwenyi closed 4 years ago
1
Monitor api

#61 andreawwenyi closed 4 years ago
4
Fix Python dependency problems

#60 pm5 closed 4 years ago
0
Monitoring API

#59 pm5 closed 4 years ago
1
Add site ID or name to scrapy log

#58 pm5 closed 4 years ago
1
Organise executables/commands we have

#57 pm5 closed 4 years ago
3
make update site_wise

#56 andreawwenyi closed 4 years ago
1
dcard: handling of removed articles

#55 andreawwenyi closed 4 years ago
2
fix issue of missing content

#54 andreawwenyi closed 4 years ago
0
Manage PID with db

#53 pm5 closed 4 years ago
1
Update scraper takes more than 24 hours to run

#52 pm5 closed 4 years ago
0
Selenium start is quite slow on middle2

#51 pm5 opened 4 years ago
0
Dcard spider seems to only get the excerpt of a post

#50 pm5 closed 4 years ago
1
make update_content_spider more flexible

#49 andreawwenyi closed 4 years ago
2
