disinfoRG/ZeroScraper
Web scraper made by 0archive.
https://0archive.tw
MIT License
10
stars
2
forks
issues
site.py include arguments to configure start_url, article, following regex for discover
#98
andreawwenyi
closed
4 years ago
0
A customized url normalizer for our article
#97
andreawwenyi
opened
4 years ago
0
the initium log in issue
#96
andreawwenyi
opened
4 years ago
5
create new spiders for appledaily
#95
andreawwenyi
closed
4 years ago
1
recover from data loss
#94
pm5
closed
4 years ago
5
publish full-text datasets for selected sites
#93
pm5
closed
4 years ago
1
Add stats to api
#92
andreawwenyi
closed
4 years ago
0
accelerate process to remove duplicate articles
#91
andreawwenyi
closed
4 years ago
5
Need to sign in to appledaily for complete news content
#90
andreawwenyi
closed
4 years ago
0
calculate daily number of updates and discovers by sites and add to new db table
#89
andreawwenyi
closed
4 years ago
0
seems not to be crawling enough articles from news websites
#88
andreawwenyi
closed
4 years ago
1
move airtable base id into env variable and update instruction
#87
andreawwenyi
closed
4 years ago
0
`ns update` seems to be killed often
#86
pm5
closed
4 years ago
2
Re-enable parallel crawling in site discover/update
#85
pm5
opened
4 years ago
2
dcard update: slow to prepare posts_to_update
#84
andreawwenyi
closed
4 years ago
1
Error doing update on Dcard
#83
pm5
opened
4 years ago
3
Do we really need selenium to read kknews and read01?
#82
pm5
opened
4 years ago
2
ArticleSnapshot table rotation
#81
pm5
closed
4 years ago
3
Remove batch_discover.py and execute_spiders.py
#80
pm5
closed
4 years ago
0
In-memory filter to exclude duplicate articles
#79
pm5
closed
4 years ago
3
Handle 404 for article snapshot update
#78
pm5
closed
4 years ago
1
Crawled https://www.medicalnewstoday.com/ from https://anntw.com/
#77
pm5
closed
4 years ago
1
Several oddities to the MySQL db
#76
pm5
opened
4 years ago
1
Stop updating snapshot for articles with too many 404
#75
pm5
closed
4 years ago
1
update parallelism issue: many connection errors
#74
andreawwenyi
opened
4 years ago
2
update readme for airtable spider
#73
andreawwenyi
closed
4 years ago
0
update airtable spider
#72
andreawwenyi
closed
4 years ago
0
Run sites update in parallel
#71
pm5
closed
4 years ago
0
ns.py update cannot handle 404
#70
andreawwenyi
closed
4 years ago
2
404 handling
#69
andreawwenyi
closed
4 years ago
0
command line interface: for issue #57
#68
andreawwenyi
closed
4 years ago
0
sql connection error while using ns.py update
#67
andreawwenyi
closed
4 years ago
1
News websites
#66
andreawwenyi
closed
4 years ago
2
feat: webapi for publication
#65
andreawwenyi
closed
4 years ago
6
add site_id to crawler stats of discover spider
#64
andreawwenyi
closed
4 years ago
0
update spider - only use selenium when necessary for the particular site
#63
andreawwenyi
closed
4 years ago
1
dcard 404 handling
#62
andreawwenyi
closed
4 years ago
1
Monitor api
#61
andreawwenyi
closed
4 years ago
4
Fix Python dependency problems
#60
pm5
closed
4 years ago
0
Monitoring API
#59
pm5
closed
4 years ago
1
Add site ID or name to scrapy log
#58
pm5
closed
4 years ago
1
Organise executables/commands we have
#57
pm5
closed
4 years ago
3
make update site_wise
#56
andreawwenyi
closed
4 years ago
1
dcard: handling removed posts
#55
andreawwenyi
closed
4 years ago
2
fix issue of missing content
#54
andreawwenyi
closed
4 years ago
0
Manage PID with db
#53
pm5
closed
4 years ago
1
Update scraper takes more than 24 hours to run
#52
pm5
closed
4 years ago
0
Selenium start is quite slow on middle2
#51
pm5
opened
4 years ago
0
Dcard spider seems to only get the excerpt of a post
#50
pm5
closed
4 years ago
1
make update_content_spider more flexible
#49
andreawwenyi
closed
4 years ago
2