Nixes / PageScraper
Library that uses a heuristic to find and return the main contents of a news article. Algorithm developed by @Nixes.
8 stars, 0 forks
issues
#41 Add GitHub actions support - Nixes, closed 2 years ago, 0 comments
#40 Separate metadata scraping from article content scraping - Nixes, opened 2 years ago, 0 comments
#39 Tag extraction - Nixes, closed 3 years ago, 1 comment
#38 refactor caching out to own class - Nixes, closed 4 years ago, 0 comments
#37 Composerfy - Nixes, closed 4 years ago, 0 comments
#36 publish composer package and simple class usage example - Nixes, opened 4 years ago, 0 comments
#35 scraping failure case - Nixes, opened 6 years ago, 0 comments
#34 When using json output second load of page (from cache) produces incorrect output - Nixes, closed 6 years ago, 1 comment
#33 debug=true no longer works - Nixes, closed 6 years ago, 0 comments
#32 Separate class files - Nixes, closed 6 years ago, 0 comments
#31 Amp support - Nixes, closed 6 years ago, 0 comments
#30 Use AMP page where available - Nixes, closed 6 years ago, 0 comments
#29 Oop ify - Nixes, closed 7 years ago, 0 comments
#28 Fix regressions - Nixes, closed 7 years ago, 0 comments
#27 add simple file based caching - Nixes, closed 7 years ago, 3 comments
#26 Add separate pre-processor for journal style pdf formatting - Nixes, opened 7 years ago, 0 comments
#25 Lists are not properly extracted when they are in a blockquote - Nixes, opened 7 years ago, 0 comments
#24 Use twitter meta tags as another metadata source - Nixes, opened 7 years ago, 0 comments
#23 refactor checkNode - Nixes, closed 8 years ago, 1 comment
#22 detect when scraping has failed - Nixes, closed 8 years ago, 1 comment
#21 failure to correctly convert relative url to absolute one when base address is a redirect - Nixes, closed 8 years ago, 0 comments
#20 add ability to detect and scrape multiple page stories - Nixes, opened 8 years ago, 1 comment
#19 update algorithm to be immune to confusion by suggested articles - Nixes, closed 7 years ago, 1 comment
#18 support for embedded video extraction - Nixes, opened 8 years ago, 1 comment
#17 Improve styling of blockquotes. - Nixes, closed 8 years ago, 0 comments
#16 support parsing figure tags. - Nixes, opened 8 years ago, 0 comments
#15 Blacklist refactor complete - Nixes, closed 8 years ago, 0 comments
#14 Add travis and some unit tests - Nixes, closed 8 years ago, 3 comments
#13 Error Cases - Nixes, opened 8 years ago, 2 comments
#12 simple configuration file - Nixes, opened 8 years ago, 0 comments
#11 Improve error handling! - Nixes, closed 8 years ago, 1 comment
#10 Selective image compression - Nixes, opened 8 years ago, 1 comment
#9 Sometimes quoted text gets repeated - Nixes, closed 8 years ago, 1 comment
#8 Strange formatting issues - Nixes, closed 8 years ago, 4 comments
#7 Make DOM id/class blacklist external + refactor removeJunk - Nixes, closed 8 years ago, 1 comment
#6 Comment section scraper? - Nixes, closed 8 years ago, 1 comment
#5 Potential shortcuts for finding content in pages? - Nixes, closed 8 years ago, 1 comment
#4 Detect Relative links and convert to absolute links - Nixes, closed 8 years ago, 3 comments
#3 Author scraping - Nixes, closed 8 years ago, 1 comment
#2 Add title scraping - Nixes, closed 8 years ago, 1 comment
#1 Failure Cases - Nixes, opened 8 years ago, 2 comments