fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works

Apache License 2.0

1.98k stars 412 forks

issues

Add robustness for pages missing title

#274 yldoctrine opened 1 hour ago
0
Remove newspaper3k from setup

#273 Medno opened 1 hour ago
0
Fix type annotations for earlier Python versions

#272 Medno closed 8 hours ago
2
Add an additional sitemap check in Sitemap crawlers

#271 Medno closed 1 day ago
3
Add 4 date metadata patterns in date extraction

#270 anteverse closed 2 days ago
2
Add Redis pipeline

#269 anteverse closed 2 days ago
5
Add unique constraint on column `url` on table `CurrentVersions` in Postgres pipeline

#268 anteverse closed 1 week ago
3
Cast get_max_url_file_name_length result to int

#267 Medno closed 1 week ago
0
Allow RssCrawler to fall back if the RSS feed is empty

#266 Medno closed 1 week ago
4
Use SPIDER_MODULES field from configuration file if it's defined

#265 Medno closed 3 weeks ago
2
Use newspaper4k instead of abandoned newspaper3k

#264 t1h0 closed 3 weeks ago
3
Use newspaper4k instead of abandoned newspaper3k

#263 t1h0 closed 3 weeks ago
3
Add the possibility to define a schema for the PostgreSQL database

#261 yldoctrine closed 4 weeks ago
7
Unable to change URLS from example URLS

#260 rinaforristal closed 4 weeks ago
1
ImportError: libpq.so.5: cannot open shared object file: No such file or directory

#259 Pasanlaksitha closed 2 months ago
1
Reuters news script failed

#258 pepingreat closed 2 months ago
1
maintext article attribute length limitation

#257 zurek11 closed 7 months ago
1
Make SimpleCrawler.fetch_urls() accept custom timeout

#256 StepinSilence closed 7 months ago
1
Implement user agent functionality similar to newspaper3k

#255 GiridharRNair closed 6 months ago
0
Cannot extract main text

#253 simplew2011 closed 7 months ago
1
fixes #172 and #169: NewsPlease.from_urls() - use multiprocessing

#252 arcolife closed 10 months ago
3
Change crawler to RecursiveCrawler when used as a library and store to MongoDB

#251 Anhduchb01 closed 2 months ago
1
Unable to Crawl and Save PDF files

#250 simrankaur20 closed 10 months ago
1
Fix tag

#249 wang-haoxian closed 1 year ago
0
Sharockys docker action

#248 wang-haoxian closed 1 year ago
0
Newer version of ElasticSearch API changed a lot

#247 wang-haoxian closed 1 year ago
0
Adaptation for recent version of ElasticSearch Python API

#246 wang-haoxian closed 1 year ago
1
Replace cchardet with the Python 3.11.x compatible faust-cchardet

#245 eliias closed 10 months ago
0
`NewsPlease.from_urls` updated to behave consistently with 404 urls.

#244 loganamcnichols closed 10 months ago
0
NewsPlease.from_urls behaves inconsistently in situations where a url results in 404

#243 loganamcnichols closed 10 months ago
0
Scrape by Domain

#242 firmai closed 7 months ago
1
Error : You must `download()` an article first!

#241 PYogesh closed 1 year ago
2
Update README.md, add descriptions for JSON

#240 Ahacad closed 1 year ago
0
Specify more recent awscli dependency to avoid dependency resolution issues

#239 phoerious closed 4 weeks ago
8
DateFilter is never used

#238 namlede closed 1 year ago
7
Failed to build for python 3.11

#237 mattiasrubenson closed 1 year ago
3
Get only the recursive list of URLs using the Library mode

#236 bakrianoo closed 1 year ago
2
ModuleNotFoundError: No module named 'newsplease'

#235 ghost closed 1 year ago
3
Proxy Server configuration (HttpProxyMiddleware)

#234 bkrishnap opened 1 year ago
4
Configure options to optimize the crawling and extraction process

#232 kvasilopoulos closed 4 weeks ago
0
Run news-please in the background

#231 noerarief23 closed 1 year ago
2
Temporary failure in name resolution

#229 sara-02 closed 1 year ago
1
Avoiding restart of commoncrawl scraping process

#228 joemkwon closed 2 years ago
0
Fix Elasticsearch package version

#227 bakrianoo closed 1 year ago
1
Common Crawl crawler: adapt to new data access scheme, fixes #223

#226 sebastian-nagel closed 2 years ago
0
Execution possible on neither current macOS nor Windows 10

#225 vivianevv closed 2 years ago
0
crawl_from_commoncrawl crashes when attempting to parse the date from warc.paths.gz

#224 Loumstar closed 4 weeks ago
1
Update s3://commoncrawl/ access scheme

#223 sebastian-nagel closed 2 years ago
9
Try migrating to fsspec

#222 sshleifer closed 2 years ago
0
Required time by commoncrawl extractor and bug in logging

#219 lucadiliello closed 1 year ago
1