codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars · 2.1k forks

Project dependencies may have API risk issues #954

Closed — PyDeps closed this issue 1 year ago

PyDeps commented 1 year ago

Hi. In newspaper, inappropriate dependency version constraints can introduce risks.

Below are the dependencies and version constraints that the project is currently using:

beautifulsoup4>=4.4.1
cssselect>=0.9.2
feedfinder2>=0.0.4
feedparser>=5.2.1
jieba3k>=0.35.1
lxml>=3.6.0
nltk>=3.2.1
Pillow>=3.3.0
pythainlp>=1.7.2
python-dateutil>=2.5.3
PyYAML>=3.11
requests>=2.10.0
tinysegmenter==0.3
tldextract>=2.0.1
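The constraint styles in the list above can be classified mechanically. A minimal stdlib-only sketch (the classification labels and the `classify` helper are mine, not part of PyDeps' tooling):

```python
import re

# Split a requirement line such as "lxml>=3.6.0" into (name, operator, version).
REQ = re.compile(r"^([A-Za-z0-9._-]+)\s*(==|>=|<=|~=|>|<)\s*(\S+)$")

def classify(line):
    name, op, version = REQ.match(line).groups()
    if op == "==":
        return name, "exact pin (conflict risk)"
    if op in (">=", ">"):
        return name, "no upper bound (missing-API risk)"
    return name, "bounded"

print(classify("tinysegmenter==0.3"))  # ('tinysegmenter', 'exact pin (conflict risk)')
print(classify("lxml>=3.6.0"))         # ('lxml', 'no upper bound (missing-API risk)')
```

Note this toy regex handles only single-clause specifiers, not full PEP 508 requirement syntax.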

An exact pin (==) risks dependency conflicts because its scope is overly strict, while a constraint with no upper bound (or *) risks missing-API errors, since the latest version of a dependency may remove APIs the project relies on.
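Both failure modes can be illustrated with version specifiers. A hedged sketch assuming the `packaging` library (shipped alongside pip/setuptools) is available; the version numbers here are illustrative:

```python
from packaging.specifiers import SpecifierSet

# An unbounded constraint accepts every future release, including ones
# that may have removed APIs the project relies on.
unbounded = SpecifierSet(">=3.6.0")
print(unbounded.contains("99.0.0"))  # True -- any future release matches

# An exact pin accepts exactly one version, which invites conflicts when
# another package in the environment requires a different version.
pinned = SpecifierSet("==0.3")
print(pinned.contains("0.4"))        # False

# A bounded range, as the report suggests, balances the two.
bounded = SpecifierSet(">=0.2,<=0.4")
print(bounded.contains("0.3"))       # True
```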

After further analysis of this project, the version constraints could be changed as follows (two alternatives are offered for Pillow and for requests):

beautifulsoup4: >=4.10.0,<=4.11.1
feedparser: >=6.0.0b1,<=6.0.10
nltk: >=3.2.2,<=3.7
Pillow: ==9.2.0, or >=2.0.0,<=9.1.1
python-dateutil: >=2.5.0,<=2.6.1
requests: >=0.7.0,<=2.24.0, or ==2.26.0
tinysegmenter: >=0.2,<=0.4

The suggested modifications above reduce dependency conflicts as much as possible while adopting the latest versions that do not trigger API errors in the project.

The project currently invokes all of the following methods.

Methods called from beautifulsoup4:
bs4.BeautifulSoup
Methods called from feedparser:
feedparser.parse
Methods called from nltk:
collections.OrderedDict.items
collections.OrderedDict
nltk.stem.isri.ISRIStemmer.stem
nltk.download
nltk.data.load
nltk.stem.isri.ISRIStemmer
nltk.tokenize.wordpunct_tokenize
Methods called from Pillow:
PIL.ImageFile.Parser.feed
PIL.Image.open
PIL.ImageFile.Parser
Methods called from python-dateutil:
dateutil.parser.parse
Methods called from requests:
requests.utils.get_encodings_from_content
requests.get
Methods called from tinysegmenter:
tinysegmenter.TinySegmenter.tokenize
tinysegmenter.TinySegmenter
All methods called across the project:
a.is_valid_url
math.fabs
os.path.exists
os.path.join
self.article.extractor.get_meta_data
nodes_with_text.append
self.download
self.parser.getAttribute.strip
summaries.sort
domain_to_filename
newspaper.urls.get_domain
Dispatch.join
self.set_meta_description
self.create
self.parser.getElementsByTag
codecs.open.read
pickle.load
re.sub
urllib.parse.urlparse.startswith
node.itertext
self.clean_body_classes
l.strip
newspaper.urls.valid_url
sorted
keywords
self.parser.stripTags
os.path.isabs
get_depth
raw_html.encode.encode
lxml.etree.strip_tags
p_url.endswith
parse_byline
self.config.get_parser.fromstring
img_tag.get.get_domain
images.Scraper.satisfies_requirements
self.assertFalse
self.get_urls
ExhaustiveFullTextCase.check_url
node.xpath
os.system
url_part.replace.replace
self.parser.previousSiblings
self.set_meta_site_name
bs4.BeautifulSoup.find
self.assertDictEqual
sys.path.insert
concurrent.futures.ProcessPoolExecutor
self.pool.wait_completion
a.is_valid_body
re.findall
set
score
self.is_boostable
conjunction.lower
logging.getLogger.warning
self.links_to_text
nodes.drop_tag
self.article.download
os.path.abspath
w.strip
path.split.split
join.strip
re.split
os.path.getmtime
self.StopWordsKorean.super.__init__
ParsingCandidate
keys.titleWords.sentences.score.most_common
self.set_summary
self.replace_walk_left_right
self.category_urls
tags.append
enumerate
dict.keys
self.get_img_urls
title_text_fb.filter_regex.sub.lower
key.split.split
requests.get.raise_for_status
urllib.parse.urlparse.endswith
self.remove_trailing_media_div
self._parse_scheme_file
w.endswith
self.extractor.extract_tags
nodes_to_remove.append
get_base_domain
self.language.self.stopwords_class.get_stopword_count
utils.StringSplitter
tinysegmenter.TinySegmenter.tokenize
float
self.candidate_words
self.assertCountEqual
self._parse_scheme_http
lxml.html.clean.Cleaner.clean_html
self.get_object_tag
self.extractor.get_authors
node.xpath.remove
x.lower
TimeoutError
self.extractor.get_meta_keywords.split
self.parser.getComments
lxml.etree.tostring
kwargs.str.args.str.encode
self.assertNotEqual
curname.append
urllib.parse.urlsplit
replacement_text.append
self.remove_punctuation
clean_url.startswith
bs4.BeautifulSoup
min
Dispatch
div.insert
child_tld.subdomain.split
img.crop.histogram
_get_html_from_response
node.set
self.parse
nlp.keywords
split.path.split
self.set_text
cur_articles.items
title_piece.strip
codecs.open.readlines
hashlib.md5
len
final_url.hashlib.md5.hexdigest
item.getparent
title.filter_regex.sub.lower
re.match
urls.get_path.startswith
cls.fromstring
f.readlines
summaries.append
split_words.split
nltk.stem.isri.ISRIStemmer
self.parser.childNodesWithText
join.splitlines
self.convert_to_html
self.get_top_node
self.set_meta_keywords
img.crop.crop
outputformatters.OutputFormatter
source.Source.build
raw_html.hashlib.md5.hexdigest
self.remove_negativescores_nodes
bool
self.clean_article_tags
self.parser.nodeToString
open
self.parser.getChildren
node.attrib.get
newspaper.Article
main
cleaners.DocumentCleaner.clean
self.extractor.get_meta_data
clean_url.encode
self.get_parse_candidate
self.get_embed_code
self._get_category_urls
agent.strip
network.multithread_request
range
txts.extend
item.lower
lxml.html.HtmlElement
map
self.get_flushed_buffer
url_to_crawl.replace
self.nlp
collections.defaultdict
cur_articles.keys
self.remove_nodes_regex
self.remove_empty_tags
self.set_top_img_no_check
img_tag.get.get_scheme
list.remove
self.set_article_html
node.clear
self.update_node_count
href.strip
MRequest
newspaper.build.size
random.randint
f.split.split.sort
utils.RawHelper.get_parsing_candidate
self.set_meta_img
self.extractor.get_category_urls
StringReplacement
i.strip
node.getchildren
article.Article.parse
nltk.download
self.set_canonical_link
nlp.load_stopwords
join
queue.Queue
outputformatters.OutputFormatter.update_language
io.StringIO.read
traceback.print_exc
newspaper.Source.clean_memo_cache
codecs.open.close
self.parser.css_select
x.strip.lower
urls.prepare_url
self.text.split
path.FileHelper.loadResourceFile.splitlines
codecs.open.write
self.start
urllib.parse.urlunparse
self.get_resource_path
newspaper.extractors.ContentExtractor
re.compile.sub
utils.memoize_articles
videos.extractors.VideoExtractor
tempfile.gettempdir
self.get_stopwords_class
x.strip
collections.OrderedDict
utils.ReplaceSequence.create
newspaper.languages
config.get_parser.fromstring
self.set_meta_data
urllib.parse.quote
GOOD.lower
sentence_position
freq.items
unit_tests.read_urls
response.raw.read
newspaper.fulltext
self.parser.previousSibling
self.extractor.get_meta_lang
self.convert_to_text
re.search
outputformatters.OutputFormatter.get_formatted
self.tablines_replacements.replaceAll
str_to_image
title_score
configuration.Configuration
string.replace
url_to_filetype.lower
root.index
cls.get_unicode_html
jieba.cut
utils.extend_config
f.read.splitlines
self.get_node_gravity_score
logging.getLogger.critical
clean_url.decode
newspaper.network.sync_request
utils.get_available_languages
dbs
utils.ReplaceSequence.create.append
title_text.filter_regex.sub.lower.startswith
self.largest_image_url
newspaper.Article.download
self.extractor.calculate_best_node
self.extractor.update_language
distutils.core.setup
self._get_canonical_link
int.lower
node.getnext
self.add_siblings
collections.OrderedDict.items
self.replace_with_text
nltk.tokenize.wordpunct_tokenize
self.remove_punctuation.lower
self.tasks.join
self.assertGreaterEqual
self.extractor.get_meta_description
self.setDaemon
splitter.split
str.maketrans
square_image
newspaper.Article.parse
item.getparent.remove
url_to_filetype
config_items.items
get_request_kwargs
function
self.StopWordsChinese.super.__init__
benchmark
property
node.drop_tag
split.path.startswith
self.assertTrue
logging.getLogger.setLevel
img_tag.get.get_path
self.get_siblings_content.append
domain_counters.get
self.parser.setAttribute
codecs.open
self.replace_with_para
max
self.parser.getText.split
index.self.articles.set_html
configuration.Configuration.get_parser
d.strip
self.config.get_stopwords_class
time.time
self.set_imgs
img_tag.get.prepare_url
self.feed_urls
urllib.parse.urlunparse.strip
dict
network.get_html_2XX_only
self.StopWordsHindi.super.__init__
ConcurrencyException
self._generate_articles.extend
utils.ReplaceSequence.create.append.append
content.decode.translate
self.extractor.get_title
prepare_image
self.get_video
WordStats.set_stopword_count
urls.get_domain
self.article.nlp
urllib.parse.urlunsplit
f.split.split
cls.nodeToString
self.extractor.get_publishing_date
parent_nodes.append
qry_item.startswith
mthreading.ThreadPool.wait_completion
self.get_siblings_content
redirect_back
self.extractor.get_urls.get_domain
self._get_title
str
line.strip
self.parser.fromstring
list
logging.getLogger.info
self.extractor.get_urls.prepare_url
self.extractor.get_meta_site_name
soup.find.split
self.download_feeds
self.get_src
self.parser.textToPara
self.extractor.get_urls
sum
logging.getLogger.debug
join.split
logging.getLogger.warn
cur_articles.values
self.config.get_language
int.strip
hashlib.sha1
copy.deepcopy
node.getparent
collections.Counter
self.clean_para_spans
self.parser.getParent
self.parser.remove
self.set_keywords
self.walk_siblings
self.StopWordsJapanese.super.__init__
self.tasks.get
mthread_run
response.raw.close
unittest.main
urls.url_to_filetype
list.extend
ArticleException
Category
source.Source
result.append
mthreading.ThreadPool
bs4.UnicodeDammit
title_text_h1.filter_regex.sub.lower
urls.valid_url
math.log
current.filter_regex.sub.lower
ord
img_tag.get
int
self.extractor.get_favicon
images.Scraper.largest_image_url
key.split.strip
sys.exc_info
method
newspaper.Source.build
node.getparent.remove
super
img_url.lower
self.resp.raise_for_status
executor.map
self.set_top_img
action
newspaper.Source.download
utils.StringReplacement
self.article.extractor.get_meta_data.values
isinstance
extractors.ContentExtractor.calculate_best_node
word.isalnum
self.parser.getText.sort
utils.cache_disk
self.clean_em_tags
videos.extractors.VideoExtractor.get_videos
os.remove
self.extractor.get_meta_type
self.set_feeds
self.set_html
pow
self.assertRaises
parsed.query.split
requests.get
os.mkdir
is_dict
p_url.startswith
PIL.ImageFile.Parser
search_str.strip.strip
newspaper.Source.parse
self.throw_if_not_downloaded_verbose
self.update_score
url_part.lower.startswith
func
dateutil.parser.parse
get_available_languages
unittest.skipIf
title.TITLE_REPLACEMENTS.replaceAll.strip
unittest.skip
urls.get_path.split
self.parser.createElement
tldextract.tldextract.extract
self._map_title_to_feed
urllib.parse.urlparse.split
Dispatch.error
logging.getLogger
re.compile.search
list.append
item.title
self.parser.getElementsByTags
n.strip
nlp.summarize
sbs
newspaper.hot
utils.extract_meta_refresh
PIL.Image.open
all
tld_dat.domain.lower
response.headers.get
setattr
title_text.filter_regex.sub.lower
content.encode.encode
pickle.dump
txt.innerTrim.split
newspaper.news_pool.join
print
rp.replaceAll
sys.exit
copy.deepcopy.items
urllib.parse.parse_qs.get
hasattr
mock_resource_with.strip
self.parser.isTextNode
int.isdigit
match.xpath
sys.path.append
lxml.html.clean.Cleaner
self.config.get_parser.get_unicode_html
prepare_url
urllib.parse.urljoin
self.get_embed_type
article.Article
key.split.pop
self.calculate_area
self.is_highlink_density
x.replace
memo.keys
self.release_resources
set.update
_authors.append
self.get_width
self.candidates.remove
m_requests.append
re.search.group
self.parser.getTag
self.set_meta_favicon
div.set
self.get_height
urllib.parse.urljoin.append
node.cssselect
format
self.extractor.get_canonical_link
badword.lower
getattr
self.movies.append
self.extractor.get_feed_urls
newspaper.configuration.Configuration
self._generate_articles
f.read
self.parser.outerHtml
re.sub.startswith
is_string
nltk.data.load
self.purge_articles
self.parser.getAttribute
html.unescape
self.pattern.split
threading.Thread.__init__
onlyascii
mthreading.NewsPool
self.parse_categories
self.categories_to_articles
io.StringIO
self.add_newline_to_br
node.itersiblings
parse_date_str
re.compile
a.get
self.parser.drop_tag
utils.clear_memo_cache
hint.filter_regex.sub.lower
top_node.insert
self.title.nlp.keywords.keys
__name__.logging.getLogger.addHandler
self.extractor.get_meta_keywords
s.strip
self.get_siblings_score
self.set_authors
overlapping_stopwords.append
self.set_title
newspaper.Source.category_urls
self.parser.getElementsByTag.get
node.append
os.listdir
self.extractor.get_meta_img_url
self.remove_punctuation.split
failed_articles.append
os.path.dirname
extractors.ContentExtractor.post_cleanup
k.strip
self.StopWordsThai.super.__init__
text.innerTrim
IOError
codecs.open.split
self.extractor.get_urls.get_scheme
extractors.ContentExtractor
get_base_domain.split
fin.read
newspaper.build
self.title.split
self.get_replacement_nodes
tinysegmenter.TinySegmenter
tuple
mock_resource_with
self.replacements.append
prepare_image.thumbnail
utils.FileHelper.loadResourceFile
self.fetch_images
uniqify_list
network.get_html
match.text_content
self.remove_drop_caps
self._get_urls
url_slug.split
ref.get
self.set_reddit_top_img
self.config.get_parser
root.insert
valid_categories.append
newspaper.network.multithread_request
path.split.remove
glob.glob
cls.createElement
self.set_tags
settings.cj
Exception
cleaners.DocumentCleaner
keywords.keys.set.intersection
domain.replace
WordStats.set_word_count
fetch_image_dimension
authors.extend
self.add_newline_to_li
self.get_score
split_words
logging.NullHandler
self.article.parse
self.parser.getElementsByTags.reverse
contains_digits
self.parser.getText
pythainlp.word_tokenize
node.getprevious
self.parser.clean_article_html
re.match.group
zip
kwargs.str.args.str.encode.sha1.hexdigest
self.tasks.put
get_html_2XX_only
words.append
self.config.get_parser.getElementsByTag
self.feeds_to_articles
self.generate_articles
url_part.lower
clean_url
io.StringIO.seek
content.encode.decode
node.lxml.etree.tostring.decode
sb.append
self.language.self.stopwords_class.get_stopword_count.get_stopword_count
image_entropy
attr.self.getattr
self.stopwords_class
utils.ReplaceSequence
self.set_movies
nltk.data.load.tokenize
resps.append
self.parser.replaceTag
self.parser.delAttribute
Dispatch.isAlive
p.lower
self.nodes_to_check
mthreading.ThreadPool.add_task
lxml.html.fromstring
length_score
newspaper.Source.set_categories
next
self.get_provider
nodes_to_return.append
self.remove_scripts_styles
urllib.parse.urlparse
urls.get_scheme
self.pool.add_task
newspaper.popular_urls
url_slug.count
node.self.parser.nodeToString.splitlines
self.parser.xpath_re
WordStats.set_stop_words
self.parser.nextSibling
self.text.nlp.keywords.keys
fetch_url
utils.print_available_languages
WordStats
self.split_title
self.is_media_news
self.StopWordsArabic.super.__init__
uniq.values
newspaper.Source
split_sentences
response.raw._connection.close
self.div_to_para
self.download_categories
self.extractor.get_first_img_url
abs
self.has_top_image
utils.memoize_articles.append
self.clean_bad_tags
utils.StringReplacement.replaceAll
self.set_categories
newspaper.news_pool.set
value.lower
prepare_image.save
self.extractor.post_cleanup
requests.utils.get_encodings_from_content
txts.join.strip
self.extractor.get_img_urls.add
feedparser.parse
self.get_meta_content
ThreadPool
utils.URLHelper.get_parsing_candidate
images.Scraper
u.strip
Feed
e.get
self.assertEqual
urllib.parse.parse_qs
div.clear
prop.attrib.get
url_part.replace
self.setup_stage
PIL.ImageFile.Parser.feed
matches.extend
newspaper.urls.prepare_url
memo.get
self.set_meta_language
self.extractor.get_img_urls
videos.Video
self.article.extractor.get_meta_type
nltk.stem.isri.ISRIStemmer.stem
self.tasks.task_done
domain.replace.replace
Worker
self.set_description
self.throw_if_not_parsed_verbose
l.strip.split

@developer Could you please help me check this issue? May I open a pull request to fix it? Thank you very much.

banagale commented 1 year ago

This project is not actively maintained. Consider forking and making the dependency fixes for yourself.

daaniikusnanta commented 1 year ago

Are there any alternatives to this? I'm looking for an easy way to get articles and news from selected news websites, and this project is perfect. But seeing that this project is not maintained worries me. Thanks in advance!

@banagale

> This project is not actively maintained. Consider forking and making the dependency fixes for yourself.

sfavorite commented 1 year ago

I switched to https://github.com/goose3/goose3

Works great.

PyDeps commented 1 year ago

ok.