AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
429 stars, 37 forks

doesn't recognize an article on TechCrunch.com #118

Open AndyTheFactory opened 10 months ago

AndyTheFactory commented 10 months ago

Issue by BanzaiTokyo Thu Aug 3 11:08:00 2017 Originally opened as https://github.com/codelucas/newspaper/issues/410


Example: https://techcrunch.com/2017/08/03/ge-spin-out-smartassist-io-raises-5m-series-a-for-its-ai-based-customer-service-platform/

AndyTheFactory commented 10 months ago

Comment by BanzaiTokyo Fri Sep 1 08:47:40 2017


Hi guys, I am curious whether you check your issues. Would you like to comment on this one?

AndyTheFactory commented 10 months ago

Comment by JosephMRally Mon Oct 16 02:11:04 2017


I've noticed this on several other sites too. Seems to be a new way to defeat extractions??

AndyTheFactory commented 10 months ago

Comment by PandaWhoCodes Mon Jan 29 04:11:19 2018


Same problem here. The demo app gives "Error converting a html to string": http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=https%3A%2F%2Ftechcrunch.com%2F2018%2F01%2F28%2Ffive-myths-of-seed-investing%2F

AndyTheFactory commented 10 months ago

Comment by ilkerceng Thu Dec 20 22:36:19 2018


Will there be any update on this?

AndyTheFactory commented 10 months ago

Comment by codelucas Sun Dec 23 13:40:35 2018


Hey, I can't reproduce this. On macOS, newspaper successfully extracted the HTML, full text, and title from https://techcrunch.com/2017/08/03/ge-spin-out-smartassist-io-raises-5m-series-a-for-its-ai-based-customer-service-platform/

@BanzaiTokyo can you be specific about which functionality is failing? Here is my repro:

~/workspace/newspaper-env » python3
Python 3.7.0
[Clang 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from newspaper import Article
>>> url = 'https://techcrunch.com/2017/08/03/ge-spin-out-smartassist-io-raises-5m-series-a-for-its-ai-based-customer-service-platform/'
>>> a = Article(url)
>>> a.download()
>>> a.parse()
>>> a.nlp()
>>> a.text
'Last November, GE acquired the AI-centric startup Wise.io to bolster its machine learning capabilities. While Wise.io’s core competency was in machine learning, its main product focused on helping enterprises manage customer service requests. Maybe unsurprisingly, GE has now spun out Wise.io in a new company, SmartAssist .io, which will continue to expand the service and work with existing customers like Twilio, MailChimp, ZenDesk, Zenefits and others.\n\nSmartAssist CEO Pradeep Rathinam tells me GE realized that it had acquired a very interesting product, but that it would take a dedicated team and funding to scale it.\n\nAs the SmartAssist team also announced today, Seattle’s Madrona Venture Group has invested $5 million in the company as the sole investor in its Series A round. Madrona managing partner (and former Microsoft executive VP) S. “Soma” Somasegar will join the company’s board, which also includes Wise.io founder and GE Digital’s VP of Intelligent Systems Jeff Erhardt.\n\nLike similar services, SmartAssist uses its AI technology to smartly route service requests to the right human agents. When possible, the service also can respond automatically, based on the ticket’s attributes and the routing rules a company can set for itself. Like all machine learning-based systems, SmartAssist needs a lot of data. Rathinam tells me this means the service works best for companies that handle at least 10,000 support tickets a month.\n\n“Applying ML/AI to intelligently automate use cases and workflows in enterprises is an area where we see a tremendous amount of opportunity and some of our recent investments reflect that investment thesis,” Madrona’s Somasegar writes in today’s announcement. 
“As we think about beachhead use cases of ML/AI within enterprises, customer support stands out as one of the most tangible areas that could be fundamentally disrupted through technology.”\n\nLooking ahead, the SmartAssist team plans to expand its service to also support chat-based customer service systems — millennials don’t exactly enjoy picking up the phone to talk to a customer service agent, after all, Rathinam noted.'
AndyTheFactory commented 10 months ago

Comment by codelucas Sun Dec 23 13:43:21 2018


The other comments were made a long time ago, so the bug may have been fixed already.

@ilkerceng since you commented most recently, can you confirm that you've reproduced this bug and paste your commands and stack trace? Thanks.
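
For anyone gathering the requested commands and stack trace, a minimal sketch of a repro helper follows. It is an illustration, not part of the library: `installed_version` and `repro_report` are hypothetical names, the distribution names `newspaper3k` and `newspaper4k` are assumed to be the ones installed from PyPI, and `importlib.metadata` requires Python 3.8 or newer.

```python
import platform
import traceback
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def repro_report(url):
    """Build a plain-text report: environment first, then result or stack trace."""
    lines = [
        f"python: {platform.python_version()} on {platform.system()}",
        # Distribution names are assumptions; adjust to what you installed.
        f"newspaper3k: {installed_version('newspaper3k')}",
        f"newspaper4k: {installed_version('newspaper4k')}",
        f"url: {url}",
    ]
    try:
        # Third-party import, done lazily so a missing install is reported too.
        from newspaper import Article

        article = Article(url)
        article.download()
        article.parse()
        lines.append(f"title: {article.title!r}")
        lines.append(f"text length: {len(article.text)}")
    except Exception:
        # Keep the full stack trace so it can be pasted into the issue.
        lines.append(traceback.format_exc())
    return "\n".join(lines)


# Example call (performs a network request):
# print(repro_report("https://techcrunch.com/2018/01/28/five-myths-of-seed-investing/"))
```

Pasting the returned text into a comment should show at a glance whether the failure is an exception or a silent empty result.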

AndyTheFactory commented 10 months ago

Comment by codelucas Sun Dec 23 13:44:14 2018


Article identification also works as expected:

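>>> import newspaper  # setup not shown in the original comment; assumed
>>> techcrunch = newspaper.build('https://techcrunch.com')  # assumed source URL
>>> for a in techcrunch.articles: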
...   print(a.url)
... 
https://techcrunch.com/2018/12/22/uber-drivers-settlement/
https://techcrunch.com/2018/12/22/juul-me-twice-shame-on-you/
https://techcrunch.com/2018/12/22/facebooks-fact-checkers-toil-on/
https://techcrunch.com/2018/12/22/the-top-smartphone-trends-to-watch-in-2019/
https://techcrunch.com/2018/12/22/twitters-newest-feature-is-reigniting-the-iphone-vs-android-war/
https://techcrunch.com/2018/12/22/slack-says-it-will-comply-with-sanctions/
https://techcrunch.com/2018/12/21/convo-now-lets-you-see-which-employees-got-the-memo/
https://techcrunch.com/2018/12/21/an-apple-event-but-with-bad-lip-reading/
https://techcrunch.com/2018/12/21/crowdfunded-developer-of-space-sim-star-citizen-takes-on-46m-in-funding-at-nearly-500m-valuation/
https://techcrunch.com/2018/12/21/gaming-chat-startup-discord-raises-150m-surpassing-2b-valuation/
https://techcrunch.com/2018/12/21/norad-santa-tracker-will-stay-on-even-if-the-government-shuts-down/
https://techcrunch.com/2018/12/21/jd-coms-billionaire-founder-richard-liu-wont-be-charged-in-sexual-misconduct-case/
https://techcrunch.com/2018/12/21/the-rodecaster-pro-is-a-perfect-centerpiece-to-a-home-podcasting-studio/
https://techcrunch.com/2018/12/21/a-runaway-gofundme-campaign-to-build-trumps-border-wall-raises-questions-about-its-funding-and-the-future/
https://techcrunch.com/2018/12/21/the-kardashian-apps-are-dead/
https://techcrunch.com/2018/12/21/self-driving-car-startup-zoox-gets-permit-to-transport-passengers-in-california/
https://techcrunch.com/2018/12/21/bellabeats-new-hybrid-smartwatch-tracks-your-stress-and-goes-with-your-outfit/
https://techcrunch.com/2018/12/21/bounce-seed-funding/
https://techcrunch.com/2018/12/21/gift-guide-13-last-minute-gifts-that-you-can-still-get-in-time/
https://techcrunch.com/2018/12/21/sec-slaps-startups-wealthfront-and-hedgeable-with-fines-for-making-false-disclosures/
AndyTheFactory commented 10 months ago

Comment by codelucas Sun Dec 23 13:45:41 2018


@PandaWhoCodes your example works as well, I just checked:

>>> from newspaper import Article
>>> url = 'https://techcrunch.com/2018/01/28/five-myths-of-seed-investing/'
>>> a = Article(url)
>>> a.download()
>>> a.parse()
>>> a.text
'Pre-seed has risen in prominence in recent months due to the growing gap between what founders are seeking at the seed stage and what the market is offering, yet conversations around pre-seed come with preconceived notions and false assumptions about the companies and investors who care about early stage funding.\n\nTo break down these misconceptions, we’ve assembled a list of 5 common myths about pre-seed and share what’s behind our passion for feeding the ideas of tomorrow’s next great companies.\n\nMyth 1. Pre-Seed Investors Invest in Ideas (and Little Else)\n\nThe term pre-seed investing brings to mind a simple transaction: the founder with a great resume has an idea, the investor writes a check, and it’s no big deal if things don’t work out because it’s just an experiment.\n\nThe misconception is that because companies don’t have traction data, pre-seed investors don’t have much to investigate and thus can’t evaluate deeply. This kind of zombie-like trade is far from reality.\n\nInstitutional pre-seed funds such as Afore believe that pre-seed is just like any other kind of investing, with risks inherent to its stage that can be successfully mitigated. Beyond assessing founder authenticity and market opportunity, we focus on two specific areas: product and distribution. We care about unique product insights and novel distribution approaches and want to know how both will work in the short-term. We’ll learn about what experiments the founders have run to-date to validate their hypotheses, and we keep probing until we hear “I don’t know.” While pre-seeds may not have traction in data, there’s plenty of traction in thought.\n\nMyth 2. Pre-Seed Companies Couldn’t Raise a Real Seed Round.\n\nIt’s assumed that companies seeking pre-seed investment simply aren’t good enough to raise a seed round, and must pare down their pitches and expectations in order to raise a smaller round. 
This misconception discourages investors from pre-seed opportunities, delivering the wrong message that there’s adverse selection at play because the company knows it’s not good enough to seek a bigger round.\n\nRaising pre-seed funding helps build and distribute the product, providing early traction with the least amount of capital. Founders are increasingly realizing that seed investors do not write the first check––with most seed capital coming 2.4 years after a company’s founding. Afore is part of a new class of pre-seed investors funding pre-product/market fit companies. Startups that lack product/market fit and the ability to scale aren’t ready for seed capital.\n\nThese investors supplement the friends and family round, providing institutional capital previously available much later. Pre-seed founders should raise $500K because it’s better than bootstrapping, and eliminates the potential for the high valuations and dilution inherent with raising large seed rounds.\n\nMyth 3. Pre-Seed Investing is All About Creating Optionality.\n\nAnother myth is that backers in these earliest-stage companies are casual investors who don’t actually know what they’re doing or care about their investments. Similar to an option bet, the idea is that investors have little to lose by placing money across a multitude of opportunities.\n\nNo founder likes to be an option bet or should choose an investor who doesn’t make them a priority. Funds like Afore are active investors exclusively focused on pre-seed who live and die by the success of their portfolios. Pre-seed investment isn’t an option bet to preempt the seed or Series A; it’s their bread and butter.\n\nPre-seed is a burgeoning segment comprised of deeply thoughtful, committed institutional investors that includes pre-seed capital firms like Bee Partners, K9, Pear, Precursor, Notation, Wonder, and many others. 
Further highlighting marketplace need, PitchBook and the National Venture Capital Association revealed that funding for companies of $1M or less is at its lowest point since 2011.\n\nMyth 4. The Pre-Seed Category is a Fad.\n\nRumblings persist that pre-seed investing is a flash in the pan that will collapse into standard seed investing soon enough. This is an idea based on the inaccurate belief that pre-seed only cropped up due to a bullish investing market.\n\nPre-seed stage companies look very different from seed stage companies in that they don’t have much traction, revenue, or product/market fit. And seed investors are uncomfortable with that level of risk. It’s hard to invest in companies without traction or revenue when compared with companies that possess cohort analysis, accurate LTV/CAC ratios and a strong grasp of their sales funnel. In this apples-to-oranges comparison, seed investors cannot also invest in pre-seed.\n\nAnother factor is the increasing size of seed funds. As fund sizes scaled, seed investors were forced to write bigger checks, pushing seed rounds closer to $5M. Given that the Partner time does not scale with fund size (that is until Elon Musk invents the 30-hour day!), there is no easy way for seed funds to write pre-seed sized checks for $500K then dedicate the time and attention they deserve.\n\nAs long as institutional investors have the appetite, experience and ability to take the “first check risk” well ahead of product/market fit, there will always be a need for the pre-seed round.\n\nMyth 5. Pre-Seed Funds Couldn’t Raise a Real Fund.\n\nMisconceptions about VC funds that focus on the pre-seed stage are also numerous. You may hear: pre-seed firms brand themselves that way because they can’t raise larger funds; they’d actually like to raise seed and series A funding but haven’t been successful; or it was never their true intention to invest in such early stage companies.\n\nExperience tells us otherwise. 
Our peer GPs all saw the venture trends early as well as the emerging gap in early stage funding and, being entrepreneurs, they took advantage.'
AndyTheFactory commented 10 months ago

Comment by BanzaiTokyo Sun Dec 23 17:50:26 2018


> Hey, I can't reproduce this. In my mac OSX attempt newspaper did work in extracting the html, full-text, and title from https://techcrunch.com/2017/08/03/ge-spin-out-smartassist-io-raises-5m-series-a-for-its-ai-based-customer-service-platform/
>
> @BanzaiTokyo can you be specific on which functionality is failing? Here is my repro:

I am away from my computer, so I cannot give you more details. But I checked your demo app, http://newspaper-demo.herokuapp.com/, and it looks like it has trouble with TechCrunch as well, both with the home page and with an article link.

AndyTheFactory commented 10 months ago

Comment by codelucas Mon Dec 24 04:53:46 2018


@BanzaiTokyo the demo app is not representative of newspaper3k: it was last updated more than two years ago and runs an outdated version of the library. Updating the demo is a separate task that still needs to be done.

AndyTheFactory commented 10 months ago

Comment by ilkerceng Fri Feb 1 20:40:42 2019


Hi @codelucas, sorry for the delay. I have just realized that there is no error, but the text is not extracted either. You can try this news source: https://www.aydinlik.com.tr/. I have tried to understand why the text could not be extracted from the article, and I noticed that the articles on that site contain many "p" HTML tags.
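
One way to check the observation about "p" tags is to compare how much text sits inside `<p>` elements of the downloaded HTML with how much text the extractor returned. The sketch below is only a diagnostic aid under stated assumptions: `ParagraphCounter`, `paragraph_stats`, and `diagnose` are hypothetical helpers, not newspaper APIs, and the counter is a simplification that assumes each `<p>` is explicitly closed.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts <p> tags and the characters of text found inside them."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.p_count = 0
        self.p_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_count += 1
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.p_chars += len(data.strip())


def paragraph_stats(html):
    """Return (number of <p> tags, characters of text inside them)."""
    counter = ParagraphCounter()
    counter.feed(html)
    return counter.p_count, counter.p_chars


def diagnose(url):
    """Download and parse one article, then compare raw HTML with extracted text."""
    from newspaper import Article  # third-party, imported lazily

    article = Article(url)
    article.download()
    article.parse()
    p_count, p_chars = paragraph_stats(article.html)
    return {
        "title": article.title,
        "extracted_text_length": len(article.text),
        "p_tags": p_count,
        "p_text_chars": p_chars,
    }


# Example call (performs a network request; pass the URL of a single article):
# print(diagnose("https://www.aydinlik.com.tr/"))
```

If `p_text_chars` is large while `extracted_text_length` is zero, the page was downloaded and the content is present in the HTML, which would point at the extraction step rather than at the download.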