Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

UnicodeDecode Error #94

Closed: ploustaunau closed this issue 8 years ago

ploustaunau commented 8 years ago

I am trying to use html2text to clean up HTML tags on news reports scraped from a Google RSS feed. I run into some UnicodeDecode errors. Specifically, I run html2text directly on the command line:

html2text --ignore-links --ignore-images 52778881361118.htm > test.txt

I could not enclose the file as the issue tracker won't take such files. It gives me the following error:

Traceback (most recent call last):
  File "/usr/local/bin/html2text", line 8, in <module>
    load_entry_point('html2text==2014.9.25', 'console_scripts', 'html2text')()
  File "/Library/Python/2.7/site-packages/html2text/__init__.py", line 1083, in main
    data = data.decode(encoding)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 22495: invalid start byte

Do you have any insight into how I can work around this? Thanks, Philippe

theSage21 commented 8 years ago

I will take a look as soon as I can get my hands on a computer.

ploustaunau commented 8 years ago

Thank you, Arjoonn. BTW, the file that is giving me trouble is enclosed, in case it helps. The system would not let me attach it. Philippe

Alir3z4 commented 8 years ago

@ploustaunau you can paste the html file on dpaste or gist for this purpose.


ploustaunau commented 8 years ago

Here is the file, copied and pasted from the htm file:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
X-Type: default
X-Cache-Group: normal
Date: Fri, 19 Jun 2015 14:56:20 GMT
X-Cache: MISS
Link: ; rel=shortlink
Keep-Alive: timeout=20
X-Pingback: http://www.smarteranalyst.com/xmlrpc.php
X-Pass-Why:
Cache-Control: max-age=600, must-revalidate
X-Cacheable: SHORT
Connection: keep-alive
Vary: Accept-Encoding,Cookie
Server: nginx

• Home
• Analyst Insights
• Contributor Opinions
• Stock News
• Sectors
◦ Healthcare
◦ Services
◦ Technology
◦ Basic Materials
◦ Industrial Goods
◦ Consumer Goods
◦ Financial
◦ Utilities
• Alerts

Login | Connect     

b a ‌ d j C c 1  Bill Gunderson Contributor

Website Bill Gunderson is the CEO and Chief Market Strategist of Gunderson Capital Managment in San Diego, CA. He is also a professional money manager, former research analyst, author of Best Stocks Now, and developer… More >> Seattle Genetics, Inc. (SGEN)’s Collaborative Approach To Fighting Cancer Makes It A Buy June 18, 2015 4:49 AM EDT in Contributor Opinions � Healthcare Seattle Genetics, Inc. (NASDAQ:SGEN) is a biotechnology company involved in developing antibody-based therapies for the treatment of cancer. As the company’s name would suggest, it is headquartered just outside of Seattle in Bothell, Washington. Seattle Genetics has nearly 20 collaborations for its monoclonal antibody-drug conjugate (ADC) technology designed to harness the targeting ability of antibodies to deliver cell-killing agents directly to cancer cells. Companies it is collaborating with include big pharma names like: AbbVie, Bayer, Genentech, GlaxoSmithKline, and Pfizer. Primary Mechanism of Action of ADCs: Targeted Delivery of a Potent Cytotoxic Agent  The company’s flagship product is ADCETRIS (brentuximab vedotin) which is commercially available intravenous drug approved in 2011 for two lymphoma indications and used in 50 countries. Seattle Genetics jointly developed ADCETRIS in collaboration with Takeda Pharmaceutical. Seattle Genetics has exclusive commercialization rights in the U.S. and Canada and Japan-based Takeda has the right to produce the drug in other countries. On June 8th, the company announced an agreement to team up with Cambridge, MA-based Unum Therapeutics. SeaGen made a $25 million upfront cash payment and agreed to put up another $5 million in Unum’s next round of financing. Unum is part of the fast-moving field of T-cell therapy. Other companies in this field (aka CAR-T) include multi-billion-dollar companies Juno Therapeutics�and Kite Pharma who have produced strong results on certain blood cancers in their clinical trials. 
Unum’s technology is slightly different, engineering T-cells with a surface protein which helps them attach to a wide array of antibodies. The technology is referred to as “antibody-coupled T-cell receptor” technology, or ACTR. The concept of combining ADC and ACTR technologies is an approach that SeaGen hopes will have “broad applicability across a range of cancer targets,” according to SeaGen’s CEO Clay Siegall. The market seemed less than enthusiastic about the partnership, with the stock trading down 4% on the news, but Seattle Genetics’ expansion into immuno-oncology is probably a good long-term strategic investment. SeaGen is revising its guidance for 2015 to account for the effects of the collaboration. Seattle Genetics’ largest shareholder, owning 24% of its stock, is Dr. Felix Baker, the Managing Partner of the Baker Brothers Capital Hedge Fund. The Baker Brothers is banking on SeaGen’s success, and it has a pretty good track record of picking winners such as Pharmacyclics�and Incyte, and YTD it has been a pretty good investment, up almost 45%.  Seattle Genetics’ latest collaboration adds to an already very robust product pipeline. While the company is not yet profitable, analyst estimates continue to ratchet upward. While SeaGen faces potential competition from behemoths like Merck�and Bristol-Myers, its robust pipeline of opportunities, collaborative approach, and the fact that “smart money” like the Baker Brothers have skin in the game, make Seattle Genetics a good investment opportunity.But let’s give Seattle the final litmus test: how does it look as a Best Stock Now?

   Seattle Genetics is a Mid Cap drug stock with an Aggressive risk profile. Its market cap is approximately $6 billion. I am long the stock for my Aggressive Growth clients.

As mentioned, the company is not yet profitable, so traditional valuation metrics are not relevant. The company’s estimated 5-year annual growth rate is 17.4%, but that is likely on the conservative side given the tremendous growth opportunity should SGEN’s cancer approach continue on its current path of success.

YTD, post the announcement of favorable data in December, SeaGen stock is up 49%. Over the last 1 year, the stock is up almost 28% earning it a Momentum Grade of A+ and a Performance Grade of A.

 Out of the more than 4,900 stocks followed in the Best Stocks Now universe, Seattle Genetics ranks #3. This gives it a rating of A+ and it ranks as among the BEST stocks right now.The good news is that there are now a lot of promising cancer therapies out there in the immunotherapy and monoclonal antibody space. Obviously, this creates a lot of speculation and some premium valuations for stocks with exposure to this area. Seattle Genetics is a way to get exposure to these promising research areas, and with the collaboration and backing of some big companies, it seems less risky than many of the other plays out there.

Don’t be too late to the party ��Click Here to see what 4500 Wall Street Analysts�say about your stocks. • Seattle Genetics Inc. • SGEN You May Also Like See what other Wall Street analysts/financial bloggers say about SGEN See the 25 best performing financial bloggers of 2014 See the latest stocks rated today

Related Articles  Zacks’ Bull Of The Day: Seattle Genetics  Seattle Genetics: Continued Quality And Breadth Of Data Should Yield Investor Re-Engagement, Says H.C. Wainwright  The Dividend Diplomats Recent Buy: Johnson & Johnson (JNJ)  Sunshine Heart Inc: The Lance Armstrong Effect Setting C-Pulse Apart Sponsored See the Top 25 ranked Wall Street Analysts See the Most Recommended stocks by Top Performing Analysts Stock Spotlight: Gilead Sciences (GILD) Insider Spotlight: Phillip Frost Insider Profile: Marissa Mayer See the latest stocks recommended by 5-star Financial Analysts

Real-Time Email Alerts Choose Stocks Your Email TRY NOW!    Privacy Policy | RSS feeds | Submit Tips | About Us | Contact Us Copyright © 2015 Smarter Analyst - All Rights Reserved

theSage21 commented 8 years ago

@ploustaunau go to https://dpaste.de/ and put the html there. Then paste the link of the page here. A sample might be https://dpaste.de/eRsF

The HTML you pasted here was sanitized by GitHub, so it converts without any errors.

ploustaunau commented 8 years ago

See https://dpaste.de/BE9F

theSage21 commented 8 years ago

The provided html is converting without errors. This is what I did:

mkdir temp
cd temp
virtualenv env
source env/bin/activate
pip install html2text==2014.9.25
python --version
html2text --version

The commands tell me that Python 2.7.6 and html2text 2014.9.25 are now available. With the environment set up, the next step was this:

html2text --ignore-links --ignore-images sample.html > text

Please confirm that these steps run fine on your machine. Note: you should consider updating the package. The current version is 2015.6.21.
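When a pasted copy converts cleanly but the local file does not, it can help to check whether the file on disk really contains bytes that are not valid UTF-8. The following is a minimal sketch for that check, not part of html2text; the function name `find_invalid_utf8` and the sample bytes are hypothetical, and it is written for Python 3.

```python
def find_invalid_utf8(data, limit=10):
    """Return up to `limit` (offset, byte_value) pairs where UTF-8 decoding fails.

    Hypothetical diagnostic helper; it only inspects the bytes and changes nothing.
    """
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            # Decode the remainder; success means no further bad bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so shift it back.
            offset = pos + exc.start
            problems.append((offset, data[offset]))
            # Resume scanning just past the offending byte.
            pos = offset + 1
    return problems


if __name__ == "__main__":
    # Made-up sample: ASCII text with one stray 0xa5 byte.
    sample = b"abc\xa5def"
    for offset, value in find_invalid_utf8(sample):
        print("offset %d: byte 0x%02x" % (offset, value))
```

To run it against a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. An empty result would suggest the file is valid UTF-8 and the problem lies elsewhere.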

ploustaunau commented 8 years ago

It is not working for me. I updated to 2015.6.21, and it still fails. Enclosed is the file I am trying to run through html2text. Can you try it directly to see if you get the same error I do? Thanks, Philippe

Philippe Loustaunau, Ph.D. Vista Consulting LLC 3835 9th Street N. PH1W Arlington, VA 22203

Tel: (571) 236-1427 Fax: (571) 490-8468 Email: philippe@conseil-vista.com www.conseil-vista.com


theSage21 commented 8 years ago

@ploustaunau I am unable to recreate the error. @Alir3z4 any success? In the meantime, you might find the --decode-errors=HANDLER command-line argument useful, as shown in the docs.
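For context on what a decode error handler does, here is a small sketch using Python's standard codec error handlers (`strict`, `ignore`, `replace`). It assumes the HANDLER value is passed through to `bytes.decode`, which the `data.decode(encoding)` call in the traceback suggests but this thread does not confirm. The sample bytes and function names are hypothetical, and the sketch is written for Python 3.

```python
# Hypothetical sample: ASCII text containing one stray 0xa5 byte,
# which is not a legal UTF-8 start byte.
RAW = b"Juno Therapeutics\xa5and Kite Pharma"


def decode_with(handler):
    """Decode RAW as UTF-8 using one of Python's standard error handlers."""
    return RAW.decode("utf-8", errors=handler)


def strict_succeeds():
    """Return True if strict decoding works, False if it raises."""
    try:
        decode_with("strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print("strict ok:", strict_succeeds())
    print("ignore   :", decode_with("ignore"))   # bad byte dropped
    print("replace  :", decode_with("replace"))  # bad byte becomes U+FFFD
```

With `ignore`, undecodable bytes are dropped silently, so the output can lose characters (here the two words run together). `replace` keeps a visible U+FFFD marker where each bad byte was, which may be preferable when you want to see where the data was damaged.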

ploustaunau commented 8 years ago

I had to install the July version of the config file, which has the decode_errors option. I set it to ignore, and all went well. Thank you, Philippe

theSage21 commented 8 years ago

@ploustaunau should we consider the issue closed?

ploustaunau commented 8 years ago

Yes, thanks, Philippe


Alir3z4 commented 8 years ago

@theSage21 Thanks for the contribution on this issue.