Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Python 3.5 support #98

Closed Alir3z4 closed 9 years ago

Alir3z4 commented 9 years ago

The first attempt at running the tests on Python 3.5 has failed: https://travis-ci.org/Alir3z4/html2text/jobs/89231913

There is no hurry on Python 3.5 support for now, but the failures are cool and weird.

Using worker: worker-linux-f25aadab-1.bb.travis-ci.org:travis-linux-3

Build system information
Build language: python
Build image provisioning date and time
Wed Feb  4 18:22:50 UTC 2015
Operating System Details
Distributor ID: Ubuntu
Description:    Ubuntu 12.04 LTS
Release:    12.04
Codename:   precise
Linux Version
2.6.32-042stab090.5
Cookbooks Version
23bb455 https://github.com/travis-ci/travis-cookbooks/tree/23bb455
GCC version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

LLVM version
clang version 3.4 (tags/RELEASE_34/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
Pre-installed Ruby versions
ruby-1.9.3-p551
Pre-installed Node.js versions
v0.10.36
Pre-installed Go versions
1.4.1
Redis version
redis-server 2.8.19
riak version
2.0.2
MongoDB version
MongoDB 2.4.12
CouchDB version
couchdb 1.6.1
Neo4j version
1.9.4
Cassandra version
2.0.9
RabbitMQ Version
3.4.3
ElasticSearch version
1.4.0
Installed Sphinx versions
2.0.10
2.1.9
2.2.6
Default Sphinx version
2.2.6
Installed Firefox version
firefox 31.0esr
PhantomJS version
1.9.8
ant -version
Apache Ant(TM) version 1.8.2 compiled on December 3 2011
mvn -version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/maven
Java version: 1.7.0_76, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-oracle/jre
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-042stab090.5", arch: "amd64", family: "unix"

3.5 is not installed; attempting download
$ git clone --depth=50 --branch=master https://github.com/Alir3z4/html2text.git Alir3z4/html2text
Cloning into 'Alir3z4/html2text'...
remote: Counting objects: 754, done.
remote: Compressing objects: 100% (329/329), done.
remote: Total 754 (delta 454), reused 693 (delta 404), pack-reused 0
Receiving objects: 100% (754/754), 128.28 KiB | 0 bytes/s, done.
Resolving deltas: 100% (454/454), done.
Checking connectivity... done.
$ cd Alir3z4/html2text
$ git checkout -qf d62e3e36fee59682a02acc67e406012fc4186db7
$ source ~/virtualenv/python3.5/bin/activate
$ python --version
Python 3.5.0
$ pip --version
pip 7.1.2 from /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages (python 3.5)
$ pip install coveralls==0.5
Collecting coveralls==0.5
  Downloading coveralls-0.5.zip
Collecting PyYAML>=3.10 (from coveralls==0.5)
  Downloading PyYAML-3.11.tar.gz (248kB)
    100% |████████████████████████████████| 249kB 306kB/s 
Collecting docopt>=0.6.1 (from coveralls==0.5)
  Downloading docopt-0.6.2.tar.gz
Collecting coverage<3.999,>=3.6 (from coveralls==0.5)
  Downloading coverage-3.7.1.tar.gz (284kB)
    100% |████████████████████████████████| 286kB 680kB/s 
Collecting requests>=1.0.0 (from coveralls==0.5)
  Downloading requests-2.8.1-py2.py3-none-any.whl (497kB)
    100% |████████████████████████████████| 499kB 747kB/s 
Building wheels for collected packages: coveralls, PyYAML, docopt, coverage
  Running setup.py bdist_wheel for coveralls
  Stored in directory: /home/travis/.cache/pip/wheels/f5/ab/25/267e79de52ac71d17c371d4cca6b51d8c7b3a686cc5e0413ec
  Running setup.py bdist_wheel for PyYAML
  Stored in directory: /home/travis/.cache/pip/wheels/fa/db/f6/dee55793d344f1706dc4a5a693298f0115241d1085cc212364
  Running setup.py bdist_wheel for docopt
  Stored in directory: /home/travis/.cache/pip/wheels/0d/5c/a7/cb986749520c1950217b5d8405def5c18541322dbc411a80d1
  Running setup.py bdist_wheel for coverage
  Stored in directory: /home/travis/.cache/pip/wheels/1c/2d/55/470890618558cacad65599407445c2a1636222bd896c9023f3
Successfully built coveralls PyYAML docopt coverage
Installing collected packages: PyYAML, docopt, coverage, requests, coveralls
Successfully installed PyYAML-3.11 coverage-3.7.1 coveralls-0.5 docopt-0.6.2 requests-2.8.1
$ [ "${TRAVIS_PYTHON_VERSION}" = "2.6" ] && pip install --use-mirrors unittest2 || /bin/true
$ export COVERAGE_PROCESS_START=$PWD/.coveragerc
$ PYTHONPATH=$PYTHONPATH:. coverage run --source=html2text --rcfile=.coveragerc setup.py test -v
running test
......................FF....FFFF......FFFF.........F.........F..........................
======================================================================
FAIL: test_emdash-para_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: 'Baco[257 chars]shank--\n\n--irure ex esse id, ham commodo mea[476 chars]\n\n' != 'Baco[257 chars]shank—\n\n—irure ex esse id, ham commodo meatl[474 chars]\n\n'
  Bacon ipsum dolor sit amet pork chop id pork belly ham hock, sed meatloaf eu
  exercitation flank quis veniam officia. Chuck dolor esse, occaecat est elit
  drumstick ground round tri-tip nisi. Eu fugiat drumstick leberkas magna.
- Turducken frankfurter nisi aute shank--
?                                      ^^
+ Turducken frankfurter nisi aute shank—
?                                      ^

- --irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami
? ^^
+ —irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami in
? ^                                                                          +++
- in fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork
? ---
+ fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork chop,
?                                                                       ++++++
- chop, ad leberkas reprehenderit id voluptate salami ham ut in ut cillum
? ------
+ ad leberkas reprehenderit id voluptate salami ham ut in ut cillum turducken.
?                                                                  +++++++++++
- turducken. Nisi ribeye tail capicola dolore andouille. Short ribs id beef
? -----------
+ Nisi ribeye tail capicola dolore andouille. Short ribs id beef ribs, et nulla
?                                                               +++++++++++++++
- ribs, et nulla ground round do sunt dolore. Dolore nisi ullamco veniam sunt.
? ---------------
+ ground round do sunt dolore. Dolore nisi ullamco veniam sunt. Duis brisket
?                                                              +++++++++++++
- Duis brisket drumstick, dolor fatback filet mignon meatloaf laboris tri-tip
? -------------
+ drumstick, dolor fatback filet mignon meatloaf laboris tri-tip speck chuck
?                                                               ++++++++++++
- speck chuck ball tip voluptate ullamco laborum.
? ------------
+ ball tip voluptate ullamco laborum.

  \--

======================================================================
FAIL: test_emdash-para_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Baco[257 chars]shank--\n\n--irure ex esse id, ham commodo mea[476 chars]\n\n' != 'Baco[257 chars]shank—\n\n—irure ex esse id, ham commodo meatl[474 chars]\n\n'
  Bacon ipsum dolor sit amet pork chop id pork belly ham hock, sed meatloaf eu
  exercitation flank quis veniam officia. Chuck dolor esse, occaecat est elit
  drumstick ground round tri-tip nisi. Eu fugiat drumstick leberkas magna.
- Turducken frankfurter nisi aute shank--
?                                      ^^
+ Turducken frankfurter nisi aute shank—
?                                      ^

- --irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami
? ^^
+ —irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami in
? ^                                                                          +++
- in fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork
? ---
+ fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork chop,
?                                                                       ++++++
- chop, ad leberkas reprehenderit id voluptate salami ham ut in ut cillum
? ------
+ ad leberkas reprehenderit id voluptate salami ham ut in ut cillum turducken.
?                                                                  +++++++++++
- turducken. Nisi ribeye tail capicola dolore andouille. Short ribs id beef
? -----------
+ Nisi ribeye tail capicola dolore andouille. Short ribs id beef ribs, et nulla
?                                                               +++++++++++++++
- ribs, et nulla ground round do sunt dolore. Dolore nisi ullamco veniam sunt.
? ---------------
+ ground round do sunt dolore. Dolore nisi ullamco veniam sunt. Duis brisket
?                                                              +++++++++++++
- Duis brisket drumstick, dolor fatback filet mignon meatloaf laboris tri-tip
? -------------
+ drumstick, dolor fatback filet mignon meatloaf laboris tri-tip speck chuck
?                                                               ++++++++++++
- speck chuck ball tip voluptate ullamco laborum.
? ------------
+ ball tip voluptate ullamco laborum.

  \--

======================================================================
FAIL: test_googledocmassdownload_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 

======================================================================
FAIL: test_googledocmassdownload_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 

======================================================================
FAIL: test_googledocsaved_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 

======================================================================
FAIL: test_googledocsaved_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 

======================================================================
FAIL: test_html-escaping_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: 'Escaped HTML like &lt;div&gt; or &amp; should remain escape[100 chars]\n\n' != 'Escaped HTML like <div> or & should remain escaped on outpu[90 chars]\n\n'
- Escaped HTML like &lt;div&gt; or &amp; should remain escaped on output
?                   ^^^^   ^^^^     ----
+ Escaped HTML like <div> or & should remain escaped on output
?                   ^   ^

      ...unless that escaped HTML is in a <pre> tag

  `...or a <code> tag`

======================================================================
FAIL: test_html-escaping_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Escaped HTML like &lt;div&gt; or &amp; should remain escape[100 chars]\n\n' != 'Escaped HTML like <div> or & should remain escaped on outpu[90 chars]\n\n'
- Escaped HTML like &lt;div&gt; or &amp; should remain escaped on output
?                   ^^^^   ^^^^     ----
+ Escaped HTML like <div> or & should remain escaped on output
?                   ^   ^

      ...unless that escaped HTML is in a <pre> tag

  `...or a <code> tag`

======================================================================
FAIL: test_html_entities_out_of_text_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: '[allas: Country Manager](http://thth)\n\n' != '[állás: Country Manager](http://thth)\n\n'
- [allas: Country Manager](http://thth)
?  ^  ^
+ [állás: Country Manager](http://thth)
?  ^  ^

======================================================================
FAIL: test_html_entities_out_of_text_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: '[allas: Country Manager](http://thth)\n\n' != '[állás: Country Manager](http://thth)\n\n'
- [allas: Country Manager](http://thth)
?  ^  ^
+ [állás: Country Manager](http://thth)
?  ^  ^

======================================================================
FAIL: test_invalid_unicode_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Br\n\n' != 'B�r\n\n'
- Br
+ B�r
?  +

======================================================================
FAIL: test_nbsp_unicode_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/Alir3z4/html2text/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: '# NB[182 chars]ed do\xa0eiusmod\ntempor incididunt ut\xa0labo[385 chars]\n\n' != '# NB[182 chars]ed do eiusmod\ntempor incididunt ut labore et [349 chars]\n\n'
  # NBSP handling test #2

  In this test all NBSPs will be replaced with unicode non-breaking spaces
  (unicode_snob = True).

- Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
?                                                                 ^
+ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
?                                                                 ^
- tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
?                     ^         ^                       ^       ^
+ tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
?                     ^         ^                       ^       ^
- quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
?                                                  ^          ^
+ quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
?                                                  ^          ^
  consequat.

- Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
?                         ^                ^
+ Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
?                         ^                ^
- eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
?   ^
+ eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
?   ^
- in culpa qui officia deserunt mollit anim id est laborum.
?   ^                                         ^
+ in culpa qui officia deserunt mollit anim id est laborum.
?   ^                                         ^

----------------------------------------------------------------------
Ran 88 tests in 11.903s

FAILED (failures=12)

The command "PYTHONPATH=$PYTHONPATH:. coverage run --source=html2text --rcfile=.coveragerc setup.py test -v" exited with 1.
$ coverage combine

The command "coverage combine" exited with 0.
$ coverage report
Name                 Stmts   Miss Branch BrMiss  Cover
------------------------------------------------------
html2text/__init__     573     45    337     27    92%
html2text/cli           72      9      6      1    87%
html2text/compat        10      4      2      1    58%
html2text/config        33      0      0      0   100%
html2text/utils        103      4     54      2    96%
------------------------------------------------------
TOTAL                  791     62    399     31    92%

The command "coverage report" exited with 0.

Done. Your build exited with 1.

theSage21 commented 9 years ago

@Alir3z4 These failing tests can be broken down into:

  1. spaces and "\xa0" are interchanged.
    • nbsp_unicode_md has "\xa0"
  2. invalid_unicode_md allows the invalid character to go through (the output is 'B�r' where 'Br' is expected)
  3. html_entities_out_of_text does not convert állás to allas
  4. html-escaping does not work for <, > and &
  5. googledocsaved, googledocmassdownload
    • Extra space after \ in googledocsaved
    • extra spaces after `
  6. emdash-para
    • one fewer - at the end ("--" is expected, a single "—" comes out)
    • strange wrapping

The first issue is easily handled. It is the character \xa0 instead of a normal space.
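
In the nbsp_unicode diff in the log, the `-` and `+` lines look identical because "\xa0" and a plain space render the same way. Below is a minimal sketch for making the difference visible when comparing two strings. `nbsp_mismatches` is a hypothetical helper, not part of html2text or its test suite, and the sample strings are a fragment taken from the assertion message above.

```python
NBSP = "\xa0"


def nbsp_mismatches(expected, actual):
    """Return the indexes where one string has NBSP and the other a space."""
    return [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a and {e, a} == {NBSP, " "}
    ]


expected = "sed do\xa0eiusmod"
actual = "sed do eiusmod"

# ascii() escapes the non-breaking space, so it shows up in the output.
print(ascii(expected))
print(ascii(actual))
print(nbsp_mismatches(expected, actual))  # [6]
```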

theSage21 commented 9 years ago

@Alir3z4 Just noticed that this is a duplicate of https://github.com/Alir3z4/html2text/issues/89. It is funny that this is https://github.com/Alir3z4/html2text/issues/98 and that is https://github.com/Alir3z4/html2text/issues/89. Can you close this one please? I will continue working on that since that was reported earlier.

Alir3z4 commented 9 years ago

@theSage21 Good catch, it's interesting. It's a conspiracy theory, Illuminati alert :D

The issue is closed and marked as a duplicate of #89.