clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Error if <!DOCTYPE html> is present in HTML #223

Closed rock321987 closed 6 years ago

rock321987 commented 6 years ago

This problem can be reproduced as follows:

from pattern3.web import Document
ss='''<!DOCTYPE html><a></a>'''
aaz=Document(ss)
aaz.children

This gives an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/lib/python3.6/site-packages/pattern3/web/__init__.py", line 3580, in __getattr__
    raise AttributeError("'Element' object has no attribute '%s'" % k)
AttributeError: 'Element' object has no attribute 'children'

Changing the string to ss='''<a></a>''' does not give an error.
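
Since the error only appears when the doctype is present, one possible workaround is to strip the declaration before parsing. The snippet below is a minimal sketch, not something confirmed against pattern3: `strip_doctype` is a hypothetical helper, and it assumes a simple declaration such as `<!DOCTYPE html>` with no internal subset. It uses only the standard library, so it runs without pattern installed.

```python
import re

# Matches a simple leading doctype declaration such as <!DOCTYPE html>.
# Assumption: the declaration has no internal subset ("[ ... ]") containing ">".
_DOCTYPE = re.compile(r"^\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def strip_doctype(html):
    """Return html with a leading doctype declaration removed, if there is one."""
    return _DOCTYPE.sub("", html, count=1)


ss = '''<!DOCTYPE html><a></a>'''
cleaned = strip_doctype(ss)
print(cleaned)  # <a></a>
```

The cleaned string could then be passed to Document(cleaned) in place of the original. Whether that avoids the AttributeError in pattern3 is untested here.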

initbar commented 6 years ago

@rock321987 I've tried to replicate the error using the Docker ubuntu:18.04 image:

root@a3506a595f72:~# uname -ar 
Linux a3506a595f72 4.14.33+ #1 SMP Sat Aug 11 08:05:16 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

One difference is that the import entrypoint is pattern and not pattern3:

root@a3506a595f72:~# pip3 install pattern 
Requirement already satisfied: pattern in /usr/local/lib/python3.6/dist-packages
Requirement already satisfied: backports.csv in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: feedparser in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: lxml in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: pdfminer.six in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: python-docx in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: mysqlclient in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: nltk in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: cherrypy in /usr/local/lib/python3.6/dist-packages (from pattern)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->pattern)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->pattern)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->pattern)
Requirement already satisfied: idna<2.8,>=2.5 in /usr/lib/python3/dist-packages (from requests->pattern)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from pdfminer.six->pattern)
Requirement already satisfied: pycryptodome in /usr/local/lib/python3.6/dist-packages (from pdfminer.six->pattern)
Requirement already satisfied: sortedcontainers in /usr/local/lib/python3.6/dist-packages (from pdfminer.six->pattern)
Requirement already satisfied: singledispatch in /usr/local/lib/python3.6/dist-packages (from nltk->pattern)
Requirement already satisfied: cheroot>=6.2.4 in /usr/local/lib/python3.6/dist-packages (from cherrypy->pattern)
Requirement already satisfied: zc.lockfile in /usr/local/lib/python3.6/dist-packages (from cherrypy->pattern)
Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/dist-packages (from cherrypy->pattern)
Requirement already satisfied: portend>=2.1.1 in /usr/local/lib/python3.6/dist-packages (from cherrypy->pattern)
Requirement already satisfied: backports.functools-lru-cache in /usr/local/lib/python3.6/dist-packages (from cheroot>=6.2.4->cherrypy->pattern)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from zc.lockfile->cherrypy->pattern)
Requirement already satisfied: tempora>=1.8 in /usr/local/lib/python3.6/dist-packages (from portend>=2.1.1->cherrypy->pattern)
Requirement already satisfied: jaraco.functools>=1.20 in /usr/local/lib/python3.6/dist-packages (from tempora>=1.8->portend>=2.1.1->cherrypy->pattern)
Requirement already satisfied: pytz in /usr/local/lib/python3.6/dist-packages (from tempora>=1.8->portend>=2.1.1->cherrypy->pattern)
root@a3506a595f72:~# python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pattern.web import Document
>>> ss='''<!DOCTYPE html><a></a>'''
>>> ss
'<!DOCTYPE html><a></a>'
>>> aaz=Document(ss)
>>> aaz.children
[Text('html'), Element(tag='html')]
>>> 

Otherwise, the string parsed without error. How did you build/install the pattern module locally?

rock321987 commented 6 years ago

Yeah, you are right. The pattern3 library I used was a different package. I used the dev branch of the pattern library and it worked for me. At the time I was using it, it wasn't available on pip.
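
For anyone hitting the same confusion between the pattern and pattern3 packages, a quick check of which import names resolve, and from where, can help. This is a rough sketch using only the standard library. `find_origins` is a hypothetical helper name, and the two package names are simply the ones mentioned in this thread.

```python
import importlib.util


def find_origins(names):
    """Map each top-level module name to the file it would load from, or None."""
    origins = {}
    for name in names:
        # find_spec returns None when a top-level module cannot be found.
        spec = importlib.util.find_spec(name)
        origins[name] = spec.origin if spec is not None else None
    return origins


# Package names taken from this thread; either may be missing locally.
for name, origin in find_origins(["pattern", "pattern3"]).items():
    print(name, "->", origin or "not installed")
```

If both names resolve, the printed paths show which site-packages directory each one comes from, which makes it easier to tell which copy a given import statement is actually using.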