allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/

HTML improvements #64

Closed: seanmacavaney closed this issue 2 years ago

seanmacavaney commented 3 years ago

Right now, HTML content for datasets (e.g., clueweb09, clueweb12, and the forthcoming cc-news-en #63) is meant to be handled by HtmlDocExtractor. I'll write up and discuss specific proposals later, but for now, here are several problems with the current setup:

Ease of use

All the tools that use ir_datasets so far (e.g., opennir, pyterrier, capreolus, diffir) interface with it using dataset IDs alone and do not provide an easy way to apply a wrapper, which is necessary for it to work with models that take text data.

Ideally: One would be able to get the raw text of a document by providing only a dataset ID (potentially structured; see below). At the same time, we do not want to pollute the dataset ID namespace with a ton of versions.
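
For concreteness, here is roughly what the current friction looks like (the wrapper's import path and the field names below are assumptions from memory, not verified against the code):

import ir_datasets

# Loading by dataset ID alone is all that opennir/pyterrier/capreolus/diffir do:
dataset = ir_datasets.load('clueweb09/en')

# Getting plain text out of the HTML bodies currently requires the caller to
# wrap the dataset themselves (import path/constructor assumed here):
wrapped = ir_datasets.wrappers.HtmlDocExtractor(dataset)
for doc in wrapped.docs_iter():
    print(doc.doc_id, doc.body[:80])
    break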

Efficiency and multiprocessing

The implementation uses BS4, which ends up being pretty slow. As a result, it uses multiprocessing, but this is not ideal because we do not know which application is using it or how much memory forking would duplicate. It's probably best to stick with multithreading instead, if possible.

BS4 is nice because it handles a bunch of HTML quirks. A replacement should be able to do the same reasonably well.

Ideally: Documents would be processed with minimal overhead, without falling back on multiprocessing.
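
For example, something along these lines would keep everything in one process (a sketch only; parse_fn stands in for whichever HTML-to-text function we settle on, and threads only pay off if it spends most of its time in C code outside the GIL):

from concurrent.futures import ThreadPoolExecutor

def extract_texts(docs_iter, parse_fn, workers=4):
    # Threads avoid forking whatever memory the host application happens to
    # hold, unlike a multiprocessing pool. (For huge collections you'd want
    # bounded submission rather than map's eager consumption.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(lambda doc: (doc.doc_id, parse_fn(doc.body)), docs_iter)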

Document structure

HtmlDocExtractor simply extracts the raw text. As mentioned in #63, it would be nice to be able to extract more structure.

Ideally: One could easily access the structure of the document, which could be helpful with segmentation and such.
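
As a strawman for what that could look like (field names are purely illustrative, not an existing ir_datasets type):

from typing import List, NamedTuple

class ExtractedDoc(NamedTuple):
    # Illustrative only -- not an existing ir_datasets type.
    title: str
    headings: List[str]    # h1-h6 text, in document order
    paragraphs: List[str]  # block-level text, convenient for passage segmentation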

seanmacavaney commented 3 years ago

This is very much related to #72

seanmacavaney commented 3 years ago

Revisit the pyterrier examples (#73) when this change is made.

seanmacavaney commented 3 years ago

tl;dr: A 27.7x speedup by switching parsers, down to 3ms/doc on my machine. This seems /okay/ to do on-demand in a property, as planned for #72, but it would be better if we could find a way to speed it up further.

The bs4 parser is much slower than using lxml.etree.HTMLParser:

Average over 3 runs of extracting text from the first 200 clueweb09/en documents:

bs4 used an adaptation of the current implementation:

import ir_datasets

def bs4_parser(body):
    # Tags whose text should not contribute to the output.
    ignore = {'[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script', 'style'}
    bs4 = ir_datasets.lazy_libs.bs4()  # lazily-imported bs4 module
    soup = bs4.BeautifulSoup(body, 'html.parser')
    output = ''
    for t in soup.find_all(text=True):
        # Keep text nodes that are not inside an ignored tag and are not comments.
        if t.parent.name not in ignore and not isinstance(t, bs4.element.Comment):
            output += '{} '.format(t)
    return output

lxml used the following (which should be cleaned up and adjusted to handle the specified encoding, if provided):

from io import BytesIO
from lxml import etree

def html_parser(body):
  parser = etree.HTMLParser()
  tree = etree.parse(BytesIO(body), parser)
  IGNORE_TAGS = {'[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script', 'style'}
  def x(a):
    # Recursively gather each element's text and tail, skipping ignored tags and comments.
    if a.tag in IGNORE_TAGS or isinstance(a, etree._Comment):
      text, tail = None, None
    else:
      try:
        text = a.text
      except UnicodeError:
        text = ''
      try:
        tail = a.tail
      except UnicodeError:
        tail = ''
    seq = [text] + [x(b) for b in a] + [tail]
    return ' '.join([s for s in seq if s])
  result = x(tree.getroot())
  return result

Can this be done even faster using lxml.sax?

Ignoring whitespace, the two implementations produced identical results on the first 200 documents.
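
A timing harness along these lines reproduces the comparison (a sketch, not the exact script behind the numbers above):

import itertools
import time
import ir_datasets

def time_parser(parse_fn, n_docs=200, runs=3):
    # Average seconds per document for parse_fn over the first n_docs
    # clueweb09/en bodies; bodies are materialised up front so dataset I/O
    # is excluded from the measurement.
    bodies = [doc.body for doc in itertools.islice(
        ir_datasets.load('clueweb09/en').docs_iter(), n_docs)]
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for body in bodies:
            parse_fn(body)
        totals.append(time.perf_counter() - start)
    return sum(totals) / len(totals) / n_docs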

seanmacavaney commented 3 years ago

Re: encoding, lxml's documentation has a suggestion there:

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.
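
i.e., letting UnicodeDammit handle only the decoding and feeding the result to lxml, roughly (a sketch):

from bs4 import UnicodeDammit
from lxml import etree

def decode_then_parse(body):
    # bs4's UnicodeDammit guesses the encoding; lxml does the (fast) parsing.
    dammit = UnicodeDammit(body)
    parser = etree.HTMLParser()
    parser.feed(dammit.unicode_markup)
    return parser.close()  # root element of the parsed tree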

seanmacavaney commented 3 years ago

bs4 gets much, much faster when cchardet is installed. Still, it's faster to bypass UnicodeDammit and use cchardet directly.

Latest timings (all handling encoding)

As suspected, the sax parser ended up being faster, though not by as much as I was hoping. Maybe there's a way to speed it up further?

import codecs
import io
import cchardet as chardet
from lxml import etree

def sax_parser(body, did):
  # did (the doc_id) is unused here but kept for debugging context.
  sax = MySax()
  parser = etree.HTMLParser(target=sax)
  # Detect the encoding directly with cchardet (faster than UnicodeDammit).
  encoding = chardet.detect(body)['encoding'] or 'utf8'
  cdc = codecs.lookup(encoding)
  while body:
    text, count = cdc.decode(body, 'ignore')
    parser.feed(text)
    body = body[count:]
  parser.close()
  return str(sax)

class MySax:
  # Tags whose contents should be dropped from the extracted text.
  IGNORE_TAGS = {'noscript', 'meta', 'input', 'script', 'style'}
  def __init__(self):
    self.text = io.StringIO()
    self.ignore_tag_stack = []
  def __str__(self):
    self.text.seek(0)
    return self.text.read()
  def data(self, data):
    # Only keep character data when not inside an ignored tag.
    if not self.ignore_tag_stack:
      self.text.write(data)
  def start(self, tag, attrs):
    tag = tag.lower()
    if tag in self.IGNORE_TAGS:
      self.ignore_tag_stack.append(tag)
  def end(self, tag):
    tag = tag.lower()
    if tag in self.IGNORE_TAGS:
      # Pop until the matching open tag (tolerates malformed nesting).
      while self.ignore_tag_stack and self.ignore_tag_stack.pop() != tag:
        pass
  def close(self):
    pass
  def comment(self, data):
    pass
  def doctype(self, *args):
    pass
  def pi(self, *args):
    pass
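
Usage, for reference (a sketch; field names as in the WARC-based datasets):

dataset = ir_datasets.load('clueweb09/en')
doc = next(iter(dataset.docs_iter()))
print(sax_parser(doc.body, doc.doc_id))
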
seanmacavaney commented 2 years ago

Addressed with #173.