CenterForOpenScience / pydocx

An extendable docx file format parser and converter
Other
183 stars 55 forks source link

`ValueError: invalid literal for int() with base 10:` crash calling `start_margin_position` #220

Closed winhamwr closed 7 years ago

winhamwr commented 7 years ago

Some .docx files that we're seeing in the wild are using a non-integer value for the indentation_hanging attribute. It looks something like: <ind hanging="504.00000000000006">

This causes a crash when we try to cast to int.

I'm not sure if these documents are violating the spec, but we've seen 123 cases of this in the last 8 days.

Traceback

Traceback (most recent call last):
  File "/home/policystat/env/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/policystat/env/lib/python2.7/site-packages/newrelic-2.58.1.44/newrelic/hooks/application_celery.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/policystat/env/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/policystat/env/lib/python2.7/site-packages/jobtastic/task.py", line 362, in run
    task_result = self.calculate_result(*args, **kwargs)
  File "/opt/pstat/versions/20160726-154201/pstat/core/import_translation/tasks/__init__.py", line 151, in calculate_result
    self.error_handler(e, *args, **kwargs)
  File "/opt/pstat/versions/20160726-154201/pstat/core/implementation/tasks.py", line 66, in error_handler
    super(BuildImport, self).error_handler(e, *args, **kwargs)
  File "/opt/pstat/versions/20160726-154201/pstat/core/import_translation/tasks/__init__.py", line 133, in calculate_result
    result_dict = self.extract_html(*args, **kwargs)
  File "/opt/pstat/versions/20160726-154201/pstat/core/implementation/tasks.py", line 59, in extract_html
    force_nodupe=True,
  File "/opt/pstat/versions/20160726-154201/pstat/document_import/models.py", line 1773, in build_import
    document_import = self._build_import(category, user, force_nodupe, tenant)
  File "/opt/pstat/versions/20160726-154201/pstat/document_import/models.py", line 1684, in _build_import
    metadata=metadata,
  File "/opt/pstat/versions/20160726-154201/pstat/core/import_translation/translation.py", line 711, in get_html_from_file
    html = exporter.export()
  File "/opt/pstat/versions/20160726-154201/pstat/core/import_translation/translation.py", line 151, in export
    html = super(DocxToHTMLExporter, self).export()
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/html.py", line 211, in export
    for result in super(PyDocXHTMLExporter, self).export()
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/html.py", line 209, in <genexpr>
    result.to_html() if isinstance(result, HtmlTag)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/base.py", line 123, in export
    for result in self.export_node(document):
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/base.py", line 218, in export_node
    for result in results:
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/html.py", line 127, in apply
    for result in results:
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/base.py", line 218, in export_node
    for result in results:
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/html.py", line 127, in apply
    for result in results:
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/base.py", line 251, in yield_nested
    for item in iterable:
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/base.py", line 290, in yield_numbering_spans
    numbering_spans = builder.get_numbering_spans()
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 415, in get_numbering_spans
    new_items.extend(self.process_component(index, component))
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 399, in process_component
    for new_component in self.handle_paragraph(index, component):
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 362, in handle_paragraph
    level = self.get_numbering_level(paragraph)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/util/memoize.py", line 31, in __call__
    value = self.func(*args)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 476, in get_numbering_level
    return self.detect_faked_list(paragraph)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 563, in detect_faked_list
    left_position = self.get_left_position_for_paragraph(paragraph)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/util/memoize.py", line 31, in __call__
    value = self.func(*args)
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/export/numbering_span.py", line 516, in get_left_position_for_paragraph
    left_position = properties.start_margin_position
  File "/home/policystat/env/lib/python2.7/site-packages/pydocx/openxml/wordprocessing/paragraph_properties.py", line 40, in start_margin_position
    start_margin -= int(self.indentation_hanging)
ValueError: invalid literal for int() with base 10: '504.00000000000006'
kylegibson commented 7 years ago

How does the spec say we should be handling these?

jlward commented 7 years ago

How does the spec say we should be handling these?

I didn't read the spec.

kylegibson commented 7 years ago

Why not? How then do you know that we are actually "correctly" handling margin positions? Simply not erroring is not sufficient.

jlward commented 7 years ago

Why not? How then do you know that we are actually "correctly" handling margin positions? Simply not erroring is not sufficient.

I can switch the update note to say that we are no longer erroring.

kylegibson commented 7 years ago

I can switch the update note to say that we are no longer erroring.

That is preferred over the current note.

winhamwr commented 7 years ago

How does the spec say we should be handling these?

What's the website you gentlemen use to browse the spec? Our documentation only links to the spec itself, which is pretty horrible reading. It would be great to have a link to that spec browser site in the contributtion docs.

I didn't read the spec.

I have the same question as Kyle. Why not? Pydocx is based on the spec. If the spec says to ignore non-conforming attributes, rather than truncate them, that's just as easy for us to do.

jlward commented 7 years ago

What's the website you gentlemen use to browse the spec? Our documentation only links to the spec itself, which is pretty horrible reading. It would be great to have a link to that spec browser site in the contributtion docs.

Honestly, I don't normally read the spec (for better or for worse).

I have the same question as Kyle. Why not? Pydocx is based on the spec. If the spec says to ignore non-conforming attributes, rather than truncate them, that's just as easy for us to do.

Because I don't want to spend an hour to find and figure out what the spec says to maybe save one pixel.

kylegibson commented 7 years ago

What's the website you gentlemen use to browse the spec? Our documentation only links to the spec itself, which is pretty horrible reading.

I don't use any website. I always go to the PDF. It's the most authoritative source I know of, and it is very detailed. I don't think I've ever read a spec or reference that wasn't horrible reading.

winhamwr commented 7 years ago

Because I don't want to spend an hour to find and figure out what the spec says to maybe save one pixel.

The job of pydocx is to do that so that every future user doesn't have to worry about things like this. Have you tried and failed to find what you're looking for or just not tried?

jlward commented 7 years ago

The job of pydocx is to do that so that every future user doesn't have to worry about things like this. Have you tried and failed to find what you're looking for or just not tried?

I haven't tried. If Kyle could link me to the PDF, I will timebox looking for this for an hour.

kylegibson commented 7 years ago

It's linked here: https://pydocx.readthedocs.io/en/latest/conformance.html

You'll need to download Part 1 of the fourth edition. The PDF is contained within the zip.

winhamwr commented 7 years ago

I will timebox looking for this for an hour.

That seems like a good timebox use, to me.

jlward commented 7 years ago

screenshot from 2016-07-29 15-30-43

jlward commented 7 years ago

This tells me that if this number is off by one, it'll be off by less 1/1440th of an inch. Which doesn't seem significant to me. Aside from that, I can't really tell if this should end up being an int or a float

winhamwr commented 7 years ago

Looks good!

jlward commented 7 years ago

@kylegibson care to weigh in? You are much more familiar with reading and understanding the spec for this.

Simplici commented 6 years ago

it's a pity that in github we do not see the release 0.9.10 which has solved this bug.

kylegibson commented 6 years ago

@Simplici - Just created a release, thank you! https://github.com/CenterForOpenScience/pydocx/releases