NationalLibraryOfNorway / meteor

A python module and REST API for automatic extraction of metadata from PDF files
Apache License 2.0

Language detection causes traceback with documents that don't have a text layer #7

Closed: osma closed this issue 1 year ago

osma commented 1 year ago

Hi, I've been testing Meteor with some of our documents. I found a couple of documents that cause the new language detection to throw an exception. Examples:

$ curl -d fileUrl=https://osuva.uwasa.fi/bitstream/handle/10024/11225/Osuva_Yli-Viitala_Arrasvuori_Wathen_2019.pdf http://127.0.0.1:5000/json
{"error":"Error while processing file https://osuva.uwasa.fi/bitstream/handle/10024/11225/Osuva_Yli-Viitala_Arrasvuori_Wathen_2019.pdf"}

$ curl -d fileUrl=https://taju.uniarts.fi/bitstream/handle/10024/5988/KOHA_Kotola_Mikkonen_Palin_artikkeli_2017.pdf http://127.0.0.1:5000/json
{"error":"Error while processing file https://taju.uniarts.fi/bitstream/handle/10024/5988/KOHA_Kotola_Mikkonen_Palin_artikkeli_2017.pdf"}

Over on the Meteor side, the output looks like this:

Traceback (most recent call last):
  File "/home/xxx/git/meteor/src/util.py", line 94, in process_and_remove
    results = self.meteor.run(filepath)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/metadata_extract/meteor.py", line 29, in run
    finder.extract_metadata()
  File "/home/xxx/git/meteor/metadata_extract/finder.py", line 268, in extract_metadata
    self.get_language()
  File "/home/xxx/git/meteor/metadata_extract/finder.py", line 221, in get_language
    lang = langdetect.detect(' '.join(self.doc.pages.values()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
           ^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 136, in detect
    probabilities = self.get_probabilities()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 143, in get_probabilities
    self._detect_block()
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 150, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.

The common factor in these two documents seems to be that they don't have a text layer, so it's likely that the text extraction fails and thus the language detection algorithm has no text to work with.

I think it's OK for documents like this to fail the extraction (they're badly made PDFs!), but the tool should handle such cases more robustly and avoid tracebacks.
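A minimal sketch of what such a guard could look like, not Meteor's actual code. The function name `detect_language_safe` and its parameters are hypothetical. The detector is passed in as a callable so the sketch runs without langdetect installed; in Meteor one would presumably pass `langdetect.detect` together with `(LangDetectException,)`.

```python
from typing import Callable, Mapping, Optional, Tuple, Type


def detect_language_safe(
    pages: Mapping[object, str],
    detect: Callable[[str], str],
    errors: Tuple[Type[BaseException], ...],
) -> Optional[str]:
    """Return a language code, or None when there is no usable text.

    Hypothetical helper: `pages` mirrors the page-to-text mapping seen in
    the traceback (self.doc.pages), `detect` is the detection function,
    and `errors` lists the exception types that mean "could not detect".
    """
    text = ' '.join(pages.values()).strip()

    # A PDF without a text layer yields empty or whitespace-only pages;
    # skip detection entirely when no alphabetic character is present.
    if not any(ch.isalpha() for ch in text):
        return None

    try:
        return detect(text)
    except errors:
        # Detection can still fail on odd input; report "unknown"
        # instead of letting the exception propagate as a traceback.
        return None
```

With this shape, a document without a text layer yields `None` for the language, and the caller can decide whether that counts as a failed extraction.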

Also, the HTTP status code of the error response was still 200. I think it would be better to return an HTTP error code such as 400 Bad Request.
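To illustrate the status-code suggestion, here is a framework-agnostic sketch using only the standard library. Everything in it is an assumption made for illustration: `ProcessingError` and `build_response` are hypothetical names, and I haven't checked how Meteor's REST layer actually builds its responses.

```python
import json
from http import HTTPStatus
from typing import Callable, Tuple


class ProcessingError(Exception):
    """Hypothetical exception for documents that cannot be processed."""


def build_response(process: Callable[[str], dict], file_url: str) -> Tuple[int, str]:
    """Run `process` and return (status code, JSON body).

    Hypothetical helper: `process` stands in for whatever function
    performs the metadata extraction for a given file URL.
    """
    try:
        results = process(file_url)
    except ProcessingError:
        # Same error message as in the curl output above, but paired
        # with an error status instead of 200.
        body = {"error": f"Error while processing file {file_url}"}
        return int(HTTPStatus.BAD_REQUEST), json.dumps(body)
    return int(HTTPStatus.OK), json.dumps(results)
```

A client can then check the status code alone to tell success from failure, without parsing the body for an "error" key.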