Hi,
I've been testing Meteor with some of our documents. I found a couple of documents that cause the new language detection to throw an exception. Examples:
$ curl -d fileUrl=https://osuva.uwasa.fi/bitstream/handle/10024/11225/Osuva_Yli-Viitala_Arrasvuori_Wathen_2019.pdf http://127.0.0.1:5000/json
{"error":"Error while processing file https://osuva.uwasa.fi/bitstream/handle/10024/11225/Osuva_Yli-Viitala_Arrasvuori_Wathen_2019.pdf"}
$ curl -d fileUrl=https://taju.uniarts.fi/bitstream/handle/10024/5988/KOHA_Kotola_Mikkonen_Palin_artikkeli_2017.pdf http://127.0.0.1:5000/json
{"error":"Error while processing file https://taju.uniarts.fi/bitstream/handle/10024/5988/KOHA_Kotola_Mikkonen_Palin_artikkeli_2017.pdf"}
Over on the Meteor side, the output looks like this:
Traceback (most recent call last):
  File "/home/xxx/git/meteor/src/util.py", line 94, in process_and_remove
    results = self.meteor.run(filepath)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/metadata_extract/meteor.py", line 29, in run
    finder.extract_metadata()
  File "/home/xxx/git/meteor/metadata_extract/finder.py", line 268, in extract_metadata
    self.get_language()
  File "/home/xxx/git/meteor/metadata_extract/finder.py", line 221, in get_language
    lang = langdetect.detect(' '.join(self.doc.pages.values()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
           ^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 136, in detect
    probabilities = self.get_probabilities()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 143, in get_probabilities
    self._detect_block()
  File "/home/xxx/git/meteor/venv/lib/python3.11/site-packages/langdetect/detector.py", line 150, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.
The common factor in these two documents seems to be that they don't have a text layer, so it's likely that the text extraction fails and thus the language detection algorithm has no text to work with.
I think it's OK for documents like this to fail the extraction (they're badly made PDFs!), but the tool should probably handle this case gracefully instead of producing a traceback.
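A minimal sketch of the kind of guard I have in mind, in case it helps. The function name `detect_language_or_none` is hypothetical and not part of Meteor; I'm assuming the pages are a mapping of page number to extracted text, as the `self.doc.pages.values()` call in the traceback suggests. It skips detection when the text contains no letters and also catches `LangDetectException` in case langdetect still finds nothing to work with:

```python
from typing import Mapping, Optional


def detect_language_or_none(pages: Mapping[int, str]) -> Optional[str]:
    """Return a language code, or None when the text gives nothing to detect.

    Hypothetical helper, not Meteor's actual code.
    """
    text = " ".join(pages.values())

    # A PDF without a text layer yields empty (or letter-free) text;
    # skip detection in that case instead of letting langdetect raise.
    if not any(ch.isalpha() for ch in text):
        return None

    # Imported here so the guard above works even without the dependency.
    import langdetect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return langdetect.detect(text)
    except LangDetectException:
        # e.g. "No features in text."
        return None
```

With something like this, `get_language()` could simply leave the language unset when `None` comes back, rather than aborting the whole extraction.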
Also, the HTTP status code was still 200. I think it would be better to return an HTTP error code such as 400 Bad Request.
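To illustrate the status code suggestion, here is a rough, framework-neutral sketch. I don't know how the web layer is actually structured, so `ProcessingError`, `respond` and the `process` callable are all hypothetical names; the only parts taken from the real output are the shape and wording of the error message:

```python
import json
from http import HTTPStatus
from typing import Callable, Tuple


class ProcessingError(Exception):
    """Hypothetical exception for a file that cannot be processed."""


def respond(file_url: str, process: Callable[[str], dict]) -> Tuple[int, str]:
    """Return (status_code, json_body) for a processing request.

    Hypothetical sketch; `process` stands in for whatever runs the extraction.
    """
    try:
        results = process(file_url)
    except ProcessingError:
        # Same error body as today, but with a non-200 status code.
        body = {"error": f"Error while processing file {file_url}"}
        return HTTPStatus.BAD_REQUEST.value, json.dumps(body)
    return HTTPStatus.OK.value, json.dumps(results)
```

That way a client can tell from the status code alone that the request failed, without having to inspect the JSON body for an `error` key.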