Open j-allan opened 9 years ago
I remember this is probably a problem in the raw data caused by the pdftotext recognizer application (correct me if I am wrong). It happens when there is no whitespace (just a line break) between two lines of text. For example, if the title is:
A Quasi-Synchronous Dependence Model
for Information Retrieval
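A minimal sketch of the suspected failure mode, assuming the title arrives as two separate lines and some step concatenates them without a separator. The variable names are hypothetical and this is not the actual pipeline code:

```python
# Hypothetical illustration only; the real extraction code is not shown in this thread.
# Assumption: the title was split across two lines with no trailing whitespace.
lines = ["A Quasi-Synchronous Dependence Model", "for Information Retrieval"]

# Concatenating with no separator reproduces the smashed word.
smashed = "".join(lines)

# Joining with a single space keeps the words apart.
joined = " ".join(lines)

print(smashed)
print(joined)
```

If this is what happens, the first print shows "Modelfor" and the second shows "Model for".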
On Fri, Jan 16, 2015 at 11:47 AM, j-allan notifications@github.com wrote:
Search for sequential dependence model. Note that some of the titles have words smashed together. For example, "A Quasi-Synchronous Dependence Modelfor Information Retrieval", where "Modelfor" should be "Model for". Don't know where in the pipeline this error occurred, but it happens often enough that there's another instance in the same top 10 list.
— Reply to this email directly or view it on GitHub https://github.com/CIIR/Proteus/issues/50.
Jiepu Jiang
If that's so, is it possible to hack things so that doesn't happen? Until we know that this is a bug we cannot work around, I think this issue should be left open. It could be affecting content lines, too. And what happens with hyphenated words?
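One possible workaround, sketched under assumptions: join the extracted lines with a space, and treat a trailing hyphen specially. The function name `join_lines` and the lowercase/uppercase rule are hypothetical heuristics, not anything taken from the Proteus code:

```python
def join_lines(lines):
    """Join extracted text lines into one string (heuristic sketch)."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
        elif out.endswith("-") and line[0].islower():
            # Assumption: a lowercase continuation means the word was split
            # across lines, so drop the hyphen and join without a space.
            out = out[:-1] + line
        elif out.endswith("-"):
            # Assumption: an uppercase continuation is a real compound
            # such as "Quasi-Synchronous", so keep the hyphen.
            out += line
        else:
            # Ordinary line break: separate the words with one space.
            out += " " + line
    return out


print(join_lines(["A Quasi-Synchronous Dependence Model", "for Information Retrieval"]))
print(join_lines(["A Quasi-", "Synchronous Dependence Model"]))
```

The heuristic is imperfect: a genuine lowercase compound broken at its hyphen (for example "well-" followed by "known") would lose the hyphen, so content lines may need a dictionary check instead.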
This appears to be a bug in the process that transforms the XML output from pstotext into trectext format.
When looking at ACM article 2034732 (http://dl.acm.org/citation.cfm?id=2034732), the title is displayed as: "Efcient Keyword Extractionfor Meaningful Document Perception"
The XML from pstotext is: