CIIR / Proteus

Million Book Project
Title words smashed together sometimes #50

Open j-allan opened 9 years ago

j-allan commented 9 years ago

Search for sequential dependence model. Note that some of the titles have words smashed together. For example, "A Quasi-Synchronous Dependence Modelfor Information Retrieval", where "Modelfor" should be "Model for". I don't know where in the pipeline this error occurred, but it happens often enough that there's another instance in the same top 10 list.

jiepujiang commented 9 years ago

As I recall, this is probably a problem in the raw data caused by the pdftotext recognizer application (correct me if I am wrong). It happens when there's no white space (just a line break) between two lines of text. For example, if the title is:

A Quasi-Synchronous Dependence Model for Information Retrieval
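To make the failure mode described above concrete, here is a minimal Python sketch. It is not the Proteus conversion code; `join_lines` is a hypothetical helper, and the position of the line break (after "Model") is an assumption inferred from the "Modelfor" symptom.

```python
def join_lines(lines, sep=" "):
    """Join extracted text lines, dropping empty lines and edge whitespace.

    Hypothetical helper for illustration only; not part of Proteus.
    """
    return sep.join(part.strip() for part in lines if part.strip())


# Assumption: the extractor emitted the title as two lines, broken after "Model".
title_lines = [
    "A Quasi-Synchronous Dependence Model",
    "for Information Retrieval",
]

# Concatenating with no separator reproduces the smashed title.
smashed = join_lines(title_lines, sep="")

# Joining with a single space keeps the words apart.
fixed = join_lines(title_lines)

print(smashed)
print(fixed)
```

If the converter concatenates line contents directly, a join of this kind with an explicit separator would be one place to look for a fix.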

j-allan commented 9 years ago

If that's so, is it possible to hack things so that doesn't happen? Until we know that this is a bug we cannot work around, I think this issue should be left open. It could be affecting content lines, too. And what happens with hyphenated words?
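On the hyphenation question, one possible workaround can be sketched in Python as follows. This is only an illustration under stated assumptions: `join_with_hyphenation` is a hypothetical name, the input lines are invented examples, and the rule "a trailing hyphen means the word continues on the next line" is a heuristic, not something the pipeline is known to do.

```python
def join_with_hyphenation(lines):
    """Join extracted lines with spaces, merging words split by a trailing hyphen.

    Hypothetical helper for illustration only; not part of Proteus.
    """
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
        elif out.endswith("-"):
            # Assumption: a trailing hyphen marks a word broken across lines,
            # so drop the hyphen and glue the two halves together.
            out = out[:-1] + line
        else:
            out = out + " " + line
    return out


# Invented example: a word split across two lines is rejoined.
merged = join_with_hyphenation(["Efficient Keyword Extrac-", "tion for Documents"])

# Limitation: a genuine hyphen at a line end is lost by this heuristic.
lossy = join_with_hyphenation(["A Quasi-", "Synchronous Model"])

print(merged)
print(lossy)
```

The second call shows why this heuristic is not safe on its own: compound words such as "Quasi-Synchronous" would lose their hyphen if the line break fell right after it, so a dictionary check or similar would probably be needed.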

mzarozinski commented 9 years ago

This appears to be a bug in the process that transforms the XML output from pstotext into trectext format.

When looking at ACM article 2034732 (http://dl.acm.org/citation.cfm?id=2034732) the title is displayed as: "Efcient Keyword Extractionfor Meaningful Document Perception"

The XML from pstotext is:

If you dump the terms from the index (on sydney) via:

java -jar ./target/homer-0.4-SNAPSHOT.jar doc --id=2034732 --index=/usr/lag/data2/michaelz/index/acmdl --tokenize=true | less

you see that the terms are tokenized together:

17 : keyword
18 : extractionfor
19 : meaningful

NOTE: this is not limited to titles. It appears to happen whenever a new line is found, because the first word of a line is NOT preceded by a space. Ex: