clulab / eidos

Machine reading system for World Modelers
Apache License 2.0

Time normalization config in eidos.conf and reference.conf #443

Closed bgyori closed 5 years ago

bgyori commented 5 years ago

I've been trying to configure Eidos to use the time normalization feature and I'm running into some issues. There are three issues here, but they are related, so I'm putting them all in one place.

First, I am wondering if some of the differences in eidos.conf and reference.conf are on purpose or not.

  1. In reference.conf timeNormModelPath is set to

    timeNormModelPath = /org/clulab/wm/eidos/english/models/timenorm_model.hdf5                                                

    whereas in eidos.conf it is set to

    timeNormModelPath = /org/clulab/wm/eidos/models/timenorm_model.hdf5

    I think between the two, the latter is the better default setting since timenorm_model.hdf5 is part of the repo at org/clulab/wm/eidos/models/timenorm_model.hdf5. Should I update the default reference.conf to use this path?

  2. Another inconsistency between the two conf files is that in reference.conf

    useTimeNorm = false

    is set but there is no useTimeNorm row in eidos.conf. Would it make sense to include the same row with the same default value in eidos.conf as well?

  3. Now, using the settings as follows:

    timeNormModelPath = /org/clulab/wm/eidos/models/timenorm_model.hdf5
    ...
    useTimeNorm = true

    and running

    java -Xmx12G -cp /Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar org.clulab.wm.eidos.apps.ExtractFromDirectory /Users/ben/tmp/eidos/docs /Users/ben/tmp/eidos/docs

    I get

15:22:16.328 [scala-execution-context-global-11] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
15:22:17.500 [scala-execution-context-global-11] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.2 sec].
jar:file:/Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar!/org/clulab/wm/eidos/models/timenorm_model.hdf5
Exception in thread "main" java.nio.file.FileSystemNotFoundException
    at com.sun.nio.zipfs.ZipFileSystemProvider.getFileSystem(ZipFileSystemProvider.java:171)
    at com.sun.nio.zipfs.ZipFileSystemProvider.getPath(ZipFileSystemProvider.java:157)
    at java.nio.file.Paths.get(Paths.java:143)
    at org.clulab.wm.eidos.EidosSystem$LoadableAttributes$.apply(EidosSystem.scala:129)
    at org.clulab.wm.eidos.EidosSystem.<init>(EidosSystem.scala:153)
    at org.clulab.wm.eidos.apps.ExtractFromDirectory$.delayedEndpoint$org$clulab$wm$eidos$apps$ExtractFromDirectory$1(ExtractFromDirectory.scala:14)
    at org.clulab.wm.eidos.apps.ExtractFromDirectory$delayedInit$body.apply(ExtractFromDirectory.scala:9)
    at scala.Function0.apply$mcV$sp(Function0.scala:34)
    at scala.Function0.apply$mcV$sp$(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App.$anonfun$main$1$adapted(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:389)
    at scala.App.main(App.scala:76)
    at scala.App.main$(App.scala:74)
    at org.clulab.wm.eidos.apps.ExtractFromDirectory$.main(ExtractFromDirectory.scala:9)
    at org.clulab.wm.eidos.apps.ExtractFromDirectory.main(ExtractFromDirectory.scala)

Note that I added a debug print: println(timeNormResource) on EidosSystem.scala line 127 to produce this line

jar:file:/Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar!/org/clulab/wm/eidos/models/timenorm_model.hdf5

I also confirmed by browsing the jar file itself that /org/clulab/wm/eidos/models/timenorm_model.hdf5 is at the specified location within the JAR.

Thanks for your help!

kwalcock commented 5 years ago

Hi Ben,

Sorry about this frustration. Whether the setting is in reference.conf or eidos.conf probably depended on how likely people thought it would need to change, plus whether it depended on language. The true/false was perhaps not likely to change, but maybe the language was. Anyway, it must have been subjective, could be improved, and you're probably correct.

The other problem is worse. The model can't be accessed from a jar file. Recent versions of sbt don't work from some local target directory where there might still be access to a file with resources, but instead from some temp directory. One can try to trick sbt by keeping the model somehow accessible, reverting to an older version of sbt (by editing project/build.properties), by using IntelliJ or Eclipse, or by changing a line of code. We hope this is fixed before a larger audience needs to run it. If need be I'll track down when sbt made that change.

In EidosSystem.scala change

val file = Paths.get(timeNormResource.toURI()).toFile().getAbsolutePath()

to

val file = "/home/you/timenorm_model.hdf5"

Just the messenger,

Keith
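
A possible alternative to hard-coding an absolute path, not something proposed in this thread: `Paths.get` on a `jar:` URI throws `FileSystemNotFoundException` unless a zip file system for that JAR has already been opened, but the resource can still be read as a stream and copied to a temporary file on disk. The sketch below uses only standard Java APIs. The class and method names (`ResourceExtractor`, `extractToTempFile`) are hypothetical, and whether the model loader is happy with such a temporary path is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of Eidos.
final class ResourceExtractor {
    private ResourceExtractor() {}

    // Copies a classpath resource (which may live inside a JAR) to a temporary
    // file and returns that file's path on the default file system.
    static Path extractToTempFile(Class<?> anchor, String resourcePath, String suffix)
            throws IOException {
        try (InputStream in = anchor.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourcePath);
            }
            Path tmp = Files.createTempFile("resource-", suffix);
            tmp.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}
```

Under these assumptions, a call such as `ResourceExtractor.extractToTempFile(SomeClass.class, "/org/clulab/wm/eidos/models/timenorm_model.hdf5", ".hdf5")` would yield a path that works the same way whether the program runs from a JAR or from a directory of class files. The cost is one extra copy of the model at startup.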

bgyori commented 5 years ago

Thanks, so if I understand correctly, if I put the hdf5 file outside the JAR and reference it by its absolute path it should work. Let me try that and I'll report back!

kwalcock commented 5 years ago

Yes, that's it. Also please be advised that we are working on related performance issues. Don't plan to run lots of files with timenorm on.

bgyori commented 5 years ago

Alright, that seems to have worked as far as the file path goes. In particular,

java -Xmx12G -cp /Users/ben/tmp/eidos/target/scala-2.12/eidos-assembly-0.2.2-SNAPSHOT.jar org.clulab.wm.eidos.apps.ExtractFromDirectory /Users/ben/tmp/eidos/docs /Users/ben/tmp/eidos/docs

works as expected, and timexes show up in the output JSON-LD.

However, the other reading mode we have been using, which is for reading snippets of text directly using an instance of EidosSystem and calling its extractFromText method (from Python) gives me this error:

JavaException: JVM exception occurred: Text 'nullT00:00:00' could not be parsed at index 0

Any clues what might be behind this?

kwalcock commented 5 years ago

Still working on it... The program ExtractFromDirectory uses extractFromText, so it should be working in general. Can you send a specific sentence that's a problem and/or the important part of code? Thanks. It doesn't seem that @EgoLaparra is online to respond.

bgyori commented 5 years ago

I think I have a guess: from Python I'm passing scala.Some(None) as the fourth argument which is the documentCreationTime. I thought passing in None would be adequate because None is defined as the default argument, and ExtractFromDirectory doesn't specify this argument:

val annotatedDocuments = Seq(reader.extractFromText(text))

With some experimentation, I found that if I change the argument to scala.Some('2018'), I get this error:

JavaException: JVM exception occurred: Text '2018T00:00:00' could not be parsed at index 4

EgoLaparra commented 5 years ago

For the moment, the DocTime must be in YYYY-MM-DD format. Try passing something like scala.Some('2018-09-24').
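
The two error messages reported above look like what `java.time.LocalDateTime.parse` produces when `T00:00:00` is appended to the document creation time before parsing. That Eidos builds the timestamp this way is an inference from the messages, not something confirmed in this thread. Under that assumption, a minimal sketch of the behavior (the class `DocTimeDemo` and its methods are hypothetical):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical demo class; it only mimics the assumed parsing step.
final class DocTimeDemo {
    private DocTimeDemo() {}

    // Assumed behavior: append midnight and parse as an ISO-8601 local date-time.
    static LocalDateTime toMidnight(String docTime) {
        return LocalDateTime.parse(docTime + "T00:00:00");
    }

    // Returns the character index at which parsing failed, or -1 on success.
    static int errorIndex(String docTime) {
        try {
            toMidnight(docTime);
            return -1;
        } catch (DateTimeParseException e) {
            return e.getErrorIndex();
        }
    }
}
```

With this sketch, `"2018-09-24"` parses to midnight on that day, `"2018"` fails at index 4 because a `-` is expected after the year, and a null string is concatenated as `nullT00:00:00` and fails at index 0, matching both messages in this thread.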

bgyori commented 5 years ago

Thanks @EgoLaparra, that worked! Let me test it some more and then I'll close this issue.

EgoLaparra commented 5 years ago

By the way, what happens if you don't pass the fourth argument?

bgyori commented 5 years ago

Complicated... The Java-Python bridge called jnius that allows us to use Eidos programmatically at all is not really meant to be used with Scala. Java methods don't have default arguments (you instead define the method multiple times with different sets of arguments), so jnius thinks this method needs 5 arguments and errors if you call it with fewer. This is what prompted e.g. this line: https://github.com/clulab/eidos/blob/master/src/main/scala/org/clulab/wm/eidos/EidosSystem.scala#L32
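
To make the point about default arguments concrete: a Java-style API gets the effect of defaults through overloads that delegate to the full signature, so a caller that only sees JVM method signatures has a shorter form to call. The sketch below is illustrative only. `TextReader` is a hypothetical class whose parameters mirror the shape of `extractFromText`, it is not Eidos code, and `Optional` stands in for Scala's `Option`.

```java
import java.util.Optional;

// Hypothetical class for illustration; not part of Eidos.
final class TextReader {
    // Full signature: every argument must be supplied explicitly.
    String extractFromText(String text, boolean keepText, boolean cagRelevantOnly,
                           Optional<String> documentCreationTime, Optional<String> filename) {
        return String.format("text=%s keepText=%b cagRelevantOnly=%b dct=%s filename=%s",
                text, keepText, cagRelevantOnly,
                documentCreationTime.orElse("none"), filename.orElse("none"));
    }

    // Overload that fills in all defaults and delegates to the full signature.
    String extractFromText(String text) {
        return extractFromText(text, true, true, Optional.empty(), Optional.empty());
    }

    // Overload for callers that only want to set the document creation time.
    String extractFromText(String text, String documentCreationTime) {
        return extractFromText(text, true, true,
                Optional.of(documentCreationTime), Optional.empty());
    }
}
```

Each overload is a real method on the class, so the one-argument and two-argument calls resolve without the caller having to build `Option` values by hand.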

EgoLaparra commented 5 years ago

I see. In any case, we need the actual creation time of the document to get correct normalizations for expressions like "last week". The parser cannot infer it from the text, so when no DocTime is passed, it uses the current date as the reference.
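
As a rough illustration of why the reference date matters (this is not how the timenorm parser works internally, just a sketch with standard `java.time` calls): the interval meant by "last week" can only be computed once a reference date is fixed. The class `RelativeWeek` is hypothetical, and treating weeks as running Monday to Sunday is an assumption.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

// Hypothetical helper for illustration; not the timenorm algorithm.
final class RelativeWeek {
    private RelativeWeek() {}

    // Returns {start, end}, both inclusive, of the Monday-to-Sunday week
    // immediately before the week that contains the reference date.
    static LocalDate[] lastWeek(LocalDate reference) {
        LocalDate thisMonday =
                reference.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return new LocalDate[] { thisMonday.minusWeeks(1), thisMonday.minusDays(1) };
    }
}
```

With a reference date of 2018-09-24, this gives 2018-09-17 through 2018-09-23. The same sentence read with a different reference date resolves to a different interval, which is why falling back to the current date gives wrong results for older documents.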

kwalcock commented 5 years ago

The fourth argument as in filename: Option[String]= None to EidosSystem.annotate? It should be OK. It is only used for the document id which is probably only used for the JSON-LD output.

bgyori commented 5 years ago

Well if you count from 1, not 0, then the 4th argument is documentCreationTime which we discussed above:

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[String] = None, filename: Option[String] = None)

kwalcock commented 5 years ago

I can still count, but maybe it's time for trifocals :-) I didn't realize you were both talking about the same thing.

bgyori commented 5 years ago

Thanks, looks like this is working!

kwalcock commented 5 years ago

@EgoLaparra, I think you'll want to change from

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[String] = None, filename: Option[String] = None)

to

def extractFromText(text: String, keepText: Boolean = true, cagRelevantOnly: Boolean = true,
                      documentCreationTime: Option[LocalDateTime] = None, filename: Option[String] = None)

Neither Eidos nor EidosDocument are in a good position to decide what kind of string is being passed and should let whatever reads or produces the string take care of that. In reading these 17k documents I find that the "creation date" comes in multiple formats and it's not efficient to parse them and convert them to the kind of string that is needed (e.g., eight digits, with dashes, without time) only to have them parsed again, etc.

EgoLaparra commented 5 years ago

What about letting the parser deal with these strings? Eidos could pass whatever it finds, even if the format is not the correct one, and the temporal parser would decide whether it can create a DCT or set it as undefined.

kwalcock commented 5 years ago

That sounds interesting. Perhaps if it is passed a string, it could convert it to an Option[LocalDateTime] and call the other function. Right now the conversion process is on the fragile side. I haven't been watching your timenorm project to know if you have made the update that includes what you want to be used in this large run. Be sure to let me know. Thanks.
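
The string-to-date conversion discussed here could be sketched as follows: try a few known formats and return an empty value when none matches, so the caller never sees a parse exception. The class name `DctParser`, the method name, and the particular list of formats are assumptions for illustration, since the formats actually present in the document metadata are not listed in this thread. Java's `Optional` stands in for Scala's `Option`.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper for illustration; not the Eidos or timenorm implementation.
final class DctParser {
    private DctParser() {}

    // Date-only formats to try, in order (an assumed, non-exhaustive list).
    private static final DateTimeFormatter[] DATE_FORMATS = {
        DateTimeFormatter.ISO_LOCAL_DATE,  // e.g. 2018-09-24
        DateTimeFormatter.BASIC_ISO_DATE   // e.g. 20180924
    };

    // Returns the parsed creation time, or empty if no known format matches.
    static Optional<LocalDateTime> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String s = raw.trim();
        try {
            // Full ISO-8601 local date-time, e.g. 2018-09-24T10:15:30
            return Optional.of(LocalDateTime.parse(s));
        } catch (DateTimeParseException ignored) {
            // fall through to the date-only formats
        }
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return Optional.of(LocalDate.parse(s, format).atStartOfDay());
            } catch (DateTimeParseException ignored) {
                // try the next format
            }
        }
        return Optional.empty();
    }
}
```

An overload of `extractFromText` that accepts a string could then call a helper like this once and pass the result on, which keeps the parsing in a single place instead of converting to a string and parsing again.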

kwalcock commented 5 years ago

@EgoLaparra, are we any closer on what needs to be delivered on this large run that needs to work overnight and get sent away? For the metadata files should I expect that there are some without matching text files? I need to double check, but it seemed that there were both texts without metadata and metadata without texts.

EgoLaparra commented 5 years ago

Yes, we are closer. I have changed the parser and EidosDocument so that the DCT can be handled in any format, even if it is wrong. I still need to run some tests to make sure that everything is working properly. And yes, the document collection on the FAO site has changed since I retrieved the PDFs, so this kind of thing can happen.

EgoLaparra commented 5 years ago

@kwalcock, I have created a pull request to kwalcock-timeTime with these changes.

MihaiSurdeanu commented 5 years ago

Thanks @EgoLaparra and @kwalcock! This integration is very important. Please prioritize this work.