clulab / reach

Reach Biomedical Information Extraction
Other
97 stars 39 forks source link

Problem processing relatively small NXML evaluation file #317

Closed hickst closed 8 years ago

hickst commented 8 years ago

I tried to process a relatively small evaluation file (twice) on Jenny and 6 hours later it had not finished (by comparison the first 99 evaluation files finished in 30 minutes). There is something odd about this file that we need to investigate and figure out what's going on with our system. I was using 64G of memory and 24 threads on Jenny.

The file is PMC4674766.nxml located on Jenny at: /net/kate/storage/work/hickst/Documents/reach/nxml.last1

hickst commented 8 years ago

This appears to be an Assembly issue. I set 'withAssembly = false' in the application.conf file and the file processed successfully in 3 minutes on my machine. On Jenny, it finished successfully in 2 minutes.

myedibleenso commented 8 years ago

Weird. I ran everything until output in a reasonable amount of time (<=5 min. on my laptop with 5G allocated). Are you sure that's the right file?

myedibleenso commented 8 years ago
import org.clulab.reach.PaperReader
import org.clulab.reach.darpa.OutputDegrader
import org.clulab.odin.Mention
import org.clulab.assembly.Assembler

def runPaper(f: File): (Seq[Mention], Assembler) = {
  val mentions = PaperReader.getMentionsFromPaper(f)
  val fm = OutputDegrader.prepareForOutput(mentions)
  val assembler = new Assembler(flattenedMentions)
  (fm, assembler)
}

val paper = new File("/path/to/problem/file/PMC4674766.nxml")
val (fm, assembler) = runPaper(paper)
myedibleenso commented 8 years ago

@hickst, I added a debugging line to see if/when assembly finishes (immediately after line 152 in ReachCLI). It seems to hang when hitting the outputter.

myedibleenso commented 8 years ago
featureFactory=org.clulab.processors.corenlp.chunker.ChunkingFeatureFactory
  0s: PMC4674766: starting reading
  164s: PMC4674766: finished reading
06:25:17.866 [ForkJoinPool-1-worker-7] DEBUG org.clulab.assembly.Assembler - Finding valid mentions...
06:25:17.887 [ForkJoinPool-1-worker-7] DEBUG org.clulab.assembly.Assembler - Applying sieves...
06:25:17.903 [ForkJoinPool-1-worker-7] DEBUG o.c.a.sieves.DeduplicationSieves -     applying trackMentions sieve...
06:25:18.046 [ForkJoinPool-1-worker-7] DEBUG o.c.assembly.sieves.PrecedenceSieves -     applying withinRbPrecedence sieve...
06:25:18.969 [ForkJoinPool-1-worker-7] DEBUG o.c.assembly.sieves.PrecedenceSieves -     applying reichenbachPrecedence sieve...
06:25:20.424 [ForkJoinPool-1-worker-7] DEBUG o.c.assembly.sieves.PrecedenceSieves -     applying betweenRbPrecedence sieve...
06:25:20.662 [ForkJoinPool-1-worker-7] DEBUG o.c.assembly.sieves.PrecedenceSieves -     applying featureBasedClassifier sieve...
06:28:18.445 [ForkJoinPool-1-worker-7] DEBUG org.clulab.assembly.Assembler - Building Map of Causal Predecessors...
06:28:18.479 [ForkJoinPool-1-worker-7] DEBUG org.clulab.assembly.Assembler - Building Map of Equivalence links...
Finished assembly for PMC4674766
  525s: PMC4674766: finished writing JSON to /Users/gus/repos/reach/nxml/bugs/output/debug
[success] Total time: 637 s, completed Aug 16, 2016 6:31:15 AM

PMC4674766-output.zip

wc -l on the output directory gives the following statistics:

   27917 nxml/bugs/output/debug/PMC4674766.uaz.entities.json
    6076 nxml/bugs/output/debug/PMC4674766.uaz.events.json
    7546 nxml/bugs/output/debug/PMC4674766.uaz.links.json
    5509 nxml/bugs/output/debug/PMC4674766.uaz.sentences.json
   47048 total
MihaiSurdeanu commented 8 years ago

Thanks guys. So this works without assembly? Please send that output to Hans so he can move forward. And then, please try to debug that outputter...

myedibleenso commented 8 years ago

I attached the output for this paper WITH assembly. It does take awhile after hitting the outputter, but the paper completes (at least on my laptop).

hickst commented 8 years ago

And you are using the locally compiled/Published versions of the libs from Reach?

+  "org.clulab" % "bioresources" % "1.1.15",
+  "org.clulab" %% "processors" % "5.9.4-SNAPSHOT",
+  "org.clulab" %% "processors" % "5.9.4-SNAPSHOT" classifier "models",
myedibleenso commented 8 years ago

No, I used whatever version is in the build.sbt on master. I wasn't aware we needed the SNAPSHOT version for the eval. What's different there?

myedibleenso commented 8 years ago

Looks like this:

  "org.clulab" % "bioresources" % "1.1.14",
  "org.clulab" %% "processors" % "5.9.3",
  "org.clulab" %% "processors" % "5.9.3" classifier "models",
hickst commented 8 years ago

I added a debugging line to see if/when assembly finishes (immediately after line 152 in ReachCLI). It seems to hang when hitting the outputter.

That's doesn't rule out Assembly since the outputter calls into Assembly code many times during output; using the output degrader, making the link frames, and getting the input features by participants.

hickst commented 8 years ago

@myedibleenso that's not even the current build file for Processors in master. Please update Processors and recompile locally, and point Reach to that locally compile SNAPSHOT.

myedibleenso commented 8 years ago

Ah, I guess what I meant was that the sieves completed and the maps for equiv and precedence links were constructed successfully. You're right, it could be one of the methods in the assembler written for the sake of output.

It looks like the only difference in dependencies is the newer bioresources.

hickst commented 8 years ago

It looks like the only difference in dependencies is the newer bioresources.

Yes, which version includes many important updates, changes, and feedback items from collaborators. These could make a big difference in extractions and, therefore in theory, a big difference in assembly and/or output.

myedibleenso commented 8 years ago

Hold your horses, cowboy, I'll test it again in a moment. Sorry, I just wasn't aware those versions were needed when you opened the issue. @MihaiSurdeanu merged my PR for logging, so it will hopefully make this easier to debug.

hickst commented 8 years ago

Does that logging include milli/or nano-second timing.?...I could really use that now as I just augmented the outputter with crude java Date timing.

Even that crude logging is showing that calls to getInputFeaturesByParticipants are taking as long as 4 minutes per call on the problem file.

myedibleenso commented 8 years ago

Even that crude logging is showing that calls to getInputFeaturesByParticipants are taking as long as 4 minutes per call on the problem file.

Alright, that's helpful. Will look into it.

Regarding the logger, you can log whatever message you want. I'm not sure the granularity of the log message timing, but you could always send your custom message to the logger.

hickst commented 8 years ago

Yes, but it's the consistent timing features that most logging libraries provide that would be most useful in our current situation. After all, I can output my own custom messages now with println -- it's the consistent timing across classes I don't have.

myedibleenso commented 8 years ago

Probably, but idk. You'd need to read the logback-classic documentation.

hickst commented 8 years ago

Here's a sample from the Java Date augmented output:

Tue Aug 16 08:50:35 MST 2016: Starting makeModels)
Tue Aug 16 08:50:38 MST 2016: Completed Output Degrader)
Tue Aug 16 08:50:38 MST 2016: Completed Sanitized Mentions)
Tue Aug 16 08:50:38 MST 2016: Completed Sentence Model)
Tue Aug 16 08:50:38 MST 2016: Completed EMaps)
Tue Aug 16 08:50:38 MST 2016: Completed Entity Model)
Tue Aug 16 08:50:38 MST 2016: Begin getInputFeatures)
Tue Aug 16 08:50:38 MST 2016: End getInputFeatures)
Tue Aug 16 08:50:38 MST 2016: Begin getInputFeatures)
Tue Aug 16 08:50:38 MST 2016: End getInputFeatures)
Tue Aug 16 08:50:38 MST 2016: Begin getInputFeatures)
Tue Aug 16 08:54:22 MST 2016: End getInputFeatures)

and just a few (quick) calls after that:

Tue Aug 16 08:54:22 MST 2016: Begin getInputFeatures)
Tue Aug 16 08:54:22 MST 2016: End getInputFeatures)
Tue Aug 16 08:54:22 MST 2016: Begin getInputFeatures)
Tue Aug 16 08:58:28 MST 2016: End getInputFeatures)

and so on....

myedibleenso commented 8 years ago

Does that mean you found a place where it appears to hang? Do you have any more details on what kind of mention (the event's label) is taking up to 4 minutes to churn?

I think I can probably speed it up by caching output along the way.

hickst commented 8 years ago

BTW - I note that the time that I ran this for 6 hours on Jenny, it never appeared to use more than 12G of memory, even though I had allocated 160G. This implies to me that it was not a memory bound problem. Could it be some kind of cycles in the data structures?

myedibleenso commented 8 years ago

Ha, I think I "fixed" cycles once. Sounds possible/likely. Would be nice to see which one it's hanging on.

I believe this is the line in FriesOutput where the magic happens, yes?

val inputFeatures = assemblyApi.map(_.getInputFeaturesByParticipants(mention))
hickst commented 8 years ago

yes, that's the line surrounded by the Begin/End messages shown above. I think it only appears to hang -- it looks like it's still chugging along; using high cpu and producing Begin/End getInputFeature messages every few minutes.

hickst commented 8 years ago

Of course, we can't rule out that the outputter is trapped in a cycle of calling that method based on some cycle in the mention sequence.

myedibleenso commented 8 years ago

Fixed it. Should be much faster on every paper now (remaining bottleneck is the feature extractor for the precedence classifier sieve) . Will open a PR soon...