dalab / web2text

Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18
MIT License

What is "fe"? #3

Closed ghost closed 6 years ago

ghost commented 6 years ago

I want to extract the CleanEval data. The README appears to explain how, with this snippet:

import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter

val data = Util.time{ CleanEval.dataset(fe) }

// Write block_features.csv and edge_features.csv
// Format of a row: page id, groundtruth label (1/0), features ...
CsvDatasetWriter.write(data, "./src/main/python/data")

// Print the names of the exported features in order
println("# Block features")
fe.blockExtractor.labels.foreach(println)
println("# Edge features")
fe.edgeExtractor.labels.foreach(println)

but I don't understand what "fe" is. Could you explain how to define "fe"?

ghost commented 6 years ago

I assumed it was a FeatureExtractor, so I tried running this code:

:require lib/jsoup-custom.jar 
:require lib/knowitall-cluewebextractor.jar
:require target/scala-2.10/boilerplate_2.10-2.0-SNAPSHOT.jar

import ch.ethz.dalab.web2text.cdom.CDOM
import ch.ethz.dalab.web2text.features.{FeatureExtractor, PageFeatures}
import ch.ethz.dalab.web2text.features.extractor._
import ch.ethz.dalab.web2text.alignment.Alignment
import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter

val unaryExtractor = DuplicateCountsExtractor + LeafBlockExtractor + AncestorExtractor(NodeBlockExtractor + TagExtractor(mode="node"), 1) + AncestorExtractor(NodeBlockExtractor, 2) + RootExtractor(NodeBlockExtractor) + TagExtractor(mode="leaf")
val pairwiseExtractor = TreeDistanceExtractor + BlockBreakExtractor + CommonAncestorExtractor(NodeBlockExtractor)
val extractor = FeatureExtractor(unaryExtractor, pairwiseExtractor)

val data = Util.time{ CleanEval.dataset(extractor) }

but I got this error:

error: missing or invalid dependency detected while loading class file 'PageFeatures.class'.
Could not access term breeze in package <root>,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'PageFeatures.class' was compiled against an incompatible version of <root>.
error: missing or invalid dependency detected while loading class file 'PageFeatures.class'.
Could not access term linalg in value breeze,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'PageFeatures.class' was compiled against an incompatible version of breeze.

What does this error mean?

tvogels commented 6 years ago

You are right: fe stands for FeatureExtractor. The error you are getting suggests that a dependency (Breeze) is missing. Could you please try adding it as described here: https://github.com/scalanlp/breeze/wiki/Installation
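
For reference, the README snippet with fe defined could look roughly like the sketch below. It only combines code already posted in this thread: the extractor composition is the one from the comment above, not necessarily the configuration used in the paper. It assumes the project jars and Breeze are already on the REPL classpath.

```scala
import ch.ethz.dalab.web2text.features.FeatureExtractor
import ch.ethz.dalab.web2text.features.extractor._
import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter

// Block (unary) features, composed as in the comment above
val unaryExtractor =
  DuplicateCountsExtractor +
  LeafBlockExtractor +
  AncestorExtractor(NodeBlockExtractor + TagExtractor(mode="node"), 1) +
  AncestorExtractor(NodeBlockExtractor, 2) +
  RootExtractor(NodeBlockExtractor) +
  TagExtractor(mode="leaf")

// Edge (pairwise) features, composed as in the comment above
val pairwiseExtractor =
  TreeDistanceExtractor +
  BlockBreakExtractor +
  CommonAncestorExtractor(NodeBlockExtractor)

// "fe" in the README is a FeatureExtractor built from the two parts
val fe = FeatureExtractor(unaryExtractor, pairwiseExtractor)

// From here on, this is the README snippet unchanged
val data = Util.time{ CleanEval.dataset(fe) }
CsvDatasetWriter.write(data, "./src/main/python/data")

println("# Block features")
fe.blockExtractor.labels.foreach(println)
println("# Edge features")
fe.edgeExtractor.labels.foreach(println)
```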

ghost commented 6 years ago

@tvogels I ran this in sbt before running my code:

$ sbt
set scalaVersion := "2.10.4" // or 2.11.5
set libraryDependencies += "org.scalanlp" %% "breeze" % "0.12"
set libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.12"
set resolvers += "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
console

Then I reran the same code, and it finally works!

Thank you so much!
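
Note for later readers: the `set` commands above only last for the current sbt session. Assuming the project is built with sbt, the same settings could probably be made permanent in a build.sbt file. This is a sketch using the versions from this thread; the repository's actual build definition may differ.

```scala
// build.sbt (sketch): same settings as the interactive `set` commands above
scalaVersion := "2.10.4"

// Breeze provides breeze.linalg, which PageFeatures.class depends on
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.12",
  "org.scalanlp" %% "breeze-viz" % "0.12"
)

resolvers += "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
```

With these settings in place, running `sbt console` should start a REPL with Breeze already on the classpath.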