VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

buildCrawler error #353

Closed mnavasloro closed 1 year ago

mnavasloro commented 1 year ago

I get the following error when trying to build a model from the sample training data (both locally and with Docker):

docker run -v \config:/config -v /data:/data -p 8080:8080 vidanyu/ache buildModel -c /config/sample_config/stopwords.txt -t /config/sample_training_data -o /config/sample_model

The problem seems to be the generated ARFF file.

-------------------
ACHE Crawler 0.15.0
-------------------

Preparing training data...
Positive samples: 169
Negative samples: 443
Featurizing positive samples...
Featurizing negative samples...
Selecting best features based on page frequency...
Training target classifier model...
Learning algorithm: SVM
Writting temporarily data file to: /config/sample_training_data/smile_input.arff
Failed to build model.

java.text.ParseException: Invalid attribute type or invalid enumeration
        at smile.data.parser.ArffParser.parseAttribute(ArffParser.java:280)
        at smile.data.parser.ArffParser.readHeader(ArffParser.java:210)
        at smile.data.parser.ArffParser.parse(ArffParser.java:401)
        at achecrawler.target.classifier.SmileTargetClassifierBuilder.trainModel(SmileTargetClassifierBuilder.java:40)
        at achecrawler.target.classifier.TargetClassifierBuilder.train(TargetClassifierBuilder.java:110)
        at achecrawler.Main$BuildModel.run(Main.java:182)
        at achecrawler.Main.main(Main.java:59)
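
Since the exception is raised while the ARFF header is being read (`ArffParser.readHeader` calling `parseAttribute`), one way to narrow it down is to scan the temporary file for `@attribute` lines whose type looks malformed. The following is only a rough sketch: the list of type keywords comes from the general ARFF format and is an assumption, not Smile's exact grammar, and `find_suspect_attributes` is a hypothetical helper, not part of ACHE.

```python
import re
import sys

# Type keywords from the general ARFF format (assumption: the parser in use
# may accept a slightly different set).
KNOWN_TYPES = {"numeric", "real", "integer", "string", "date", "relational"}

# "@attribute <name> <type>", where the name may be quoted.
ATTR_RE = re.compile(r"^@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s*(.*)$", re.IGNORECASE)


def find_suspect_attributes(lines):
    """Return (line_number, text) for header attributes with an unrecognized type."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section starts
        if not line.lower().startswith("@attribute"):
            continue
        match = ATTR_RE.match(line)
        type_part = match.group(2).strip() if match else ""
        if type_part.startswith("{") and type_part.endswith("}"):
            continue  # nominal attribute with a closed enumeration
        first = type_part.split()[0].lower() if type_part else ""
        if first not in KNOWN_TYPES:
            suspects.append((number, line))
    return suspects


if __name__ == "__main__":
    # Usage: python check_arff.py path/to/smile_input.arff
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, text in find_suspect_attributes(handle):
            print(f"line {number}: {text}")
```

Running it against the `smile_input.arff` path printed in the log above lists the header lines worth inspecting by hand, for example attribute names containing unquoted spaces or braces.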

mnavasloro commented 1 year ago

I managed to solve the issue. I also added a new library for multi-language detection (so you can define a list of target languages) and added the KNN algorithm as an option. The fork is available here: https://github.com/mnavasloro/ache-multilingual