dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

command line hangs and log4j error #4

Closed ramkikannan closed 8 years ago

ramkikannan commented 9 years ago

Environment: Mac OS X, Oracle Java version "1.8.0_25".

Scenario 1: As described at https://github.com/AKSW/Palmetto/wiki/How-Palmetto-can-be-used, I downloaded the jar file from the link and ran the following command:

java -jar palmetto-0.1.0-jar-with-dependencies.jar ~/Documents/MATLAB/xdata/Wikipedia_bd/wikipedia_bd UCI enron_nmf

2015-10-01 08:58:15,008 INFO [org.aksw.palmetto.Palmetto] - <Read 20 from file.>

It uses around 100% CPU and I do not see any further output.

Scenario 2: I exported the Eclipse project as a runnable jar. When I run java -jar Palmetto.jar from the command line, I get the following warnings and it hangs there:

log4j:WARN No appenders could be found for logger (org.aksw.palmetto.Palmetto).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

This problem does not occur when I run inside the command line. Here is the topics file enron_nmf. I could not attach it as a file because GitHub is throwing errors.


power electricity california plant energy billion states prices percent power_plant utility utilities generator official californias megawatt cost month blackout price
game allowed defense yard rank passing fantasy point rushing matchup against hes tough start team look games guy carries opposing
final schedule sc_id mkt_type trans_date trans_type found preferred epmi deal_no engy pnt_of_intrc purch_sale sched_type trading_sc np15 sp15 engy_type interchg_id tie_point
energy market business skilling company gas trading percent companies stock natural_gas houston oil quarter corp industry customer service analyst profit
pst columbia mid avista aquila epme morgan bpa idacorpene mieco detm emmt aet hafslund eesi engage montana sncl psc border
hotel travel miles offer roundtrip bonus special city free fares deal resort book visit save click houston rates sheraton night
product experience marketing business management degree manager sales skill team development location title mba company partner customer director strong program
market ferc california price prices iso order cost contract commission demand generator price_cap capacity customer generation transmission load refund report
committee senate document contempt dunn fcc sen corp price bill agreement mirant panel investigation against carrier generator committees full subpoena
customer bill edison pge utility rate plan puc rates cost utilities bankruptcy bond pay consumer commission proposal debt gas_electric billion
page court worker union labor employees law date employer report employment federal rules labour act agreement national employee claim circuit
database error engine occurred attempting borland initialize operation closed perform disk space hourahead insufficient sql general read date hour file
power project india government dpc dabhol indian electricity mseb board foreign corp dabhol_power lender energy company billion gas maharashtra investment
stock financial investor fastow partnership shares analyst chief cent market transaction credit billion sec officer wall_street shareholder rating concern confidence
texas team play game longhorn season updated against brown player yard top start big football free injury coach games william
company financial business lay deal bank executive asset investor employees chairman credit companies capital management equity firm month officer chief
dynegy stock deal billion earning shares merger company trading investor companies financial debt cent market credit corp rating share price
davis stock power energy governor consultant maviglio calpine public monday contract california july refund company electricity utility official billion interest
energy davis california bush plan governor president bill republican blackout federal ferc democrat price_cap word problem political summer conservation energy_crisis
firm fund round ventures partner technology services capital investor venturewire investment funding software network group internet provider series raised product


ramkikannan commented 9 years ago

Please ignore this issue for now. I redownloaded the Wikipedia index and rebuilt from master, and the command line appears to work. Please hold off on working on this problem for the time being.

MichaelRoeder commented 9 years ago

Hi,

This is not a bug but a missing feature.

I tested the first line of your file and got a result, although it took some time:

0 -0,93258 [power, electricity, california, plant, energy, billion, states, prices, percent, power_plant, utility, utilities, generator, official, californias, megawatt, cost, month, blackout, price]

Internally, Palmetto generates the probabilities for the occurrence of every subset of the given word set. This is needed for most coherences. Note that the effort of this step doubles with every new word added to the word set. Unfortunately, this step is performed even when it is not needed, as for a coherence like UCI that relies only on the probabilities of words and word pairs.
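To illustrate the growth described above, here is a small Java sketch. It is not Palmetto code, and the class and method names are hypothetical. It compares the number of subsets of an n-word set (counting the empty set) with the number of single words and unordered word pairs that a coherence like UCI would need.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Palmetto: compares the two workloads.
final class SubsetGrowth {

    // Number of subsets of an n-word set, including the empty set: 2^n.
    static BigInteger allSubsets(int n) {
        return BigInteger.ONE.shiftLeft(n);
    }

    // Number of single words plus unordered word pairs: n + n*(n-1)/2.
    static long wordsAndPairs(int n) {
        return n + (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 15, 20}) {
            System.out.println(n + " words: " + allSubsets(n) + " subsets, "
                    + wordsAndPairs(n) + " words and pairs");
        }
    }
}
```

For the 20-word topics in this issue, that is 1048576 subsets versus 210 words and pairs.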

I think this is the reason why it takes so long to calculate results for your examples. Originally, it was planned that Palmetto would detect whether it needs to generate all subsets or just some of them. But I never implemented this part because it was not needed as long as the word sets contain fewer than 15 words. Thus, I might improve the performance of the UCI coherence in the future (if I can find the time to do it).
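As a rough sketch of why only words and word pairs are needed for UCI: the UCI coherence is commonly described as the mean pointwise mutual information over all unordered word pairs of a topic. The Java code below assumes that definition. The class name, the map key format ("w1 w2") and the smoothing constant eps are assumptions made for illustration and do not reflect Palmetto's actual implementation.

```java
import java.util.Map;

// Hypothetical sketch, not Palmetto code.
final class UciSketch {

    // Mean PMI over all unordered word pairs:
    //   log((P(wi, wj) + eps) / (P(wi) * P(wj)))
    // pWord maps a word to its probability; pPair maps "wi wj" (in the
    // order the words appear in the array) to the joint probability.
    static double uci(String[] words, Map<String, Double> pWord,
                      Map<String, Double> pPair, double eps) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                // Pairs that never co-occur are treated as probability 0.
                double joint = pPair.getOrDefault(words[i] + " " + words[j], 0.0);
                double pi = pWord.get(words[i]);
                double pj = pWord.get(words[j]);
                sum += Math.log((joint + eps) / (pi * pj));
                pairs++;
            }
        }
        // A topic with fewer than two words has no pairs to average.
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    public static void main(String[] args) {
        // Made-up probabilities, for illustration only.
        Map<String, Double> pWord = Map.of("power", 0.02, "energy", 0.03);
        Map<String, Double> pPair = Map.of("power energy", 0.005);
        System.out.println(uci(new String[] {"power", "energy"}, pWord, pPair, 1e-12));
    }
}
```

The two nested loops visit n*(n-1)/2 pairs, so the work grows quadratically with the number of words rather than doubling with each added word.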

some additional hints:

ramkikannan commented 8 years ago

Which stemmer (Snowball or something else) or lemmatization algorithm does Palmetto use to create the index? I will use the same one to test my algorithm.

MichaelRoeder commented 8 years ago

The Palmetto index was created using MorphAdorner. I used version 1.0, which might not be available anymore. However, even a different lemmatizer, e.g., Stanford, is better than no lemmatizer for creating the word sets, since the index used for the evaluation was created using lemmatization.