StanfordHCI / termite

(development moved to new repos)
BSD 3-Clause "New" or "Revised" License

Problems generating example #2

Closed (cpsievert closed 11 years ago)

cpsievert commented 11 years ago

I'm running into the following error:

$ ./execute.py example.cfg
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    model_path = output/example-project/topic-model (mallet)
    data_path = output/example-project
    num_topics = 3
    number_of_seriated_terms = 10
--------------------------------------------------------------------------------
Current time = Wed Jun  5 11:01:52 2013
--------------------------------------------------------------------------------
Tokenizing source corpus...
    corpus_path = corpus/toy.txt (file)
    data_path = output/example-project
    tokenziation = [A-Za-z_]+
Connecting to data...
Reading from disk...
Traceback (most recent call last):
  File "./execute.py", line 164, in <module>
    main()
  File "./execute.py", line 161, in main
    Execute( logging_level ).execute( corpus_format, corpus_path, model_library, model_path, data_path, num_topics, number_of_seriated_terms )
  File "./execute.py", line 70, in execute
    Tokenize( self.logger.level ).execute( corpus_format, corpus_path, data_path )
  File "/cygdrive/u/termite/pipeline/tokenize.py", line 55, in execute
    self.documents.read()
  File "/cygdrive/u/termite/pipeline/api_utils.py", line 26, in read
    docID, docContent = line.split( '\t' )
ValueError: need more than 1 value to unpack

This is what happens when the "documents" in toy.txt are separated by newlines. When I separate them with tabs instead, I get a different error, ValueError: too many values to unpack. Here is the toy file:

1115 W Franklin
Bessy the Cow
Big Farm Way
The cow jumped over the moon
Look at me
I'm some text
What will be next?
Who knows
Look, over there

Any help is much appreciated!

jcchuang commented 11 years ago

Hi Carson,

At the moment, we only accept a single file as input. Each line of the file should contain two tab-separated fields: the first is a document ID and the second is the content of the document. The toy file you have above seems to contain only the document content and no identifier. Try

01 [tab] 1115 W Franklin
02 [tab] Bessy the Cow
03 [tab] Big Farm Way
04 [tab] The cow jumped over the moon
05 [tab] Look at me
06 [tab] I'm some text
07 [tab] What will be next?
08 [tab] Who knows
09 [tab] Look, over there

in which case, your passage of text will be treated as 9 separate documents. Hope that helps!
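
For anyone who hits the same error, here is a minimal sketch of one way to turn a line-separated file into the two-column layout described above. The helper names add_doc_ids and parse_line are hypothetical and not part of termite. parse_line only imitates the line.split( '\t' ) unpacking shown in the traceback, so treat it as an approximation of what the reader expects rather than the actual implementation.

```python
def add_doc_ids(lines, width=2):
    """Prefix each non-empty line with a zero-padded ID and a tab.

    Hypothetical helper, not part of termite. Tabs inside the content are
    replaced with spaces so each output line holds exactly one tab.
    """
    docs = [line.strip() for line in lines if line.strip()]
    return [
        str(i).zfill(width) + "\t" + text.replace("\t", " ")
        for i, text in enumerate(docs, start=1)
    ]


def parse_line(line):
    """Split one corpus line into (doc_id, doc_content).

    Imitates the unpacking in the traceback: a line with no tab, or with
    more than one tab, raises ValueError.
    """
    doc_id, doc_content = line.rstrip("\n").split("\t")
    return doc_id, doc_content


if __name__ == "__main__":
    toy = ["1115 W Franklin\n", "Bessy the Cow\n", "Big Farm Way\n"]
    for row in add_doc_ids(toy):
        print(parse_line(row))
```

Replacing tabs inside the content is a design choice for this sketch: it keeps exactly one tab per line, which avoids the "too many values to unpack" case reported above. To convert a real file, read its lines, pass them to add_doc_ids, and write the result back out one row per line.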

cpsievert commented 11 years ago

Thanks for clearing that up! It might be worth explaining that small detail in the README file (or providing an example corpus).

yasminlucero commented 11 years ago

I second this. It would be most helpful if you included an example corpus in the repo, even if just to show the file structure.