Closed by cpsievert 11 years ago
Hi Carson,
At the moment, we only accept a single file as input. The file should contain two columns per line. The first field is a document ID and the second field is the content of the document. The toy file you have above seems to contain only the document content and no identifier. Try
01 [tab] 1115 W Franklin
02 [tab] Bessy the Cow
03 [tab] Big Farm Way
04 [tab] The cow jumped over the moon
05 [tab] Look at me
06 [tab] I'm some text
07 [tab] What will be next?
08 [tab] Who knows
09 [tab] Look, over there
in which case, your passage of text will be treated as 9 separate documents. Hope that helps!
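For anyone landing here later, here is a minimal sketch of producing a file in that layout. It assumes only what is described above (one document per line, an ID, a tab, then the content); the `write_corpus` helper and the temporary `toy.txt` path are hypothetical names for illustration and are not part of the tool.

```python
import os
import tempfile


def write_corpus(path, documents):
    """Write one document per line as '<id><TAB><content>' (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for number, text in enumerate(documents, start=1):
            # Collapse tabs and newlines inside the text so each line
            # keeps exactly two tab-separated columns.
            cleaned = " ".join(text.split())
            handle.write(f"{number:02d}\t{cleaned}\n")


documents = [
    "1115 W Franklin",
    "Bessy the Cow",
    "Big Farm Way",
    "The cow jumped over the moon",
    "Look at me",
    "I'm some text",
    "What will be next?",
    "Who knows",
    "Look, over there",
]

# Assumed output location; point this wherever your input file should live.
corpus_path = os.path.join(tempfile.gettempdir(), "toy.txt")
write_corpus(corpus_path, documents)
```

The resulting file has nine lines, each with a single real tab character between the ID and the text.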
Thanks for clearing that up! It might be worth explaining that small detail in the README file (or provide an example corpus).
I second that. It would be most helpful if you included an example corpus in the repo, even if just to show the file structure.
I'm running into the following error:
This is what happens when the "documents" in toy.txt are line separated. I get the error
ValueError: too many values to unpack
when I try to separate by tabs. Here is the toy file:
Any help is much appreciated!
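A quick way to check a file before feeding it in: in Python, unpacking a sequence into fewer names than it has items raises "ValueError: too many values to unpack", so one plausible cause (an assumption, since the parsing code is not shown in this thread) is a line that contains more than one tab. The sketch below uses a hypothetical `find_malformed_lines` helper to report every line that does not split into exactly two fields.

```python
def find_malformed_lines(lines):
    """Return (line_number, field_count) for lines without exactly two tab-separated fields."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((line_number, len(fields)))
    return problems


# Illustrative input: line 2 has an extra tab, line 3 has no ID column.
sample = ["01\tLook at me\n", "02\tI'm some\ttext\n", "Who knows\n"]
print(find_malformed_lines(sample))  # prints [(2, 3), (3, 1)]
```

To check a real file, pass an open file handle instead of the list, for example `find_malformed_lines(open("toy.txt", encoding="utf-8"))`.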