Open nliu86 opened 6 years ago
@nliu86 Thanks for the suggestion. I have thought about the second representation as well, which probably does not take a lot of effort to implement. Feel free to implement that and submit a pull request if you're interested.
Can we optimize the input data file format? Currently it looks like this:

Input file format:
roger federer loses (tab) venus williams wins (tab) world series ended
i love cats (tab) funny lolcat links (tab) how to be a petsitter
The same document likely appears for many users, so repeating its full text each time is not very efficient. Can we change to the following format?

input file 1:
doc_1 (tab) doc_2 (tab) doc_3
doc_4 (tab) doc_5 (tab) doc_6
input file 2:
doc_1 (tab) roger federer loses
doc_2 (tab) venus williams wins
doc_3 (tab) world series ended
doc_4 (tab) i love cats
doc_5 (tab) funny lolcat links
doc_6 (tab) how to be a petsitter
Thanks!
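
For anyone who wants to try the proposed layout, here is a rough sketch of a converter from the current single-file format to the two-file format. It is not part of the project: the function name `split_corpus` and the `doc_N` id scheme are hypothetical, and it assumes each input line holds one user's documents separated by tabs, with identical text treated as the same document.

```python
def split_corpus(lines):
    """Convert tab-separated document lines into the proposed two-file layout.

    Hypothetical helper, not part of the project. Returns (user_rows, doc_table):
    user_rows has one list of doc ids per non-empty input line (input file 1),
    doc_table has one (doc_id, text) pair per distinct document (input file 2).
    """
    doc_ids = {}    # document text -> assigned id such as "doc_1"
    user_rows = []  # doc ids for each input line, in input order
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        row = []
        for text in line.split("\t"):
            # Assign a new id only the first time a document text is seen.
            if text not in doc_ids:
                doc_ids[text] = f"doc_{len(doc_ids) + 1}"
            row.append(doc_ids[text])
        user_rows.append(row)
    # Dicts keep insertion order, so ids come out in first-seen order.
    doc_table = [(doc_id, text) for text, doc_id in doc_ids.items()]
    return user_rows, doc_table


if __name__ == "__main__":
    sample = [
        "roger federer loses\tvenus williams wins\tworld series ended\n",
        "i love cats\tfunny lolcat links\thow to be a petsitter\n",
    ]
    rows, table = split_corpus(sample)
    for row in rows:                 # contents of input file 1
        print("\t".join(row))
    for doc_id, text in table:       # contents of input file 2
        print(f"{doc_id}\t{text}")
```

Each distinct document is stored once in the second file, so the saving grows with how often documents repeat across users. A reader of this format would need to load the second file into a lookup table before processing the first.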