fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

Setting `inputFormat` to TEXT or SPARSE_TEXT doesn't write to file (Hadoop RI) #55

Closed by shulhi 10 years ago

shulhi commented 10 years ago

There are two bugs when running the Hadoop RI.

  1. When the supplied --inputFormat is either TEXT or SPARSE_TEXT, the buffer is never flushed, not even on close(), so nothing is written to the file. As a workaround, I flush the buffer manually every time the write method is called. The buffer is also not flushed when the file header is written (in writeEmptyHeader()).
  2. There is a logic error when iterating over the occurrences of each word, which causes the last occurrence not to be printed. For example, if my document is you know nothing Jon Snow, it writes the vector for every word except the last one, Snow. The missing word is not necessarily the last word in the sentence, since that depends on how the words were sorted during the mapper-reducer phase, but one of the words will always be missing from the output file.
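
To make the two symptoms concrete, here is a minimal standalone sketch in plain Java. It does not use the S-Space or Hadoop classes. The class name `HadoopRiBugSketch`, both method names, and the header string are hypothetical. The grouping loop shows only one common way a "last item dropped" bug can arise, so the actual code in the Hadoop RI may look different.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the S-Space implementation.
public class HadoopRiBugSketch {

    // Symptom 1: text written to a BufferedWriter stays in its internal
    // buffer until flush() (or close()) is called on that writer.
    public static String writeHeader(boolean flush) throws IOException {
        StringWriter sink = new StringWriter();   // stands in for the output file
        BufferedWriter out = new BufferedWriter(sink);
        out.write("3 2\n");                       // made-up header line
        if (flush) {
            out.flush();                          // push buffered text to the sink
        }
        return sink.toString();
    }

    // Symptom 2: when sorted items are grouped and a group is emitted only
    // when the key changes, the final group must be emitted after the loop.
    public static List<String> groupSortedWords(List<String> sortedWords,
                                                boolean emitLast) {
        List<String> emitted = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String word : sortedWords) {
            if (current != null && !word.equals(current)) {
                emitted.add(current + ":" + count); // key changed: emit previous group
                count = 0;
            }
            current = word;
            count++;
        }
        if (emitLast && current != null) {
            emitted.add(current + ":" + count);     // without this, the last group is lost
        }
        return emitted;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unflushed length: " + writeHeader(false).length());
        System.out.println("flushed length:   " + writeHeader(true).length());

        // Words in sorted order, as they might arrive after the shuffle phase.
        List<String> words = Arrays.asList("Jon", "Snow", "know", "nothing", "you");
        System.out.println("buggy: " + groupSortedWords(words, false));
        System.out.println("fixed: " + groupSortedWords(words, true));
    }
}
```

In this sketch the dropped word is whichever one sorts last, which matches the observation that the missing word depends on the sort order and not on its position in the sentence.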

Anyway, thanks for the great package!

davidjurgens commented 10 years ago

Wow, this is great to know, and thank you for the fix. I am a bit surprised anyone is using the Hadoop code, actually, so it's nice to know that it's still working after all the Hadoop API updates since we wrote it.

shulhi commented 10 years ago

It is still working fine, although there are a few warnings about not using the latest API. I'll try to update it to the latest API when I have the time. Thanks again.