Closed MichaelAquilina closed 10 years ago
A lot of the current noise stems from text outside the main article being included in the tokenisation process. These two Python libraries extract the main page text, which is very useful for your use case.
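To illustrate the idea of stripping non-article text before tokenising, here is a rough sketch that uses only the Python standard library as a stand-in for a dedicated extraction library. The names `SKIP_TAGS`, `ArticleTextParser` and `extract_tokens` are hypothetical, and the list of tags treated as boilerplate is an assumption, not something taken from this project or from either library.

```python
import re
from html.parser import HTMLParser

# Assumption: content inside these tags is not part of the main article.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ArticleTextParser(HTMLParser):
    """Collects text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_tokens(html):
    """Return lowercase alphanumeric tokens from the retained text."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z0-9]+", text.lower())
```

A purpose-built extraction library would likely handle far more cases (comment sections, sidebars marked up with plain `div` elements, and so on); this sketch only shows where such a step would sit relative to tokenisation.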