dbpedia / fact-extractor

Fact Extraction from Wikipedia Text

Provide exhaustive stop word lists for all the supported languages #13

Closed: marfox closed this issue 9 years ago

marfox commented 9 years ago

This script produces a frequency dictionary of words given a text corpus, skipping non-lexical frequent words (AKA stop words). Currently, the stop word list is only implemented for Italian.

Add exhaustive stop word lists for all the languages we want to support, which currently means the languages covered by TreeTagger.
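
For illustration, a stop-word-aware frequency dictionary could be sketched roughly as below. This is not the project's actual script: the function name `word_frequencies`, the tokenization regex, and the tiny Italian stop word sample are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny Italian sample; a real list would be exhaustive.
ITALIAN_STOP_WORDS = {"il", "la", "di", "e", "che", "un", "una", "in"}


def word_frequencies(text, stop_words):
    """Count lowercased word tokens in `text`, skipping tokens found in `stop_words`."""
    # Assumption: a simple \w+ tokenizer is good enough for the illustration.
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return Counter(token for token in tokens if token not in stop_words)


frequencies = word_frequencies("Il gatto e il cane: il gatto dorme.", ITALIAN_STOP_WORDS)
print(frequencies.most_common(2))
```

Supporting another language would then only require passing a different stop word set, which is why exhaustive per-language lists matter here.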

Warun26 commented 9 years ago

I am new to open source development and thought I would try my hand at this task. Here are my questions:

  1. I am not participating in GSoC, so I was wondering whether I can still take this task up.
  2. RanksNL (http://www.ranks.nl/stopwords) has stop word lists for most of the languages specified in TreeTagger; the exceptions are Swahili, Latin and Estonian. Can I use these lists?
  3. bag_of_words.py stores the stop words as a single Python list. Would it be better to keep them in a text file and read them at run time, in case many modules need stop word elimination before further processing?

Please let me know how I can proceed with the task.
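
The idea in question 3 could look something like the sketch below. It assumes one UTF-8 text file per language, named `<language>.txt`, with one stop word per line; that layout and the helper name `load_stop_words` are hypothetical and not something the repository defines.

```python
import io
import os
import tempfile


def load_stop_words(directory, language):
    """Read `<directory>/<language>.txt` (one word per line, UTF-8) into a set."""
    path = os.path.join(directory, language + ".txt")
    with io.open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace, lowercase, and skip blank lines.
        return {line.strip().lower() for line in handle if line.strip()}


# Demo: a throwaway directory stands in for a hypothetical stop word folder.
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "it.txt"), "w", encoding="utf-8") as handle:
    handle.write(u"il\nla\n\ndi\n")

italian_stop_words = load_stop_words(demo_dir, "it")
print(sorted(italian_stop_words))
```

Keeping the lists in plain files means any module can share them, and adding a language does not require touching Python code.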

marfox commented 9 years ago

Hi there!

On 3/17/15 8:17 AM, Warun26 wrote:

> I am new to open source development and thought I would try my hand at this task. Here are my questions:
>
>   1. I am not participating in GSoC, so I was wondering whether I can still take this task up.

Of course you can! We are always looking for prospective contributors to join our dev team at DBpedia!

>   2. RanksNL (http://www.ranks.nl/stopwords) has stop word lists for most of the languages specified in TreeTagger; the exceptions are Swahili, Latin and Estonian.

Sounds good.

>   3. bag_of_words.py stores the stop words as a single Python list. Would it be better to keep them in a text file and read them at run time, in case many modules need stop word elimination before further processing?

Sure, good idea.

> Please let me know how I can proceed with the task.

Feel free to submit a pull request!

Cheers,

Marco

Warun26 commented 9 years ago

Thank you!