term frequency endpoint

bartaelterman commented 9 years ago

For the frontend, we need a REST GET endpoint that accepts the parameters start_date:<date>, end_date:<date> and max_words:<integer>. It should return the <max_words> words with the highest term frequency*. Words are taken from each article's cleaned_text attribute and converted to lowercase, and stopwords are removed**. Only articles with published_date >= start_date, published_date <= end_date (both bounds inclusive) and epu_score > -0.15 are considered.

Note: an article's published_at attribute is a datetime object, while the endpoint is queried with dates. The incoming start_date and end_date should be compared to the date part of the article's published_at value.
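A minimal sketch of that comparison, assuming published_at is a naive Python datetime and the two parameters have already been parsed into date objects. The helper name in_date_range is hypothetical and not taken from the codebase:

```python
from datetime import date, datetime


def in_date_range(published_at, start_date, end_date):
    """Return True if the date part of published_at lies in [start_date, end_date]."""
    # Drop the time component so both bounds are inclusive for the whole day.
    return start_date <= published_at.date() <= end_date


start = date(2015, 6, 1)
end = date(2015, 6, 17)

late_on_last_day = in_date_range(datetime(2015, 6, 17, 23, 59), start, end)
day_after = in_date_range(datetime(2015, 6, 18, 0, 0), start, end)
```

Comparing the raw datetime against end_date at midnight would drop every article published later on the last day, which is why the time component is stripped first.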

*The term frequency is defined as the actual number of times that a word occurs in a given text. (I assume a Counter could be used for this.) As mentioned before, convert words to lowercase before counting.

**Stopwords are words listed in a file called stopwords.txt.

The result of this endpoint should look like this:

{
    "records": [
        {"term": "word1", "size": 42},
        {"term": "word2", "size": 63},
        {"term": "word3", "size": 2},
        ...
    ]
}

Example

I have two articles (I'll show them in JSON here):

{
    "published_at": "2015-06-15 10:00:00",
    "cleaned_text": "This is an article about the evolution of this project",
    "epu_index": 0.4,
    ...
}

and

{
    "published_at": "2015-06-16 14:42:00",
    "cleaned_text": "New book published about the evolution of nematodes this week",
    "epu_index": 5,
    ...
}

and there is a stopwords.txt file containing the following words:

of
the
this

When a GET request comes in with start_date: "2015-06-01", end_date: "2015-06-17" and max_words: 2, the output should be:

{
    "records": [
        {"term": "evolution", "size": 2},
        {"term": "about", "size": 2}
    ]
}
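The counting step in this example can be sketched with a Counter, as suggested above. This is only an illustration: the function name top_terms is hypothetical, the stopwords are passed in as a set rather than read from stopwords.txt, and date and score filtering are assumed to have happened already. It relies on cleaned_text containing only words, so splitting on whitespace is enough. The output uses the "records" / "term" / "size" shape from the response format described earlier in this issue.

```python
from collections import Counter


def top_terms(texts, stopwords, max_words):
    """Count lowercase, non-stopword terms across texts and return the top max_words."""
    counts = Counter()
    for text in texts:
        # Lowercase first, then split on whitespace (cleaned text has no punctuation).
        words = text.lower().split()
        counts.update(word for word in words if word not in stopwords)
    return {
        "records": [
            {"term": term, "size": size}
            for term, size in counts.most_common(max_words)
        ]
    }


stopwords = {"of", "the", "this"}
texts = [
    "This is an article about the evolution of this project",
    "New book published about the evolution of nematodes this week",
]
result = top_terms(texts, stopwords, 2)
```

For the two articles above, the records contain "about" and "evolution", each with a count of 2. The issue does not say how ties should be ordered, so the order of equally frequent terms should not be relied upon.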

peterdesmet commented 9 years ago

Yes, from what I remember: only "positive words", i.e. words from positively ranked articles.

Please use start_date and end_date as parameters in the request, to be consistent with other APIs.

bartaelterman commented 9 years ago

I assume this means you want the parameters to be dates rather than datetimes.

bartaelterman commented 9 years ago

@peterdesmet should end_date be included or excluded?

peterdesmet commented 9 years ago

Indeed, I prefer dates over datetimes: a day is our smallest unit, and that is also how the current API is implemented: https://epu-index.herokuapp.com/api/epu/

Just like the other APIs, both dates should be inclusive: https://epu-index.herokuapp.com/api/epu/?format=json&start=2013-01-01&end=2013-01-02 returns 2 records.

peterdesmet commented 9 years ago

Can you update the issue body?

bartaelterman commented 9 years ago

Update: the words should not come from the article's text but from the cleaned text, an additional attribute that contains only the words from the text (so no punctuation marks). Words should be converted to lowercase before counting their occurrences.

I'll update the body of this issue.

bartaelterman commented 9 years ago

@niconoe I added a file stopwords.txt to the repository. Feel free to move it to wherever you think it fits better.

bartaelterman commented 9 years ago

Manually tested the endpoint and it looks OK, but some tests still need to be implemented.

niconoe commented 9 years ago

Proper testing in place!

Datafable / epu-index

term frequency endpoint #8
