Closed bartaelterman closed 9 years ago
Yes, from what I remember: only "positive words", i.e. words from positively ranked articles.
Please use `start_date` and `end_date` as parameters in the request, to be consistent with other APIs.
I assume this means you want the parameters to be dates rather than datetimes.
@peterdesmet should `end_date` be included or excluded?
Indeed, I prefer dates over datetimes: a day is our smallest unit, and that is also how the current API is implemented: https://epu-index.herokuapp.com/api/epu/
Just like the other APIs, both dates should be inclusive: https://epu-index.herokuapp.com/api/epu/?format=json&start=2013-01-01&end=2013-01-02 returns 2 records.
Can you update the issue body?
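A minimal sketch of the inclusive comparison, assuming an article's publication moment is available as a Python `datetime` (the issue body calls this attribute `published_at`). The helper name `in_date_range` is hypothetical and not part of the existing code base:

```python
from datetime import date, datetime


def in_date_range(published_at: datetime, start_date: date, end_date: date) -> bool:
    """Return True if the date part of published_at lies in [start_date, end_date].

    Both bounds are inclusive, so an article published at any time on
    end_date is still matched.
    """
    return start_date <= published_at.date() <= end_date
```

Comparing on `published_at.date()` rather than on the full datetime is what keeps articles published late on the end date inside the range.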
Update: the words should not come from the article's text, but from `cleaned text`, which will be an additional parameter that will only contain the words (so no punctuation marks) from the text. Words should be converted to lowercase before counting their occurrences.
I'll update the body of this issue.
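The counting step could look roughly like the sketch below. It assumes the cleaned text is a whitespace-separated string of words, which the thread does not state explicitly, and the function name `top_words` is hypothetical:

```python
from collections import Counter


def top_words(cleaned_texts, max_words):
    """Count lowercase word occurrences across all texts and return
    the max_words most frequent (word, count) pairs."""
    counts = Counter()
    for text in cleaned_texts:
        # Lowercase first so "Crisis" and "crisis" count as the same word.
        counts.update(text.lower().split())
    return counts.most_common(max_words)
```

For example, `top_words(["Crisis crisis economy", "economy Crisis"], 1)` returns `[("crisis", 3)]`.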
@niconoe I added a file `stopwords.txt` to the repository. You can move it to wherever you think it fits better.
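As a rough sketch of how the file could be used, assuming `stopwords.txt` holds one word per line (the format is not specified in this thread). The helper names `load_stopwords` and `remove_stopwords` are hypothetical:

```python
def load_stopwords(lines):
    """Build a lowercase stopword set from an iterable of lines,
    for example an open file handle. Blank lines are skipped."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(words, stopwords):
    """Return the words that are not in the stopword set, preserving order."""
    return [word for word in words if word not in stopwords]
```

With a real file this would be called as `load_stopwords(open("stopwords.txt", encoding="utf-8"))`; the path depends on where the file ends up in the repository.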
Manually tested the endpoint and it looks ok. But some tests need to be implemented.
Proper testing in place!
For the frontend, we would need a REST GET endpoint that the frontend can call with the parameters `start_date:<date>`, `end_date:<date>` and `max_words:<integer>`.

It should return the `<max_words>` words with the highest term frequency*. Words are taken from the articles' `cleaned_text` attribute and converted to lowercase, and stopwords are removed**. Only articles with `published_date >= start_date` and `published_date <= end_date` and `epu_score > -0.15` are considered (so both dates are inclusive).

Note: an article's `published_at` attribute is a datetime object, while the endpoint is querying for dates. The incoming `start_date` and `end_date` should be compared to the date part of the article's `published_at` value.

*The term frequency is defined as the actual number of times that a word occurs in a given text. (I assume a `Counter` could be used for this.) As mentioned before, convert words to lowercase before counting.

**Stopwords are words listed in a file called `stopwords.txt`.

The result of this endpoint should look like this:
Example
I have two articles (I'll show them in JSON here):
and
and there is a stopwords.txt file containing the following words:
When a GET request comes in with `start_date: "2015-06-01"`, `end_date: "2015-06-17"` and `max_words: 2`, the output should be: