Important: I will pause this issue and work on #75 in the meantime. Fetching negative-labelled articles might take at least 1 week.
Problem
The NGO's tagged data only contains positive labels (e.g., this tweet IS a public service report). At this point, we haven't included negative labels (e.g., this tweet is NOT a public service report).
Proposed Solution
Add negative labels from the 2020 tagged data.
Tasks
Use the data annotated last year in C4V for Negative Labels (and positive if quick)
[x] Compare last year's annotated data schema with positive label dataset to assess if it's possible to include it.
[x] If possible, add the negative labels to the development dataset.
The labels are not compatible with #56. At this point, unifying the schemas would take more effort than web scraping from scratch.
If it's not possible to use last year's negative labels
~- [ ] Webscrape el pitazo articles where the URLs are not within #48 dataset.~
This will give us articles that are not public services problems.
~- [ ] Concatenate these articles with #48 dataset.~
Negative labels web scraping strategy
Loop over elpitazo.net/category/<LOCATION>/page/<N> to collect all the article links for the locations covered by the PSCDD positive labels dataset.
Select the links that aren't in the PSCDD positive labels dataset.
Web scrape these links with the PSCDD elpitazo web scraper.
[ ] Create elpitazo page discovery web scraper
[x] Extract links
[ ] Extract news articles' date. This will take a bit more time than I thought.
[ ] Fetch el pitazo links for occidente and store them in a list.
[ ] Find links that don't match the PSCDD positive labels links.
[ ] Web scrape the links that did not match any positive label link.
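The strategy above (build the category page URLs, extract the links, keep the ones missing from the positive labels dataset) could be sketched roughly as follows. This is only an illustration, not the PSCDD scraper: the `https` scheme, the function names, and the URL normalization rule are all assumptions, and it collects every anchor on a page, so a real page discovery scraper would still need a selector that keeps only article links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumption: the site is served over https at this host.
BASE_URL = "https://elpitazo.net"


def category_page_url(location: str, page: int) -> str:
    """Build elpitazo.net/category/<LOCATION>/page/<N> (hypothetical helper)."""
    return f"{BASE_URL}/category/{location}/page/{page}"


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, page_url: str) -> list[str]:
    """Return absolute URLs for all anchors in a category page."""
    parser = _LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.hrefs]


def normalize(url: str) -> str:
    """Comparison key: host without 'www.' plus path without trailing slash.

    Assumption: scheme, 'www.' and trailing slashes do not distinguish articles.
    """
    parts = urlsplit(url)
    return parts.netloc.lower().removeprefix("www.") + parts.path.rstrip("/")


def select_negative_candidates(discovered, positive_urls):
    """Keep discovered links that are absent from the positive labels dataset."""
    positives = {normalize(u) for u in positive_urls}
    seen = set()
    candidates = []
    for url in discovered:
        key = normalize(url)
        if key not in positives and key not in seen:
            seen.add(key)
            candidates.append(url)
    return candidates
```

In practice, the HTML of each `category_page_url("occidente", n)` would be downloaded (for example with `urllib.request.urlopen`), passed to `extract_links`, and the accumulated links filtered with `select_negative_candidates` against the URLs of the positive labels dataset before handing them to the existing scraper.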
Notes
The dates in #56 range from 2019-05-02 to 2020-10-30; the most frequent date is 2020-06-10:
count 2401
unique 397
top 2020-06-10 00:00:00
freq 27
first 2019-05-02 00:00:00
last 2020-10-30 00:00:00
News articles per location