code-for-venezuela / c4v-py


Add Negative Labels [PSCDD] #57

Open dieko95 opened 3 years ago

dieko95 commented 3 years ago

Important: I will pause this issue and advance on #75. Fetching the negative-labelled articles might take at least 1 week.

Problem

The NGO's tagged data only contains positive labels (e.g., this tweet IS a public service report). At this point, we haven't included negative labels (e.g., this tweet is NOT a public service report).

Proposed Solution

Add negative labels from the 2020 tagged data.

Tasks

Negative labels web scraping strategy

  1. Loop over elpitazo.net/category/<LOCATION>/page/<N> to collect all article links for each location that appears in the PSCDD positive-labels dataset.
  2. Keep only the links that are not in the PSCDD positive-labels dataset.
  3. Scrape these links with the PSCDD elpitazo web scraper.
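
A minimal sketch of steps 1 and 2, using only the Python standard library. It is an illustration under assumptions, not the project's implementation: the helper names (`category_page_url`, `extract_links`, `negative_candidates`) are hypothetical, the site is assumed to be served over HTTPS, and URLs are compared after dropping the scheme and trailing slash. Fetching the pages and step 3 (the PSCDD elpitazo scraper) are left out because their details are not described in this issue.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

BASE_URL = "https://elpitazo.net"  # assumption: HTTPS host for the site


def category_page_url(location, page):
    """Build the listing URL for page <N> of a <LOCATION> category."""
    return f"{BASE_URL}/category/{location}/page/{page}"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the site root.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html):
    """Return every link in a listing page, in document order."""
    parser = LinkCollector(BASE_URL)
    parser.feed(html)
    return parser.links


def normalize(url):
    """Drop scheme, query, fragment, and trailing slash before comparing."""
    parts = urlparse(url)
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def negative_candidates(listing_links, positive_links):
    """Keep listing links that are absent from the positive-labels set.

    Order is preserved and repeated links are returned only once.
    """
    seen = {normalize(url) for url in positive_links}
    result = []
    for link in listing_links:
        key = normalize(link)
        if key not in seen:
            seen.add(key)
            result.append(link)
    return result
```

A real listing page also contains navigation and category links, so the collected links would likely need an extra filter (for example, on the URL path) before they are passed to the scraper.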

Notes

count                    2401
unique                    397
top       2020-06-10 00:00:00
freq                       27
first     2019-05-02 00:00:00
last      2020-10-30 00:00:00

News articles per location

location              count
occidente               519
gran-caracas            403
oriente                 396
los-andes               287
los-llanos              284
centro                  196
guayana                  93
pitazo-en-la-calle       88
regiones                 64
economia                 21
infociudadanos           16
tecnologia               10
vista_2                   8
reportajes                4
radio                     3
alianzas                  2
sucesos                   2
salud                     2
sin-categoria             1
fotogalerias              1
cronicas                  1

dieko95 commented 3 years ago

I will pause this issue and advance on #75. Fetching the negative-labelled articles might take at least 1 week.

cc @Edilmo