JBGruber / opinion-wg2


Build model for relevance classification #19

Open JBGruber opened 3 months ago

JBGruber commented 3 months ago

We now have a dataset of 1,002 coded abstracts, 554 of which are relevant (based on #9). This was a lot of work and I can't thank everyone who was involved enough. However, there are still roughly 4,000 unlabelled abstracts.

I don't think coding more abstracts manually is worth it. But if we build a model that does it for us, we could add more abstracts to the full-paper annotation on demand (assuming that step can also be done somewhat automatically).

In short: we should use the coded abstracts to build a classifier.
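As a starting point, here is a minimal sketch of such a classifier in Python (TF-IDF features plus logistic regression). The file name and the column names `title`, `abstract`, and `relevant` are assumptions about the exported data, not confirmed:

```python
# Baseline sketch: TF-IDF over title + abstract, logistic regression.
# File name and column names ("title", "abstract", "relevant") are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("coded_abstracts.csv")  # hypothetical file name
df["text"] = df["title"].fillna("") + " " + df["abstract"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["relevant"], test_size=0.2,
    stratify=df["relevant"], random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

This is just a baseline to beat; anything from gradient boosting to a fine-tuned transformer could replace the logistic regression step.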

JBGruber commented 3 weeks ago

Salamanca task description

The task is relatively straightforward.

Here is the manually annotated data: https://drive.google.com/file/d/1uWkoGrdIaSIwagJpgB1WqSMGniQ94mph/view?usp=drive_link

The annotations were made based on the title and text of the abstracts. You could also experiment with using more variables; see the sketch below.
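One hedged sketch of how extra variables could be folded in, using scikit-learn's `ColumnTransformer`: separate TF-IDF features for title and abstract, plus a one-hot encoded categorical column. The column name `journal` is purely illustrative; swap in whatever variables the export actually contains:

```python
# Sketch: combine separate text features with an additional categorical variable.
# "title", "abstract", "relevant" and especially "journal" are assumed column names.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

features = ColumnTransformer([
    ("title", TfidfVectorizer(min_df=2), "title"),
    ("abstract", TfidfVectorizer(ngram_range=(1, 2), min_df=2), "abstract"),
    ("journal", OneHotEncoder(handle_unknown="ignore"), ["journal"]),
])
clf = make_pipeline(
    features,
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
# clf.fit(df[["title", "abstract", "journal"]], df["relevant"])
```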

You should fork the repository on GitHub, and I would suggest working in Quarto, as we did in the rest of the project. But an R or Python script is also fine.

Determining whether research was relevant for us (i.e., whether the authors were using or developing a tool for opinion mining) was rather difficult in the manual annotation, so failure is a possibility, I think. But it is worth trying to see whether a machine can do this.
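To judge whether the classifier actually works better than guessing, a cross-validated comparison against a majority-class baseline would be one way to check. Again a sketch, assuming the same hypothetical file and column names as above and a binary 0/1 coding of `relevant`:

```python
# Sketch: cross-validated check of whether the model beats always predicting
# the majority class. File name, column names, and 0/1 label coding are assumptions.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

df = pd.read_csv("coded_abstracts.csv")  # hypothetical file name
text = df["title"].fillna("") + " " + df["abstract"].fillna("")
y = df["relevant"]

clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
scoring = ["accuracy", "precision", "recall", "f1"]
model = cross_validate(clf, text, y, cv=5, scoring=scoring)
dummy = cross_validate(DummyClassifier(strategy="most_frequent"), text, y,
                       cv=5, scoring=scoring)
for m in scoring:
    print(f"{m}: model={model['test_' + m].mean():.2f} "
          f"baseline={dummy['test_' + m].mean():.2f}")
```

If the model's precision and recall are not clearly above the baseline, that would be a sign the abstracts alone do not carry enough signal for this task.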