[NLP] Consider adding distinction in research filter for automatically classified posts vs prediction based classification

Common-SenseMakers / sensemakers

Sensemakers infrastructure for developing AI-based tools for semantic annotations of social posts. Cross-poster app to publish your semantic posts on different networks.

GNU General Public License v3.0

[NLP] Consider adding distinction in research filter for automatically classified posts vs prediction based classification #76

Open ronentk opened 2 months ago

ronentk commented 2 months ago

For example, something like this:

from enum import Enum

class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    # Posts automatically classified as research
    # (for example, based on citoid item types)
    RESEARCH_AUTO = "research_auto"
    # Posts predicted to be related to research
    RESEARCH_PRED = "research_pred"
    # Posts predicted to be unrelated to research
    NOT_RESEARCH = "not_research"

This would replace the current form:

class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    RESEARCH = "research"
    NOT_RESEARCH = "not_research"

The rationale is: (1) it can help with the filter evaluation by differentiating between easy cases (auto) and hard cases (pred); (2) we might want to use the information in the app to further organize the queue/UX.

What do you think, @ShaRefOh?

ShaRefOh commented 2 months ago

We can present the data in a meaningful way, but we should not evaluate it as a multi-label problem, since the true labels are binary by definition. What are the conditions for getting "research_auto"? I already have the item types logged in the outcome dataset, so I can simply use them to run an evaluation that aggregates the results by type.
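
One way to reconcile the two views, sketched here under assumptions: keep the binary true labels for scoring, and use the finer-grained classification only to group the results. The enum follows the proposal above; `to_binary`, `accuracy_by_source`, and the sample records are hypothetical names and made-up data for illustration, not code or results from the repository.

```python
from collections import defaultdict
from enum import Enum


class SciFilterClassfication(Enum):
    # Values proposed in this issue (an assumption, not the current repository code).
    NOT_CLASSIFIED = "not_classified"
    RESEARCH_AUTO = "research_auto"
    RESEARCH_PRED = "research_pred"
    NOT_RESEARCH = "not_research"


def to_binary(label):
    """Collapse the proposed classes to a binary research / not-research label."""
    return label in (
        SciFilterClassfication.RESEARCH_AUTO,
        SciFilterClassfication.RESEARCH_PRED,
    )


def accuracy_by_source(records):
    """Score against binary true labels, grouped by how the decision was reached.

    `records` is a list of (filter_output, true_is_research) pairs.
    Every output other than RESEARCH_AUTO is counted in the "pred" group.
    Returns {group_name: (n_correct, n_total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for output, truth in records:
        group = "auto" if output is SciFilterClassfication.RESEARCH_AUTO else "pred"
        counts[group][0] += int(to_binary(output) == truth)
        counts[group][1] += 1
    return {group: tuple(c) for group, c in counts.items()}


# Made-up records, for illustration only.
records = [
    (SciFilterClassfication.RESEARCH_AUTO, True),
    (SciFilterClassfication.RESEARCH_PRED, True),
    (SciFilterClassfication.RESEARCH_PRED, False),
    (SciFilterClassfication.NOT_RESEARCH, False),
]
print(accuracy_by_source(records))
```

The metric itself stays binary; the auto/pred split only decides which bucket a post is reported in, so the easy and hard cases can be inspected separately without changing the ground truth.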

ronentk commented 2 months ago


item_types_whitelist = [
    "bookSection",
    "journalArticle",
    "preprint",
    "book",
    "manuscript",
    "thesis",
    "presentation",
    "conferencePaper",
    "report",
]

    # if any item types are on the whitelist, pass automatically
    if len(set(result.item_types).intersection(set(item_types_whitelist))) > 0:
        return SciFilterClassfication.RESEARCH

(https://github.com/Common-SenseMakers/sensemakers/blob/nlp-dev/nlp/desci_sense/shared_functions/filters/research_filter.py)
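
With the proposed split, the whitelist check could return the automatic class, and only the remaining posts would get a prediction-based class. A minimal sketch, assuming a hypothetical `classify` function that receives the item types and a boolean prediction (or `None` when no prediction was made); the real filter's signature and prediction logic are not shown in this thread.

```python
from enum import Enum


class SciFilterClassfication(Enum):
    # Values proposed in this issue (an assumption, not the current repository code).
    NOT_CLASSIFIED = "not_classified"
    RESEARCH_AUTO = "research_auto"
    RESEARCH_PRED = "research_pred"
    NOT_RESEARCH = "not_research"


# Same item types as the whitelist quoted above, stored as a set for fast lookup.
ITEM_TYPES_WHITELIST = {
    "bookSection",
    "journalArticle",
    "preprint",
    "book",
    "manuscript",
    "thesis",
    "presentation",
    "conferencePaper",
    "report",
}


def classify(item_types, predicted_research):
    """Hypothetical helper: whitelist hit first, prediction second.

    `predicted_research` is True/False from some classifier, or None if
    no prediction is available (mapping None to NOT_CLASSIFIED is an assumption).
    """
    # Any whitelisted item type passes automatically.
    if ITEM_TYPES_WHITELIST.intersection(item_types):
        return SciFilterClassfication.RESEARCH_AUTO
    if predicted_research is None:
        return SciFilterClassfication.NOT_CLASSIFIED
    if predicted_research:
        return SciFilterClassfication.RESEARCH_PRED
    return SciFilterClassfication.NOT_RESEARCH


print(classify(["preprint", "webpage"], False).value)
print(classify(["webpage"], True).value)
```

Note that a whitelist hit takes precedence over the prediction here, which matches the "pass automatically" behaviour of the quoted check.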

ronentk commented 2 months ago

@ShaRefOh this condition holds for your annotations as well, right?

if len(set(result.item_types).intersection(set(item_types_whitelist))) > 0:
    return SciFilterClassfication.RESEARCH