LucidAi / nlcd

News Life Cycle Detector
MIT License

Add segment filter, based on google search results. #1

Closed zaycev closed 10 years ago

zaycev commented 10 years ago

Background

Currently, we filter origin sentences using counts and thresholds for different keywords:

This method is slow, hard to tune, and gives poor results unless we train accurate models.

However, we can try using the number of Google CSE search results to estimate how general or how unique/important a given sentence is.
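
A minimal sketch of that idea in Python, not the actual nlcd implementation. It assumes the Custom Search JSON API endpoint with its `exactTerms`, `q`, `cx` and `key` parameters, and reads the hit count from `searchInformation.totalResults` in the response. The function names, the default thresholds and the inclusive comparison are hypothetical.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(sentence, cx, key):
    """Build an exact-match Custom Search request URL for one sentence."""
    params = {"exactTerms": sentence, "q": sentence, "cx": cx, "key": key}
    # urlencode applies quote_plus, so spaces become '+' and commas '%2C'.
    return CSE_ENDPOINT + "?" + urlencode(params)


def total_results(response):
    """Read the hit count from a decoded CSE JSON response.

    The API returns the count as a string, so convert it to int.
    """
    return int(response["searchInformation"]["totalResults"])


def is_key(count, min_threshold=0, max_threshold=1000):
    """Hypothetical rule: keep a sentence whose hit count lies within
    the thresholds (bounds assumed inclusive)."""
    return min_threshold <= count <= max_threshold
```

Fetching the URL and decoding the JSON is left out on purpose; any HTTP client would do, and the thresholds would still need tuning on real data.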

Task

zaycev commented 10 years ago

I added a couple of filtering results based on the number of CSE search results. The result JSON files can be found in this directory.

To me, it looks like we probably need finer sentence segmentation, especially for long sentences, since the "exactTerms" option forces CSE to be very strict. For example:

{
    "cseUrl": "https://www.googleapis.com/customsearch/v1?exactTerms=The+report+also+concluded+that+chemical+weapons+had+been+used+in+the+northwest+town+of+Saraqeb+on+April+29%2C+based+on+evidence+that+included+interviews+with+medical+clinicians%2C+medical+records+and+organ+samples+of+a+deceased+victim.&q=The+report+also+concluded+that+chemical+weapons+had+been+used+in+the+northwest+town+of+Saraqeb+on+April+29%2C+based+on+evidence+that+included+interviews+with+medical+clinicians%2C+medical+records+and+organ+samples+of+a+deceased+victim.&cx=006963090954080588802:tttidla4was&key=AIzaSyCPgS44dQwte1_H5-AK_kAZIbLjrZSwXIU", 
    "isKey": true, 
    "minThreshold": 0, 
    "maxThreshold": 1000, 
    "text": "The report also concluded that chemical weapons had been used in the northwest town of Saraqeb on April 29, based on evidence that included interviews with medical clinicians, medical records and organ samples of a deceased victim.", 
    "totalResults": 1
}

CSE finds only the link to the original article, but if we split the sentence into three parts (by commas?), we will probably get better recall.
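
A rough sketch of the comma-splitting idea, just to make it concrete. The function name and the `min_words` cutoff are hypothetical, and a naive comma split will also cut lists and dates, so this is a baseline rather than a real segmenter.

```python
def split_on_commas(sentence, min_words=3):
    """Split a sentence at commas.

    Fragments shorter than min_words are merged back into the previous
    fragment, since very short exact-match queries are too general.
    A short first fragment has no predecessor and is kept as-is.
    """
    parts = [p.strip() for p in sentence.split(",") if p.strip()]
    segments = []
    for part in parts:
        if segments and len(part.split()) < min_words:
            segments[-1] += ", " + part
        else:
            segments.append(part)
    return segments
```

On the sentence above this yields three segments, each of which could be sent to CSE as its own "exactTerms" query.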

@eovchinn Do you know of a good sub-sentence segmenter?