Open sschneiderman opened 11 years ago
That's partially implemented today via Tag Weighting. When a user creates a source, they can set a number of user-defined tags, and those tags are applied to every document harvested from that source. If you give each source a unique tag, you can then define weights to apply to query scoring on the Advanced Options pane. The format is "Tag1": number, "Tag2": number, and so on, where each number is the weighting factor you want applied to the score. So for an RSS feed of CNN sources, you can tag the source with "CNN", and if you want all CNN documents weighted x 2, you'd put "CNN": 2 in the tag weighting. When you run a query, each document is assigned an overall score based on how well it matches the query terms, and that score is then weighted further by whatever geo / time / tag weighting parameters are set. Note that in the current implementation you can update a source's tags, but this only affects new documents; it's not retroactive. There's an open issue to make this retroactive, but we do not have an ETA for when it might be worked into an upcoming build.
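To make the "CNN": 2 example concrete, here is a minimal Python sketch of how a tag weight could scale a query score. It is illustrative only: the function name `apply_tag_weights`, the sample documents, and the choice to multiply weights together when a document carries several weighted tags are all assumptions, not the platform's actual scoring code. Geo and time weighting are left out.

```python
from typing import Dict, Iterable


def apply_tag_weights(base_score: float,
                      doc_tags: Iterable[str],
                      tag_weights: Dict[str, float]) -> float:
    """Scale a query-match score by the weight of each tag on the document.

    Assumption: weights combine multiplicatively, and a tag with no
    configured weight leaves the score unchanged (factor 1.0).
    """
    weighted = base_score
    for tag in doc_tags:
        weighted *= tag_weights.get(tag, 1.0)
    return weighted


# Same shape as the Advanced Options entry: "CNN": 2
tag_weights = {"CNN": 2.0}

# Hypothetical documents with the tags inherited from their source.
docs = [
    {"title": "Other story", "tags": ["Blog"], "score": 15.0},
    {"title": "CNN story", "tags": ["CNN"], "score": 10.0},
]

# Rank by weighted score, highest first.
ranked = sorted(
    docs,
    key=lambda d: apply_tag_weights(d["score"], d["tags"], tag_weights),
    reverse=True,
)
```

With these made-up numbers the CNN document ranks first even though its raw match score is lower, which is the effect tag weighting is meant to produce.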
From a functional perspective, the case management layer would also partially resolve the issue you're describing: once an analyst flags a document as relevant to a case, it can be moved into the supporting evidence folder. At that level you'll only be working with documents deemed relevant by an analyst, while the analysis / collection layer retains granular query-specific relevance.
On Wed, Apr 24, 2013 at 11:42 AM, sschneiderman notifications@github.com wrote:
Andrew, We previously discussed methods for promoting or demoting source documents based on analyst judgment. This was an interest of both Aveshka and CGS. Pls advise if there is any follow up on how this might work. Thanks, Scott
Andrew Strite Intelligence Solutions Architect | IKANOW http://www.ikanow.com Email: astrite@ikanow.com Mobile: 301.514.1384
Can you provide training on Thursday on how Tag Weighting would be applied to reduce false positives on similar names (John Smith the target versus John Smith the innocent bystander)? I understand the principle but not the implementation. Thanks.
That's a slightly different issue. Tag weighting is appropriate for inflating the score of a particular kind of document (e.g. all those from CNN or Databot), which ensures that those documents show up before others.
"False positives" like the one you describe are better solved using alternative query strategies and query qualifiers, and to a lesser extent aliasing. Selecting documents that match the correct John Smith and finding associated entities will give you additional query parameters. These terms, if included in the query for John Smith, should push the relevant documents up to the top.
e.g. John Smith AND (Company A OR Company B OR Associate A OR Associate B)
Alternatively, if you have a scenario where you have John Smith (incorrect person) and John B. Smith (correct person), you can either discard one of the entities so it no longer displays or run queries like:
e.g. (John B. Smith OR "John Smith") NOT John Smith.
A certain amount of experimentation is probably required to develop an effective query.
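As a rough illustration of the qualifier approach, the sketch below assembles a query string of the form "name AND (qualifier OR qualifier ...)" from a list of associated entities. The helper names `quote` and `qualified_query` are hypothetical, and wrapping each term in double quotes is an assumption about phrase matching; the platform's actual query syntax may differ.

```python
from typing import Sequence


def quote(term: str) -> str:
    """Wrap a term in double quotes so multi-word names stay together.

    Assumption: the query language treats quoted text as an exact phrase.
    """
    return f'"{term}"'


def qualified_query(name: str, qualifiers: Sequence[str]) -> str:
    """Build: "name" AND ("q1" OR "q2" ...).

    With no qualifiers, the query is just the quoted name.
    """
    if not qualifiers:
        return quote(name)
    alternatives = " OR ".join(quote(q) for q in qualifiers)
    return f"{quote(name)} AND ({alternatives})"


# Entities found on documents about the correct John Smith (placeholders).
query = qualified_query(
    "John Smith",
    ["Company A", "Company B", "Associate A", "Associate B"],
)
print(query)
```

Keeping the qualifiers in a list makes it easy to add or drop associated entities while experimenting with which combination pushes the relevant documents to the top.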
As an aside, John Smith (the accountant) vs. John Smith (the priest) isn't a true false positive. In both cases, a query for John Smith should bring back matches for "John Smith" (of whatever entity type you define). A false positive would be documents getting labeled with John Smith when they are not actually about that entity. That is more like the situation where an advertisement causes a document to be flagged as being about a company that is not actually mentioned in the text.
Understood. Let's discuss again Thursday.
Andrew, We previously discussed methods for promoting or demoting source documents based on analyst judgment. This was an interest of both Aveshka and CGS. Pls advise if there is any follow up on how this might work. Thanks, Scott