TextpressoDevelopers / textpressocentral

Textpressocentral frontend web application

Display of Accession in search results - duplicates? #30

Open goldturtle opened 5 years ago

goldturtle commented 5 years ago

@vanaukenk commented on Aug 17, 2015

What determines which paper accession is displayed in the search results?

A keyword search on 'mut-7' lists both WBPaper ID and PMIDs, even though the PMIDs have corresponding WBPaper IDs.

Sometimes the same sentence appears listed under each accession separately, but the sentence actually has a different score depending on the accession.

As an example, a search with 'MUT-7' and the 'mf enz activ assay' and 'mf enz activ verbs' categories lists, as the third and fourth entries, the same sentence with scores of 0.611 and 0.592, respectively, for WBPaper00024699 and PMID 15653635.

Thx. --Kimberly

goldturtle commented 5 years ago

@vanaukenk commented on Feb 29, 2016

Duplicate papers are still being returned with searches; one of the entries has all of the relevant IDs, the other only the PMID. The search scores are different for each entry.

goldturtle commented 5 years ago

@goldturtle commented on Feb 29, 2016

This smells like the paper is in the PMCOA corpus as well as the C. elegans corpus, but Yuling would need to investigate. As the papers are tokenized differently, the score differs.

M.

goldturtle commented 5 years ago

@vanaukenk commented on Feb 29, 2016

Yes, that makes sense. When there are duplicates, though, which one should be returned? Note also that the PMID-only papers display formatting when you click on the arrow to see the sentences:

goldturtle commented 5 years ago

@vanaukenk commented on Feb 29, 2016

For testing purposes, this is the search that I performed to get these results: Search scope: sentence Keywords: DYN-1 Categories (match all): MFEA assay terms MFEA verbs

goldturtle commented 5 years ago

@vanaukenk commented on Jul 1, 2016

What is the current status of this issue wrt the C. elegans corpus? Screenshots of searches of the C. elegans corpus still display duplicate papers. Did we decide that we would go with the PMCOA version if it existed, and if not, then use the PDF version of the paper? Also, a related issue, how do we want to handle supplemental files? It doesn't look like PMCOA contains the supplemental material, but I don't know if that's universally true. When they are available, some labeling of supplemental materials would help indicate where the additional results are from.
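The preference rule asked about above (use the PMCOA version when it exists, otherwise the PDF version) could be sketched roughly as follows. This is only an illustration: the `Hit` record, its field names, the corpus labels, and the assumption that both copies of a paper can be matched on a shared PMID are hypothetical and not taken from Textpresso's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; field names are assumptions for illustration."""
    pmid: str
    corpus: str                 # assumed labels: "PMCOA" or "C. elegans"
    wbpaper_id: Optional[str]   # None when only the PMID is known
    score: float


PREFERRED_CORPUS = "PMCOA"  # assumed preference, per the question in the thread


def deduplicate(hits: List[Hit]) -> List[Hit]:
    """Keep one hit per PMID, preferring the PMCOA copy when both exist."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.pmid)
        if current is None:
            # First time this paper is seen: keep it for now.
            best[hit.pmid] = hit
        elif hit.corpus == PREFERRED_CORPUS and current.corpus != PREFERRED_CORPUS:
            # Replace a non-preferred copy with the preferred one.
            best[hit.pmid] = hit
    # Return the survivors ordered by descending score.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Scores borrowed from the example earlier in this thread; which entry
# belongs to which corpus is an assumption made for this sketch.
hits = [
    Hit(pmid="15653635", corpus="C. elegans", wbpaper_id="WBPaper00024699", score=0.611),
    Hit(pmid="15653635", corpus="PMCOA", wbpaper_id=None, score=0.592),
]
unique_hits = deduplicate(hits)
```

A real implementation would probably also want to carry the WBPaper ID over to the surviving entry so that no accession is lost from the display.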

vanaukenk commented 3 years ago

@goldturtle @valearna

I was doing some searches of the C. elegans Textpresso site and it looks like the duplicate paper problem is becoming even more pervasive:

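image https://user-images.githubusercontent.com/1730534/114883468-f25ff500-9dd2-11eb-921f-05c26e5660c6.png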

I don't see this on the main Textpresso site, although the search results are very different there from the C. elegans site, as expected. Am I searching the correct site?

goldturtle commented 3 years ago

The site has changed a bit, in the sense that it now includes three literatures: C. elegans, C. elegans and Suppl, and C. elegans Supplementals. If you search more than one literature, you could get multiple entries for the same paper. Also, if you search C. elegans Supplementals, you will get multiple entries if your query finds matches in multiple Supplementals.

Michael.

vanaukenk commented 3 years ago

Okay, that makes more sense now. Thanks for pointing that out.

Should the results for checking 'C. elegans' AND 'C. elegans supplementals' be the same, then, as just selecting 'C. elegans and Supplementals'? I didn't see that in the search that I'm doing, but maybe there is another reason for that?

Perhaps the default literature setting should be to check just 'C. elegans and supplementals' and then users could narrow that to either category if they want to. We could see what people think on the Textpresso call.

textpresso commented 3 years ago

On 4/15/21 12:05 PM, vanaukenk wrote:

Should the results for checking 'C. elegans' AND 'C. elegans supplementals' be the same, then, as just selecting 'C. elegans and Supplementals'? I didn't see that in the search that I'm doing, but maybe there is another reason for that?

In principle, yes. However, "C. elegans and Supplementals" is a completely new document (it's a merged PDF), and the scoring might differ, as the length of a document factors into the score.

Michael.
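To illustrate why a merged PDF can score differently from the original paper, here is a toy length-normalized score. It is not Textpresso's scoring function: the formula (term count divided by the square root of the token count), the whitespace tokenizer, and the example sentences are all assumptions chosen only to show the effect of document length.

```python
import math


def toy_score(text: str, term: str) -> float:
    """Toy score: occurrences of `term` divided by sqrt(number of tokens)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    if not tokens:
        return 0.0
    term_count = tokens.count(term.lower())
    return term_count / math.sqrt(len(tokens))


# Made-up text standing in for a paper and its supplement.
paper = "mut-7 activity was measured"
supplement = "table s1 lists strains used in this study"
merged = paper + " " + supplement  # stands in for the merged PDF

paper_score = toy_score(paper, "mut-7")
merged_score = toy_score(merged, "mut-7")

print(paper_score, merged_score)
```

The keyword matches once in both texts, but the merged text is longer, so its normalized score is lower. The same kind of difference would arise if two copies of a paper were tokenized differently and ended up with different token counts.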

vanaukenk commented 3 years ago

Okay, got it. Thanks.

goldturtle commented 3 years ago

The default literature is now C. elegans for people who are not logged in.