allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

A bunch of PubMed IDs (PMIDs) are missing. Why? #22

Closed wammar closed 11 months ago

wammar commented 1 year ago

Ken Church:

A colleague asked me to do something with about 1k PMIDs. I found most of them in semantic scholar, but not these:

PMID:34995702
PMID:34457137
PMID:34831513
PMID:32662296
PMID:34932685
PMID:35015933
PMID:25666784
PMID:33226074

wammar commented 1 year ago

TODO:

  1. Determine how critical this is: run an Athena query to identify PubMed IDs for which we have the source but which don't belong to any paper cluster. Rodney suggests forwarding this to data on-call.
  2. Resend them to the paper clustering system to see if they get assigned a cluster.

robe-ai2 commented 1 year ago

@wammar I removed this from on-call so the API team can determine how critical it is first.

rodneykinney commented 1 year ago

We are missing less than 0.5% of PMC papers:

SELECT count(*) FROM sourced_papers
WHERE source_id LIKE 'PubMedCentral%'

5224813

SELECT count(DISTINCT corpus_paper_id) FROM paper_sources
WHERE source = 'PubMedCentral'

5213326
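
For readers who want the exact numbers, the arithmetic behind the "less than 0.5%" figure can be sketched as below. The two counts are copied from the query results above; the variable names are only illustrative.

```python
# Counts copied from the two query results in this comment.
sourced = 5_224_813    # rows in sourced_papers with a PubMedCentral source_id
clustered = 5_213_326  # distinct corpus_paper_id values with source PubMedCentral

# PMC sourced papers that did not end up attached to any corpus paper.
missing = sourced - clustered
missing_pct = 100 * missing / sourced

print(f"{missing} missing ({missing_pct:.2f}%)")
```

This gives roughly 11.5k papers, or about 0.22% of the PMC sourced papers.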

About 7% of them are missing from s2orc, but this is primarily because of failure to process the PDFs. https://github.com/allenai/scholar/issues/32425

rodneykinney commented 1 year ago

Assigning back to @wammar for prioritization. It would be a Small+ to try reprocessing those 10k missing PMC papers.

wammar commented 1 year ago

Thank you @rodneykinney for digging out the counts of PubMed-sourced papers with vs. without a canonical paper ID. Super helpful, as always! Quick follow-up: I'm curious why we're using a more permissive WHERE clause in the first query but not in the second. Asking in case the different condition might explain the diff.

Given the diff is only 0.5%, I suggest we deprioritize this for now but I'd like to understand our current hypothesis on why we're missing these in order to estimate how the risk is going to play out in the future (e.g., are we missing these papers altogether, or just missing the PMID? Are we likely to see a bigger or smaller percentage of dropped PMIDs in the future?)

Let's take PMID:34995702 as an example. Looking up the title of that missing PMID, I was able to find it in our corpus with paperId=46a6e1df7b39e93e5fa956184cc5ca68d5e6e607 and it has the right DOI and title (happy face) and no PMID (expected but still sad face):

// https://api.semanticscholar.org/graph/v1/paper/46a6e1df7b39e93e5fa956184cc5ca68d5e6e607?fields=externalIds
{
  "paperId": "46a6e1df7b39e93e5fa956184cc5ca68d5e6e607",
  "externalIds": {
    "DOI": "10.1016/j.jad.2022.01.024",
    "CorpusId": 245714377
  },
  "title": "Prevalence of suicidal ideation and suicide attempt among patients with traumatic brain injury: a meta-analysis"
}

Based on my understanding, this PMID is most likely missing because we weren't able to cluster it with the other sourced_papers that provide different representations of the same paper, hence marking this issue as a bug (@rodneykinney would you agree?). The good news is that we're not missing this paper completely, since we get different representations of PubMed papers from different sources. I sampled a few more examples from the list and found the same pattern: S2 has the paper with the correct DOI and title, matching PubMed, but we're missing the PMID.

rodneykinney commented 1 year ago

The WHERE clauses are different because in one table the source is an enum, but in the other table the source is a prefix in the ID string.

If a PMID is missing from our corpus entirely, it's because we dropped it from the DAQ for some reason. We retry on intermittent errors, so the most likely reason is that we failed to parse the XML. If we ingested the paper from PMC but clustered it incorrectly, then the PMID would be present in the corpus, as a duplicate.

In the case of PMID:34995702 we only get one hit searching for the title, so we must have dropped the PMC paper from the DAQ. So the error is mitigated by the fact that we do have the paper, but anybody attempting to look it up by its PMID won't be able to.
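
For anyone who wants to check their own list, here is a minimal sketch of a lookup by PMID with a fallback to DOI. It assumes the public Graph API paper endpoint quoted earlier in this thread accepts `PMID:<id>` and `DOI:<doi>` prefixed identifiers and answers 404 for an unknown ID. The helper names `paper_url` and `fetch_paper` are hypothetical, not part of any S2 client library.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Base of the paper endpoint quoted earlier in this thread.
API_ROOT = "https://api.semanticscholar.org/graph/v1/paper/"


def paper_url(paper_id: str, fields: str = "externalIds,title") -> str:
    """Build the lookup URL for an ID such as 'PMID:34995702' or 'DOI:10.1016/...'."""
    query = urllib.parse.urlencode({"fields": fields}, safe=",")
    return f"{API_ROOT}{urllib.parse.quote(paper_id, safe=':/')}?{query}"


def fetch_paper(paper_id: str):
    """Return the paper record as a dict, or None if the API answers 404.

    Assumption: an ID that the corpus does not know produces HTTP 404.
    Any other HTTP error is re-raised for the caller to handle.
    """
    try:
        with urllib.request.urlopen(paper_url(paper_id), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

With this, `fetch_paper("PMID:34995702") or fetch_paper("DOI:10.1016/j.jad.2022.01.024")` tries the PMID first and falls back to the DOI, which is the workaround this comment implies for papers whose PMID was dropped.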

cfiorelli commented 12 months ago

Seems to be resolved. EDIT: not resolved.

cfiorelli commented 11 months ago

Reviewed with Rodney; we found that 0.5% is within our current tolerance for this type of issue.