Closed wammar closed 11 months ago
TODO:
@wammar I removed this from on-call so the API team can determine how critical it is first.
We are missing less than 0.5% of PMC papers:
SELECT count(*) from sourced_papers
WHERE source_id like 'PubMedCentral%'
5224813
SELECT count(distinct corpus_paper_id) FROM paper_sources
WHERE source = 'PubMedCentral'
5213326
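For reference, the two counts above can be turned into a miss rate directly. This is only a quick arithmetic sketch: the variable names are illustrative, and it assumes the two counts are comparable (the first query is a plain `count(*)`, so any duplicate rows in `sourced_papers` would inflate the gap).

```python
# Counts copied from the two queries above.
sourced_pmc_papers = 5_224_813   # sourced_papers rows whose source_id starts with 'PubMedCentral'
corpus_pmc_papers = 5_213_326    # distinct corpus_paper_id values with source = 'PubMedCentral'

missing = sourced_pmc_papers - corpus_pmc_papers
missing_fraction = missing / sourced_pmc_papers

print(f"missing papers:   {missing}")
print(f"missing fraction: {missing_fraction:.2%}")
```

That works out to roughly 11.5k papers, or about 0.22%, which is consistent with the "less than 0.5%" figure.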
About 7% of them are missing from s2orc, but this is primarily because of failure to process the PDFs. https://github.com/allenai/scholar/issues/32425
Assigning back to @wammar for prioritization. It would be a Small+ to try reprocessing those 10k missing PMC papers.
Thank you @rodneykinney for digging out the counts of PubMed-sourced papers with vs. without a canonical paper ID. Super helpful, as always! Quick follow-up: I'm curious why we are using a more permissive WHERE clause in the first query than in the second. Asking in case the different conditions might explain the diff.
Given the diff is only 0.5%, I suggest we deprioritize this for now but I'd like to understand our current hypothesis on why we're missing these in order to estimate how the risk is going to play out in the future (e.g., are we missing these papers altogether, or just missing the PMID? Are we likely to see a bigger or smaller percentage of dropped PMIDs in the future?)
Let's take PMID:34995702 as an example. Looking up the title of that missing PMID, I was able to find it in our corpus with paperId=46a6e1df7b39e93e5fa956184cc5ca68d5e6e607
and it has the right DOI and title (happy face) and no PMID (expected but still sad face):
// https://api.semanticscholar.org/graph/v1/paper/46a6e1df7b39e93e5fa956184cc5ca68d5e6e607?fields=externalIds
{
"paperId": "46a6e1df7b39e93e5fa956184cc5ca68d5e6e607",
"externalIds": {
"DOI": "10.1016/j.jad.2022.01.024",
"CorpusId": 245714377
},
"title": "Prevalence of suicidal ideation and suicide attempt among patients with traumatic brain injury: a meta-analysis"
}
Based on my understanding, this PMID is most likely missing because we weren't able to cluster it with the other sourced_papers which provide different representations for the same paper, hence marking this issue as a bug (@rodneykinney would you agree?) The good news is that we're not missing this paper completely since we get different representations of pubmed papers from different sources. I sampled a few more examples from the list and found the same pattern: S2 has the paper with the correct DOI and title in pubmed, but we're missing the PMID.
The WHERE clauses are different because in one table the source is an enum, but in the other table the source is a prefix in the ID string.
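To illustrate why the two conditions select the same source despite looking different, here is a toy reproduction. Everything in it is an assumption for illustration: SQLite stands in for the real database, a TEXT column stands in for the enum, and the `PubMedCentral:<n>` ID format and sample rows are made up.

```python
import sqlite3

# In-memory stand-in for the two tables; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sourced_papers (source_id TEXT);
    CREATE TABLE paper_sources (corpus_paper_id INTEGER, source TEXT);
    INSERT INTO sourced_papers VALUES
        ('PubMedCentral:1'), ('PubMedCentral:2'), ('DBLP:3');
    INSERT INTO paper_sources VALUES
        (10, 'PubMedCentral'), (11, 'DBLP');
""")

# Source is encoded as a prefix of the ID string, so a prefix match is needed.
ingested = conn.execute(
    "SELECT count(*) FROM sourced_papers WHERE source_id LIKE 'PubMedCentral%'"
).fetchone()[0]

# Source is its own column here, so an equality test is enough.
in_corpus = conn.execute(
    "SELECT count(DISTINCT corpus_paper_id) FROM paper_sources "
    "WHERE source = 'PubMedCentral'"
).fetchone()[0]

print(ingested, in_corpus)
```

In this toy data the prefix match finds two ingested PMC records and the equality test finds one in the corpus, so the gap between the two counts reflects a missing paper rather than a difference in the conditions.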
If a PMID is missing from our corpus entirely, it's because we dropped it from the DAQ for some reason. We retry on intermittent errors, so the most likely reason is that we failed to parse the XML. If we ingested the paper from PMC but clustered it incorrectly, then the PMID would be present in the corpus, as a duplicate.
In the case of PMID:34995702 we only get one hit searching for the title, so we must have dropped the PMC paper from the DAQ. So the error is mitigated by the fact that we do have the paper, but anybody attempting to look it up by its PMID won't be able to.
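The reasoning above can be written down as a small triage helper. This is a sketch, not an existing tool: `diagnose_missing_pmid` and its return labels are hypothetical, and it assumes title-search hits are shaped like the Graph API record shown earlier, with the PMID (when present) stored as a string under a `"PubMed"` key in `externalIds`.

```python
def diagnose_missing_pmid(title_hits, pmid):
    """Classify a PMID lookup failure from the hits of a title search.

    title_hits: list of dicts with an optional "externalIds" mapping.
    pmid: the PMID as a string, e.g. "34995702".
    """
    # Hits that carry the PMID (assumed to live under the "PubMed" key).
    carriers = [
        hit for hit in title_hits
        if (hit.get("externalIds") or {}).get("PubMed") == pmid
    ]
    if carriers:
        # The PMC record was ingested. More than one hit for the same title
        # suggests it was clustered incorrectly and sits on a duplicate.
        return "misclustered_duplicate" if len(title_hits) > 1 else "ok"
    if len(title_hits) == 1:
        # Paper exists via another source, but no record carries the PMID:
        # the PMC record was most likely dropped from the DAQ.
        return "dropped_from_daq"
    return "unclear"


# The example from this thread: one hit, DOI present, no PMID.
hits = [{
    "paperId": "46a6e1df7b39e93e5fa956184cc5ca68d5e6e607",
    "externalIds": {"DOI": "10.1016/j.jad.2022.01.024", "CorpusId": 245714377},
}]
print(diagnose_missing_pmid(hits, "34995702"))
```

For the thread's example this yields `dropped_from_daq`. Cases with zero hits, or with several hits none of which carry the PMID, are left as `unclear` because the thread does not say how to interpret them.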
Seems to be resolved EDIT: not resolved
Reviewed with Rodney; we found that 0.5% is within our current tolerance for this type of issue.
Ken Church: