Closed de-code closed 1 year ago
@de-code As I understand it, the way you use embeddings to generate recommendations leaves some room for differences from the results of the recommendations API. If the deltas you're observing seem more concerning than that would explain, we could hop on a call and review the process together. Thank you!
@cfiorelli I am exploring different ways of using the S2 API or data, in part to work around some limitations. In this case I was only using the (SPECTER 1) embeddings provided by the S2 Graph API. The order is otherwise consistent. To me it looks like some relevant papers were just not considered by the recommendations, even though at least some of those missing papers were published within the 60 day window (but not too recently either).
For an end user use-case it means the recommendations seem to miss out some relevant papers (and that may include papers that should have been ranked quite highly).
I'd be more than happy to hop on a call. (I am in the UK timezone.)
Hi!
It's been over 60 days since the example you provided initially, so it's difficult for me to recreate this scenario. I could use some more information. Specifically:
I do want to note that differences shouldn't be surprising given that it's unlikely your setup matches our internal setup for recommendations.
Hi,
I've run the comparison again. This time there aren't many missing papers for the same DOI (10.1101/2023.06.26.546135).
The only paper that appears to be missing is: 10.1101/2023.09.06.556476
Via https://api.semanticscholar.org/graph/v1/paper/DOI:10.1101/2023.09.06.556476?fields=externalIds,venue,publicationVenue,publicationTypes,publicationDate
I can see it is still in S2:
{
"paperId": "73789e2c6aa78e1c2b85fc1f4051577a487f66a2",
"externalIds": {
"DOI": "10.1101/2023.09.06.556476",
"CorpusId": 261661823
},
"publicationVenue": {
"id": "027ffd21-ebb0-4af8-baf5-911124292fd0",
"name": "bioRxiv",
"type": "journal",
"url": "http://biorxiv.org/"
},
"venue": "bioRxiv",
"publicationTypes": null,
"publicationDate": "2023-09-07"
}
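For anyone repeating this check, the lookup above can be scripted. The following is a minimal sketch using only the Python standard library. The helper names `paper_lookup_url` and `fetch_paper` are hypothetical, and the sketch only assumes the endpoint and `fields` values shown in the URL above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

GRAPH_API_BASE = "https://api.semanticscholar.org/graph/v1/paper/"


def paper_lookup_url(doi, fields):
    """Build the Graph API lookup URL for a DOI (hypothetical helper)."""
    # Keep commas unescaped so the URL matches the form used in this thread.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{GRAPH_API_BASE}DOI:{quote(doi, safe='/')}?{query}"


def fetch_paper(doi, fields, timeout=30):
    """Fetch and decode the JSON record for a DOI (performs a network call)."""
    with urlopen(paper_lookup_url(doi, fields), timeout=timeout) as response:
        return json.load(response)


fields = ["externalIds", "venue", "publicationVenue", "publicationTypes", "publicationDate"]
url = paper_lookup_url("10.1101/2023.09.06.556476", fields)
print(url)
```

Calling `fetch_paper("10.1101/2023.09.06.556476", fields)` would perform the request itself. It is left uncalled here so the sketch runs without network access.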
Based on the embedding vectors, that preprint ranks 7th:
['10.1101/2023.08.24.554055',
'10.1101/2023.09.12.557484',
'10.1101/2023.09.05.556422',
'10.1101/2023.09.08.556871',
'10.1101/2023.09.15.557832',
'10.1101/2023.10.05.561142',
'10.1101/2023.09.06.556476',
'10.1101/2023.08.28.555037',
'10.1101/2023.08.15.23294128',
'10.1101/2023.10.02.560422',
'10.1101/2023.08.29.555368',
'10.1101/2023.09.05.556397',
'10.1101/2023.09.21.558743',
'10.1101/2023.09.15.558006',
'10.1101/2023.08.18.553846',
'10.1101/2023.09.05.556399',
'10.1101/2023.08.14.553289',
...
I could compare related papers for other DOIs if that would be useful.
Hi @de-code,
Thanks for following up. For single-paper recommendation we use an approximate K-nearest neighbor index. We have verified the paper above (10.1101/2023.09.06.556476) is present in the index, and thus eligible for recommendation, but is not being recommended for your probe paper (10.1101/2023.06.26.546135).
This could be due to the fact that the algorithm is only approximate rather than exhaustive (a necessity in order to provide this service at low latency and scale), or perhaps because you are using a different distance calculation (our index uses L2, thus you may see discrepancy if you are using cosine distance).
Please also note that our index only includes papers for which S2 has abstract text. Although we provide embeddings for papers with unknown abstracts, these are of lower quality, and we do not use them in this particular application.
Best, Chris
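To illustrate the distance-metric point in the reply above, here is a small sketch showing that L2 and cosine distance can order the same candidates differently when vectors are not unit length. The two-dimensional vectors and the names `paper_a` and `paper_b` are made up for illustration and are not S2 embeddings. For unit-length vectors the two orderings coincide, because the squared L2 distance equals 2 minus twice the cosine similarity. Whether the S2 embeddings are normalized is not stated in this thread.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank(query, candidates, distance):
    """Return candidate names sorted from nearest to farthest."""
    return sorted(candidates, key=lambda name: distance(query, candidates[name]))


# Toy data (assumption): same direction as the query but longer, versus
# slightly different direction but closer in absolute position.
query = [1.0, 0.0]
candidates = {"paper_a": [3.0, 0.0], "paper_b": [1.0, 0.5]}

l2_order = rank(query, candidates, l2_distance)
cosine_order = rank(query, candidates, cosine_distance)
print("L2 order:    ", l2_order)
print("cosine order:", cosine_order)
```

Here `paper_a` points in exactly the same direction as the query, so cosine distance ranks it first, while `paper_b` is closer in Euclidean terms, so L2 ranks it first.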
Thank you for looking into it and explaining it.
Describe the bug
Using the embedding vectors provided by S2, I performed a vector search for a vector picked from a single paper. I found the papers recommended by S2 (I only looked at those with the DOI prefix 10.1101). But I also found papers using the vector search that were not recommended by the Paper Recommendations API. Those papers are still in S2. That seemed like a bug?
To Reproduce
Query DOI: 10.1101/2023.06.26.546135
URL: https://api.semanticscholar.org/recommendations/v1/papers/forpaper/DOI:10.1101/2023.06.26.546135?fields=externalIds&limit=500
DOIs with prefix 10.1101 recommended:
Some of the DOIs that I found via vector search that are not in the above list:
At least the top five of those are currently available via S2. 10.1101/2023.07.27.23293177, for example, was published 2023-07-31 (not too recent). 10.1101/2023.06.14.544834 was published 2023-06-15 (54 days before today).
Expected behavior
I would expect S2 to match the papers found via vector search, except where data may have changed or papers are on the border of the 60 day window.
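The comparison described in this report can be sketched as an exhaustive nearest-neighbour search followed by a set difference against the recommended papers. This is a minimal illustration under stated assumptions, not the exact procedure used in the report. The helper names, the toy two-dimensional vectors, and the placeholder identifiers are all hypothetical, and L2 is used only because the maintainers mention it as the metric of their index.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def exhaustive_neighbors(query_id, embeddings, k):
    """Return the k nearest paper ids to query_id by brute-force L2 search."""
    query = embeddings[query_id]
    scored = [
        (l2_distance(query, vector), paper_id)
        for paper_id, vector in embeddings.items()
        if paper_id != query_id
    ]
    scored.sort()
    return [paper_id for _, paper_id in scored[:k]]


def missing_from_recommendations(neighbors, recommended):
    """Keep neighbours (in rank order) that the recommendations did not return."""
    recommended_set = set(recommended)
    return [paper_id for paper_id in neighbors if paper_id not in recommended_set]


# Toy data (assumption): stand-ins for embeddings fetched from the Graph API
# and for the ids returned by the recommendations endpoint.
embeddings = {
    "query": [0.0, 0.0],
    "p1": [0.1, 0.0],
    "p2": [0.0, 0.3],
    "p3": [0.5, 0.5],
    "p4": [2.0, 2.0],
}
recommended = ["p1", "p3"]

neighbors = exhaustive_neighbors("query", embeddings, k=3)
missing = missing_from_recommendations(neighbors, recommended)
print("nearest:", neighbors)
print("not recommended:", missing)
```

Because the search is exhaustive, it can surface neighbours that an approximate index skips, which is one of the explanations offered in the thread for the gap.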