allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Paper Recommendations API appears to be missing some relevant papers. #134

Closed de-code closed 1 year ago

de-code commented 1 year ago

Describe the bug: Using the embedding vectors provided by S2, I ran a vector search with the vector of a single paper. The search returned the papers recommended by S2 (I only looked at those with the DOI prefix 10.1101). But it also returned papers that the Paper Recommendations API did not recommend, even though those papers are in S2. That seems like a bug?

To Reproduce:
Query DOI: 10.1101/2023.06.26.546135
URL: https://api.semanticscholar.org/recommendations/v1/papers/forpaper/DOI:10.1101/2023.06.26.546135?fields=externalIds&limit=500

DOIs with prefix 10.1101 recommended:

['10.1101/2023.07.07.548149',
 '10.1101/2023.07.14.548872',
 '10.1101/2023.07.31.23293441',
 '10.1101/2023.07.17.549345',
 '10.1101/2023.07.21.23292994',
 '10.1101/2023.07.24.548988',
 '10.1101/2023.06.21.545948',
 '10.1101/2023.06.15.23290026',
 '10.1101/2023.07.27.550860',
 '10.1101/2023.07.14.548928',
 '10.1101/2023.06.22.546003',
 '10.1101/2023.07.11.548645',
 '10.1101/2023.06.14.544764',
 '10.1101/2023.07.27.23292878',
 '10.1101/2023.07.31.551312',
 '10.1101/2023.07.21.548740',
 '10.1101/2023.07.22.550159',
 '10.1101/2023.08.02.551154',
 '10.1101/2023.06.30.547218',
 '10.1101/2023.07.14.23292649',
 '10.1101/2023.07.06.547955',
 '10.1101/2023.06.15.23291445',
 '10.1101/2023.07.17.549430',
 '10.1101/2023.06.20.545693',
 '10.1101/2023.07.11.548585',
 '10.1101/2023.06.23.23291827',
 '10.1101/2023.06.27.546779',
 '10.1101/2023.06.29.547130',
 '10.1101/2023.06.28.546960',
 '10.1101/2023.07.11.548555',
 '10.1101/2023.07.25.550584',
 '10.1101/2023.06.23.546265',
 '10.1101/2023.07.07.23292352',
 '10.1101/2023.06.27.546753',
 '10.1101/2023.06.30.546935',
 '10.1101/2023.07.10.548260',
 '10.1101/2023.06.28.546833',
 '10.1101/2023.07.21.549987',
 '10.1101/2023.07.06.547908']
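For anyone wanting to reproduce a list like the one above, here is a minimal sketch of building the request URL from the report and filtering the response by DOI prefix. The helper names are hypothetical, and the `recommendedPapers` key is an assumption about the response shape that should be checked against the API documentation before relying on it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the URL in the report above.
BASE = "https://api.semanticscholar.org/recommendations/v1/papers/forpaper/"


def recommendations_url(doi, limit=500, fields="externalIds"):
    """Build the single-paper recommendations URL for a DOI."""
    # Keep "/" unescaped so the URL looks like the one in the report.
    query = urlencode({"fields": fields, "limit": limit})
    return f"{BASE}DOI:{quote(doi, safe='/')}?{query}"


def dois_with_prefix(payload, prefix="10.1101"):
    """Return the DOIs in a response payload that start with `prefix/`."""
    dois = []
    # Assumption: recommendations are listed under "recommendedPapers".
    for paper in payload.get("recommendedPapers", []):
        doi = (paper.get("externalIds") or {}).get("DOI")
        if doi and doi.startswith(prefix + "/"):
            dois.append(doi)
    return dois


def fetch_recommended_dois(doi, prefix="10.1101"):
    """Call the API (network access required) and filter the result."""
    with urlopen(recommendations_url(doi)) as response:
        return dois_with_prefix(json.load(response), prefix)
```

Calling `fetch_recommended_dois("10.1101/2023.06.26.546135")` would issue the same request as the URL in the report; the result depends on when it is run, since the pool of recent papers changes over time.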

Some of the DOIs that I found via vector search that are not in the above list:

['10.1101/2023.07.27.23293177',
 '10.1101/2023.06.14.544834',
 '10.1101/2023.06.14.544911',
 '10.1101/2023.08.01.551588',
 '10.1101/2023.07.31.551349',
 '10.1101/2023.06.22.546204',
 '10.1101/2023.06.07.23291074',
 '10.1101/2023.07.13.548931',
 '10.1101/2023.07.18.549515',
 '10.1101/2023.07.28.551016']

At least the top five of those are currently available via S2.

10.1101/2023.07.27.23293177, for example, was published 2023-07-31 (not too recent). 10.1101/2023.06.14.544834 was published 2023-06-15 (54 days before today).
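The date arithmetic can be checked with a short sketch. The value used for "today" (2023-08-08) is an assumption, back-computed from "54 days before today"; the inclusive 60-day rule in `within_window` is also an assumption about how the window is applied, not something confirmed in this thread.

```python
from datetime import date


def age_in_days(publication_date, today):
    """Days between an ISO publication date and `today`."""
    return (today - date.fromisoformat(publication_date)).days


def within_window(publication_date, today, window_days=60):
    """True if the paper is at most `window_days` old (inclusive; assumed)."""
    return 0 <= age_in_days(publication_date, today) <= window_days


# Assumption: "today" inferred from the "54 days before today" remark.
today = date(2023, 8, 8)
examples = [
    ("10.1101/2023.07.27.23293177", "2023-07-31"),
    ("10.1101/2023.06.14.544834", "2023-06-15"),
]
for doi, published in examples:
    print(doi, age_in_days(published, today), within_window(published, today))
```

Under these assumptions both papers fall inside the window (8 and 54 days old).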

Expected behavior: I would expect the S2 recommendations to match the papers found via vector search, except where the data may have changed or a paper is on the borderline of the 60-day window.

cfiorelli commented 1 year ago

@de-code As I understand it, how you use the embeddings to generate recommendations leaves some room for your results to differ from those of the recommendations API. If the differences you're observing seem larger than that would explain, we could hop on a call and review the process together. Thank you!

de-code commented 1 year ago

@cfiorelli I am exploring different ways of using the S2 API or data, in part to work around some limitations. In this case I was only using the (SPECTER 1) embeddings provided by the S2 Graph API. The order is otherwise consistent. To me it looks like some relevant papers were simply not considered for the recommendations, even though at least some of those missing papers were published within the 60-day window (but not too recently either).
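The kind of comparison described here can be sketched as an exhaustive (brute-force) search over the embeddings, followed by a check of which neighbours are absent from the API result. This is an illustration only: the function names, the toy two-dimensional vectors, and the choice of L2 distance are all assumptions, not the setup actually used in this report.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exhaustive_neighbours(query_id, embeddings, k, distance=l2_distance):
    """Rank every other paper by distance to the query (no approximation)."""
    query = embeddings[query_id]
    scored = [
        (distance(query, vector), paper_id)
        for paper_id, vector in embeddings.items()
        if paper_id != query_id
    ]
    return [paper_id for _, paper_id in sorted(scored)[:k]]


def missing_from_recommendations(neighbours, recommended):
    """(rank, id) pairs for neighbours that the API did not return."""
    recommended = set(recommended)
    return [
        (rank, paper_id)
        for rank, paper_id in enumerate(neighbours, start=1)
        if paper_id not in recommended
    ]


# Toy data with hypothetical IDs; real embeddings have many more dimensions.
embeddings = {
    "query": [0.0, 0.0],
    "a": [1.0, 0.0],
    "b": [0.0, 2.0],
    "c": [3.0, 0.0],
}
neighbours = exhaustive_neighbours("query", embeddings, k=3)
print(missing_from_recommendations(neighbours, recommended=["a", "c"]))
```

In the toy data, paper "b" is the second-nearest neighbour but is absent from the hypothetical recommended list, which is the pattern reported in this issue.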

For an end-user use case, it means the recommendations seem to miss some relevant papers (possibly including papers that should have been ranked quite highly).

I'd be more than happy to hop on a call. (I am in the UK time zone.)

yoganandc commented 1 year ago

Hi!

It's been over 60 days since the example you provided initially, so it's difficult for me to recreate this scenario. I could use some more information. Specifically:

I do want to note that differences shouldn't be surprising given that it's unlikely your setup matches our internal setup for recommendations.

de-code commented 1 year ago

Hi,

I've run the comparison again. This time there aren't many missing papers for the same DOI (10.1101/2023.06.26.546135).

The only paper that appears to be missing is: 10.1101/2023.09.06.556476

Via https://api.semanticscholar.org/graph/v1/paper/DOI:10.1101/2023.09.06.556476?fields=externalIds,venue,publicationVenue,publicationTypes,publicationDate I can see it still is in S2:

{
  "paperId": "73789e2c6aa78e1c2b85fc1f4051577a487f66a2",
  "externalIds": {
    "DOI": "10.1101/2023.09.06.556476",
    "CorpusId": 261661823
  },
  "publicationVenue": {
    "id": "027ffd21-ebb0-4af8-baf5-911124292fd0",
    "name": "bioRxiv",
    "type": "journal",
    "url": "http://biorxiv.org/"
  },
  "venue": "bioRxiv",
  "publicationTypes": null,
  "publicationDate": "2023-09-07"
}

Based on the embedding vectors, that preprint ranks 7th:

['10.1101/2023.08.24.554055',
 '10.1101/2023.09.12.557484',
 '10.1101/2023.09.05.556422',
 '10.1101/2023.09.08.556871',
 '10.1101/2023.09.15.557832',
 '10.1101/2023.10.05.561142',
 '10.1101/2023.09.06.556476',
 '10.1101/2023.08.28.555037',
 '10.1101/2023.08.15.23294128',
 '10.1101/2023.10.02.560422',
 '10.1101/2023.08.29.555368',
 '10.1101/2023.09.05.556397',
 '10.1101/2023.09.21.558743',
 '10.1101/2023.09.15.558006',
 '10.1101/2023.08.18.553846',
 '10.1101/2023.09.05.556399',
 '10.1101/2023.08.14.553289',
...

I could compare related papers for other DOIs if that would be useful.

cmwilhelm commented 1 year ago

Hi @de-code,

Thanks for following up. For single-paper recommendation we use an approximate K-nearest neighbor index. We have verified the paper above (10.1101/2023.09.06.556476) is present in the index, and thus eligible for recommendation, but is not being recommended for your probe paper (10.1101/2023.06.26.546135).

This could be because the algorithm is only approximate rather than exhaustive (a necessity in order to provide this service at low latency and scale), or perhaps because you are using a different distance calculation (our index uses L2, so you may see a discrepancy if you are using cosine distance).
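To illustrate the distance point with a toy case: L2 and cosine distance can rank the same candidates differently when vectors are not unit length. For unit vectors the two agree, because the squared L2 distance equals 2 - 2·cos(a, b). The vectors and names below are made up for illustration, and nothing here says whether the embeddings in the index are normalised.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def normalise(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.hypot(*vector)
    return [x / norm for x in vector]


def rank(query, candidates, distance):
    """Candidate names ordered from nearest to farthest."""
    return sorted(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical 2-d vectors chosen so that the two metrics disagree.
query = [1.0, 0.0]
candidates = {
    "same_direction_far": [3.0, 0.0],
    "nearby_off_axis": [1.0, 0.5],
}

l2_order = rank(query, candidates, l2_distance)
cosine_order = rank(query, candidates, cosine_distance)

# After normalising everything, L2 gives the same order as cosine.
unit_candidates = {name: normalise(v) for name, v in candidates.items()}
unit_l2_order = rank(normalise(query), unit_candidates, l2_distance)

print(l2_order, cosine_order, unit_l2_order)
```

Here L2 prefers the nearby off-axis vector, while cosine (and L2 on normalised vectors) prefers the vector pointing in the same direction as the query. Normalising removes this particular source of disagreement, but not the one caused by approximate search.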

Please also note that our index only includes papers for which S2 has abstract text. Although we provide embeddings for papers with unknown abstracts, these are of lower quality, and we do not use them in this particular application.
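When comparing a local vector search against the API, one way to account for this is to drop candidates without an abstract before ranking. The sketch below assumes each record is a dict fetched with the `abstract` field requested; the helper names are hypothetical, and a missing `abstract` in a public API response may not exactly mirror what the internal index contains.

```python
def has_abstract(paper):
    """True if the record carries a non-empty abstract string."""
    abstract = paper.get("abstract")
    return bool(abstract and abstract.strip())


def eligible_ids(papers):
    """IDs of papers that would plausibly be in an abstract-only index."""
    return [paper["paperId"] for paper in papers if has_abstract(paper)]


# Hypothetical records for illustration.
papers = [
    {"paperId": "p1", "abstract": "We study ..."},
    {"paperId": "p2", "abstract": None},
    {"paperId": "p3", "abstract": "   "},
    {"paperId": "p4"},
]
print(eligible_ids(papers))
```

Only "p1" survives the filter in this toy input.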

Best, Chris

de-code commented 1 year ago

Thank you for looking into it and explaining it.