SciencerIO / sciencer-toolkit

A smarter way to find articles.
MIT License

Fix results length to match `max_papers` #15

Closed PipzCorreiaz closed 2 years ago

PipzCorreiaz commented 2 years ago

We were assuming that the results from `collect_by_terms` were unique, which does not hold when the collector finds repeated papers.

Instead of counting the entries in the response data, we now count the unique papers actually added to the resulting list of papers. This guarantees the number of results will always match the `max_papers` parameter.
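A minimal sketch of the counting change described above, not the actual diff from this PR. The names `collect_unique_papers`, `fetch_page`, and `page_size`, and the use of a `paperId` key to identify a paper, are assumptions made for illustration; only `max_papers` comes from the PR itself.

```python
from typing import Callable, Dict, List

Paper = Dict[str, str]


def collect_unique_papers(
    fetch_page: Callable[[int, int], List[Paper]],  # hypothetical page fetcher
    max_papers: int,
    page_size: int = 100,
) -> List[Paper]:
    """Collect up to max_papers unique papers from a paginated source."""
    seen_ids = set()
    papers: List[Paper] = []
    offset = 0
    # Stop on the number of unique papers kept, not on the raw response size.
    while len(papers) < max_papers:
        page = fetch_page(offset, page_size)
        if not page:
            break  # the source has no more results
        for paper in page:
            paper_id = paper["paperId"]  # assumed unique identifier
            if paper_id in seen_ids:
                continue  # skip repeated papers
            seen_ids.add(paper_id)
            papers.append(paper)
            if len(papers) == max_papers:
                break
        offset += len(page)
    return papers


# Usage with an in-memory stand-in for the API (made-up data).
def fake_fetch(offset: int, limit: int) -> List[Paper]:
    data = [{"paperId": "a"}, {"paperId": "b"}, {"paperId": "a"},
            {"paperId": "c"}, {"paperId": "d"}]
    return data[offset:offset + limit]


result = collect_unique_papers(fake_fetch, max_papers=3, page_size=2)
```

In this sketch the result can still be shorter than `max_papers` if the source runs out of unique papers before the limit is reached.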

ratuspro commented 2 years ago

Just to clarify, the requested changes apply to the provider, not the collector. The returned list of papers from get_paper_by_terms does not have duplicates.

Also, the assumption that the response from /paper/search?query=... in Semantic Scholar's API does not contain repeated papers is valid.

I tested the API and never came across a situation where it returned duplicates of the same paper. @PipzCorreiaz @RicardoEPRodrigues are you sure this actually happens?

ratuspro commented 2 years ago

Indeed, Semantic Scholar's API does return duplicate results on /paper/search?query=.

During the process I also realized that there is another constraint on the search process that needs to be taken into account (reported in #16). It should be handled in a different PR.
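One way to check a response for repeats is to count how often each identifier appears. This is a rough sketch, not the test that was actually run here; `find_duplicate_ids` is a hypothetical helper, and it assumes the search results have already been parsed into a list of dicts that each carry a `paperId` key.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_ids(papers: List[Dict[str, str]]) -> Dict[str, int]:
    """Return the ids that occur more than once, with their occurrence counts."""
    counts = Counter(paper["paperId"] for paper in papers)  # assumed id key
    return {paper_id: n for paper_id, n in counts.items() if n > 1}


# Made-up sample data standing in for one parsed search response.
sample = [{"paperId": "a"}, {"paperId": "b"}, {"paperId": "a"}]
duplicates = find_duplicate_ids(sample)
```

An empty result means no identifier was repeated in that response.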