SciencerIO / sciencer-toolkit

A smarter way to find articles.

Update CollectByTerms to reduce requests #29

Closed (PipzCorreiaz closed 2 years ago)

PipzCorreiaz commented 2 years ago

Currently, CollectByTerms performs one request to the API (get_paper_by_terms), which returns N papers matching the query. It then performs N additional requests to fetch the full information for each paper (get_paper_by_id). Although the first query accepts a parameter to choose its output fields, there is no way to retrieve the ids of references and citations through it, and I believe that is the reason for the N extra requests inside CollectByTerms.
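For reference, this is roughly the request pattern being described. The class shape and provider signatures below are simplified assumptions for illustration, not the actual sciencer-toolkit code:

```python
from typing import List


class CollectByTerms:
    """Simplified sketch of the current collector behaviour."""

    def __init__(self, terms: List[str]) -> None:
        self.terms = terms

    def execute(self, provider) -> list:
        # 1 request: search for papers matching the query terms
        hits = provider.get_paper_by_terms(self.terms)
        # N extra requests: one per hit, needed only because the search
        # output fields cannot include reference/citation ids
        return [provider.get_paper_by_id(hit.paper_id) for hit in hits]
```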

I would like to suggest, or at least discuss, not performing those N extra requests inside CollectByTerms and instead performing them only in the expanders (where they are actually needed?). I think Sciencer was designed with a pipeline in mind that starts from a small set of papers, which is then expanded and possibly filtered. However, another possibility is to start with a huge set of papers, filter it first, and only expand later. In that case the N requests take a long time, and most of those papers will probably be filtered out afterwards, so that time is wasted.
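A rough sketch of that second usage pattern, with hypothetical names (this is not Sciencer's actual pipeline API), showing where deferring the per-paper requests saves work:

```python
def run_pipeline(provider, terms, keep):
    # 1 request: collect a large batch of candidate papers
    papers = provider.get_paper_by_terms(terms)
    # 0 requests: filter on fields already returned by the search
    kept = [p for p in papers if keep(p)]
    # Extra requests only for the papers that survived filtering,
    # e.g. performed by the expanders or via lazy loading
    return [provider.get_paper_by_id(p.paper_id) for p in kept]
```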

ratuspro commented 2 years ago

Indeed, Sciencer was designed to expand a small set of papers. However, as you suggested, developers can use it to filter a large set of papers without any expansion at all. In the latter case, performing at least one request per collected paper is a huge drawback.

To reduce the load in such cases, we can delay fetching the citation and reference information until it is actually requested. This will require adding lazy loading to the paper model.
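A minimal sketch of what that lazy loading could look like, assuming a provider object exposing the get_paper_by_id call mentioned above. Attribute and method names here are illustrative, not the implementation that ends up in the toolkit:

```python
from typing import List, Optional


class Paper:
    def __init__(self, paper_id: str, provider) -> None:
        self.paper_id = paper_id
        self._provider = provider
        self._references: Optional[List[str]] = None
        self._citations: Optional[List[str]] = None

    def _fetch_details(self) -> None:
        # Single extra request, made only the first time references or
        # citations are actually needed (assumes a dict-like response)
        details = self._provider.get_paper_by_id(self.paper_id)
        self._references = details.get("references", [])
        self._citations = details.get("citations", [])

    @property
    def references(self) -> List[str]:
        if self._references is None:
            self._fetch_details()
        return self._references

    @property
    def citations(self) -> List[str]:
        if self._citations is None:
            self._fetch_details()
        return self._citations
```

With this approach, papers that are filtered out before any expander touches their references or citations never trigger the extra request.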

ratuspro commented 2 years ago

This issue is being addressed in #30.

ratuspro commented 2 years ago

Closed by #30.