D61-IA / stellar-gnosis

Gnosis paper management and collaboration tool
Apache License 2.0

Pagination for search results is broken #80

Closed PantelisElinas closed 4 years ago

PantelisElinas commented 4 years ago

Pagination for the index pages works correctly. However, pagination for search results is broken in the following way.

Let the number of items per page be set to 3, and suppose a search, e.g. searching for papers, returns 9 results. We then have 3 pages of results numbered 1, 2, and 3.

The first page is correctly rendered showing the first 3 results returned for the user's search query.

If we click on page 2 or 3, or the Next button, we expect to see the remaining results. However, viewing another page issues a GET request to the corresponding view, e.g. the papers index view, so the results shown are not filtered by the query but are instead the top N results overall; for example, the papers() view returns

```python
Paper.objects.filter(is_public=True).order_by("-created_at")[:100]
```

The above bug affects the display of search results for all asset types. Clearly, pagination for search results does not work correctly!
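For concreteness, the failing pattern looks roughly like this. This is a hedged sketch, not the exact stellar-gnosis code; the field lookup (title__icontains), form field name, and template path are assumptions:

```python
# Sketch of the failing pattern; model fields and template names are
# assumptions, not the exact stellar-gnosis code.
from django.core.paginator import Paginator
from django.shortcuts import render

from .models import Paper  # assumed location of the Paper model


def papers(request):
    if request.method == "POST":
        # The search form POSTs here, so the first page is filtered.
        keywords = request.POST.get("keywords", "")
        results = Paper.objects.filter(title__icontains=keywords, is_public=True)
    else:
        # The pagination links are plain GETs to this same view, so pages
        # 2, 3, ... are drawn from the unfiltered top-100 queryset instead.
        results = Paper.objects.filter(is_public=True).order_by("-created_at")[:100]

    page = Paginator(results, 3).get_page(request.GET.get("page"))
    return render(request, "papers.html", {"papers": page})
```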

We can fix this using one of the following methods.

  1. Render results using a different template that does not use pagination. For example, searching papers renders paper_results.html instead of papers.html. We can remove pagination from the former and return a maximum number of results, e.g., 25. If more than 25 results are found, display a message informing the user and asking them to be more specific, e.g., "We found 34 results matching your query and are showing 25. Please narrow your search." (A sketch of this approach follows the list below.)
  2. Move pagination to the client. Return all the results found, i.e., render the template with all results and no pagination, but use JS on the client side to display the results with pagination. Maybe https://pagination.js.org/ would be helpful for the latter. Note that this might result in transferring a lot of data to the client for queries with many hits!
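A minimal sketch of option 1, assuming a search view and template along these lines (the view name, field lookup, and form field name are illustrative):

```python
# Sketch of option 1: cap the results, skip pagination, and ask the user
# to narrow the search. Names and field lookups are illustrative.
from django.shortcuts import render

from .models import Paper  # assumed location of the Paper model

MAX_RESULTS = 25


def paper_search(request):
    keywords = request.POST.get("keywords", "")
    results = Paper.objects.filter(title__icontains=keywords, is_public=True)
    total = results.count()
    message = ""
    if total > MAX_RESULTS:
        message = (
            f"We found {total} results matching your query and are showing "
            f"{MAX_RESULTS}. Please narrow your search."
        )
    return render(
        request,
        "paper_results.html",
        {"papers": results[:MAX_RESULTS], "message": message},
    )
```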

I will implement option 1 as a temporary fix. @Zhenghao-Zhao, you should consider the viability of option 2 while working on universal search.

Zhenghao-Zhao commented 4 years ago

The root cause of this issue is that the URL for returning the search results is the same as the GET URL for fetching all results, so pagination always returns pages of all results. Another solution, then, is to simply use a different URL (e.g., one carrying the query string) for each search.

PantelisElinas commented 4 years ago

@Zhenghao-Zhao are you suggesting we pass the query string in the URL and then search again in order to return the results for the n-th page? If so, wouldn't that have to be a POST instead of the GET the pagination module uses now? We probably should not be accepting GET requests that access the database, as we don't know whether the data passed is safe/valid.

Zhenghao-Zhao commented 4 years ago

No. I am suggesting we use a different URL for the pagination of each search page. For example, when you hover over page 2 of the pagination, you should see a URL of the form search/keywords/?page=2, where keywords is whatever was entered in the search form; we can use simple jQuery to map the form input to the URL. Then, when search results are returned, all of their pagination links have this form and are unique to the query string.

I'll explain how pagination works, just in case we understood it differently: when you first search for something, it uses a POST request and returns the results along with pagination controls consisting of buttons (1, 2, ... etc.). Each of those buttons is merely an anchor tag, i.e., a GET request, whose URL is built from the URL of the search page. So when you click one of those buttons, it sends a GET request to that URL (with a trailing ?page=N). Because the view function simply returns everything whenever the request is not a POST, pagination then proceeds to return the Nth page of everything. And this is why we had this issue.

If we include the keywords as part of the URL, each button then sends its GET request to a URL unique to that search query string, e.g., search/keywords/?page=N, so it correctly returns the Nth page of the results for that query string, provided we handle it correctly in the view function.
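A hedged sketch of this in Django (the URL layout and names are illustrative, not a final design): with the keywords in the path, every pagination anchor of the form search/keywords/?page=N carries the query, so the view can re-filter before paginating.

```python
# Sketch of keyword-in-URL pagination; URL layout and names are illustrative.
from django.core.paginator import Paginator
from django.shortcuts import render
from django.urls import path

from .models import Paper  # assumed location of the Paper model


def paper_search(request, keywords):
    # Each pagination anchor points back at search/<keywords>/?page=N,
    # so the GET for page N still knows which query to filter by.
    results = Paper.objects.filter(
        title__icontains=keywords, is_public=True
    ).order_by("-created_at")
    page = Paginator(results, 3).get_page(request.GET.get("page"))
    return render(
        request, "paper_results.html", {"papers": page, "keywords": keywords}
    )


urlpatterns = [
    path("search/<str:keywords>/", paper_search, name="paper-search"),
]
```

The template's page buttons then only need to append ?page=N to the current path, which is what makes each set of pagination links unique to its query string.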

This is commonly deployed on many sites that use pagination, including GitHub, e.g., if you go to Issues. I am not sure I understood your concern about POST/GET requests, but this is a built-in feature of pagination (i.e., a GET request for each page of results); there is not much we can do about it unless we stop using pagination altogether.

Zhenghao-Zhao commented 4 years ago

I will provide a fix once I get back. The Internet here is very dodgy because of the Great Firewall. Some common sites, including Stack Overflow, any Google-related sites, and even some CDNs, do not work. I tried VPNs but they are too slow to even load a single page.

PantelisElinas commented 4 years ago

@Zhenghao-Zhao let's discuss when you are back. We don't need this working this week.

Regards,

P.

PantelisElinas commented 4 years ago

Hi @Zhenghao-Zhao

the current develop branch, as of Sunday Jan 19th, has an implementation of the temporary fix I suggested above. Search results are limited to a maximum of 25, and if more than 25 results are available, the user is informed and asked to narrow her search. There have also been some small updates to the Models to accommodate bulk uploading of arXiv data (including scripts for doing this).

On the latter, I have already downloaded metadata for over 1 million papers using the OAI interface. It takes a few hours. If you need the data, I can provide the CSV file when I finish processing it. Best not to try to download it again so we are not abusing their server.

Regards,

P.

Zhenghao-Zhao commented 4 years ago

Hi Pantelis,

I think using a GET request for searching will be safe on the server side. This is because, in the process of retrieving search results, all we do is look at the keywords value in the GET request and retrieve any items containing those keywords. There are no side effects in this process. Also, we are not executing those keywords, so problems such as SQL injection will not happen here.
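On the SQL injection point specifically: the Django ORM passes values given to filter() to the database as bound query parameters rather than splicing them into the SQL string, so a sketch like the following (names are illustrative) treats even hostile input as plain data:

```python
# The keywords value reaches the database as a bound query parameter,
# so even a hostile string is matched as literal text, never executed.
from .models import Paper  # assumed location of the Paper model


def matching_papers(request):
    keywords = request.GET.get("keywords", "")
    # e.g. keywords = "'; DROP TABLE papers; --" is just an improbable
    # title substring here, not SQL.
    return Paper.objects.filter(title__icontains=keywords, is_public=True)
```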

The only security problem with GET in our case that I can think of is that it exposes the keywords in the URL. But again, this does not affect us on the server.

I have not seen a website that does not use GET requests for search. If you could point out your specific concern, I can look into it more carefully.

Cheers.