Toniiiio / sivis

Turn browser clicks into reproducible scraping code.

Data is spread across multiple requests - find correct request #13

Open Toniiiio opened 4 years ago

Toniiiio commented 4 years ago

On some pages the target data is spread over multiple pages or AJAX requests. For example, the first results may be embedded in the HTML document while additional data is loaded via AJAX, i.e. through a separate request.

In most cases the data from the HTML document should also be accessible via AJAX, so the AJAX request should be preferred.

To catch the AJAX request, the user has to select data that comes from this request. The documentation (and the readme.md on GitHub) will tell the user to do so.

In case target data is spread across multiple pages / requests:

  • select as many target values as possible
  • start on the second or a later page (first-page data might be in another request)
  • check whether first-page data can be derived from the requests that yield the 2nd, 3rd, ... pages (often the case)

Next, it has to be ensured that the AJAX request is preferred over the HTML document request (does this relation always hold?).

1) The data is (most often?) ordered so that the HTML document data appears first, followed by the AJAX request data. The later target values should therefore be prioritized when identifying the correct request.

2) Also, if multiple requests fulfil the criteria for being the correct source, AJAX requests could be prioritized.

Note that even if we apply 2), 1) could still be relevant for identifying the AJAX request.
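A rough sketch of how 1) and 2) could be combined, written in Python for illustration only (sivis itself is R, and nothing below is taken from its code). The `Request` class, its fields, and `pick_request` are hypothetical names. The assumption is that each captured request exposes its response body as text and a flag for whether it was an AJAX call, and that the target values are listed in page order.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical container for one captured request; not part of sivis.
    url: str
    body: str
    is_ajax: bool


def pick_request(requests, target_values):
    """Return the request that best matches the target values, or None.

    Heuristic 1: prefer the request containing the latest target value.
    Heuristic 2: on a tie, prefer AJAX requests over the HTML document.
    """
    def score(req):
        # Index of the latest target value found in this body (-1 if none).
        last_hit = max(
            (i for i, value in enumerate(target_values) if value in req.body),
            default=-1,
        )
        # Tuples compare element-wise, so is_ajax only breaks ties.
        return (last_hit, req.is_ajax)

    candidates = [req for req in requests if score(req)[0] >= 0]
    return max(candidates, key=score, default=None)


html_doc = Request("page.html", "<li>a</li><li>b</li>", is_ajax=False)
ajax_req = Request("api?page=2", '["a","b","c","d"]', is_ajax=True)
chosen = pick_request([html_doc, ajax_req], ["a", "b", "c", "d"])
```

Here the AJAX request wins because it contains the later values "c" and "d". Plain substring matching is a simplification; real responses would probably need parsing and normalisation first.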

Dependencies:

This issue partially conflicts with another request: cutting the number of target values due to performance issues. With too many target values, performance suffers because of functions like get_xpath_by_tag.

Potential solution

Let the user choose a maximum number of target values (n) to analyse (already implemented). However, n should be higher than 10, because 10 could be the number of values in the HTML document, and that would leave out the AJAX request data.

Then take the last(!) n target values on both sides, in R and within Chrome.
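The effect of taking the last n values instead of the first n can be illustrated with a small Python sketch. The function name `last_n_targets`, the default of 20, and the sample values are assumptions made up for this example; the real implementation would live in R and in the Chrome extension.

```python
def last_n_targets(target_values, n=20):
    """Keep only the last n target values (all of them if there are fewer)."""
    if n <= 0:
        return []
    return list(target_values[-n:])


# 10 values from the HTML document followed by 5 loaded via AJAX.
values = [f"html_{i}" for i in range(10)] + [f"ajax_{i}" for i in range(5)]

first_ten = values[:10]                 # only HTML document values
last_ten = last_n_targets(values, 10)   # 5 HTML values plus all 5 AJAX values
```

With the first 10 values, no AJAX data would remain to identify the request; with the last 10, all five AJAX values survive the cut.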