ezpaarse-project / ezpaarse-platforms

Platforms parsers, scrapers and PKBs for ezPAARSE

Proquest redirects preventing access events from being registered #666

Open chryslovelace opened 1 year ago

chryslovelace commented 1 year ago

We have an issue where access events for Proquest are not being registered due to their use of redirects. Here are some snippets of sessions that demonstrate this issue:

RoEEmPk8iBS5zWn [29/Nov/2021:22:41:55 -0500] "GET https://www.proquest.com:443/docview/1475116173?pq-origsite=primo&accountid=14709 HTTP/1.1" 302 0
RoEEmPk8iBS5zWn [29/Nov/2021:22:41:56 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
RoEEmPk8iBS5zWn [29/Nov/2021:22:41:56 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1798

rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/docview/1819126361?pq-origsite=primo HTTP/1.1" 302 0
rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
rVw0khgPBiBPJeX [29/Nov/2021:08:38:53 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1782

KnObbZigVCjacdA [27/Nov/2021:23:50:49 -0500] "GET https://www.proquest.com:443/docview/1295901959?pq-origsite=primo&accountid=14709 HTTP/1.1" 302 0
KnObbZigVCjacdA [27/Nov/2021:23:50:50 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy HTTP/1.1" 302 0
KnObbZigVCjacdA [27/Nov/2021:23:50:50 -0500] "GET https://www.proquest.com:443/intermediateredirectforezproxy/advanced HTTP/1.1" 200 1798

The URL in the first line of each session includes the document ID, and proquest/parser.js looks like it should match this URL format, but these lines are presumably being ignored because of the 302 status and/or the empty response body. The actual content is delivered by the third request, but by then the ID is no longer in the URL, so it cannot be extracted and the access event can't be properly registered.
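To illustrate where the identifier lives, here is a minimal sketch of pulling the ID out of a `/docview/` URL. The function name `extractDocviewId` is hypothetical and this is not the actual logic of proquest/parser.js; it only shows that the first request carries the ID while the final one does not.

```javascript
// Hypothetical helper for illustration; not taken from proquest/parser.js.
// Returns the numeric document ID from a ProQuest /docview/ URL, or null.
function extractDocviewId(rawUrl) {
  // The WHATWG URL parser separates the path from the query string.
  const { pathname } = new URL(rawUrl);
  const match = /^\/docview\/(\d+)\/?$/.exec(pathname);
  return match ? match[1] : null;
}

// First request of a session: the ID is present.
const first = extractDocviewId(
  'https://www.proquest.com:443/docview/1475116173?pq-origsite=primo&accountid=14709'
);

// Third request of the same session: no ID left to extract.
const third = extractDocviewId(
  'https://www.proquest.com:443/intermediateredirectforezproxy/advanced'
);
```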

In some previous correspondence, our organization asked whether multiple lines could be combined to determine an access event, and the response was that this was not possible. Is that still the case given this issue? If combining lines is still not possible, is there a way for the initial request here to count as the access event, so that those identifiers can be extracted?
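For clarity, this is roughly what "combining multiple lines" would mean for the sessions above. It is a sketch of a separate pre-processing pass, not something ezPAARSE does: the function name `resolveRedirectedAccesses` and the entry shape `{ session, url, status }` are assumptions made for the example.

```javascript
// Sketch only: a pass over already-parsed log entries, outside ezPAARSE.
// Each entry is assumed to look like { session, url, status }.
function resolveRedirectedAccesses(entries) {
  // session -> document ID seen on a redirected (302) docview request
  const pending = new Map();
  const events = [];

  for (const { session, url, status } of entries) {
    const { pathname } = new URL(url);
    const docview = /^\/docview\/(\d+)/.exec(pathname);

    if (docview && status === 302) {
      // Remember the ID until the redirect chain completes.
      pending.set(session, docview[1]);
    } else if (
      status === 200 &&
      pathname.startsWith('/intermediateredirectforezproxy') &&
      pending.has(session)
    ) {
      // The content was delivered: attach the remembered ID.
      events.push({ session, docId: pending.get(session) });
      pending.delete(session);
    }
    // Intermediate 302 lines fall through and leave the pending ID in place.
  }
  return events;
}

const sample = [
  { session: 'RoEEmPk8iBS5zWn', status: 302,
    url: 'https://www.proquest.com:443/docview/1475116173?pq-origsite=primo&accountid=14709' },
  { session: 'RoEEmPk8iBS5zWn', status: 302,
    url: 'https://www.proquest.com:443/intermediateredirectforezproxy' },
  { session: 'RoEEmPk8iBS5zWn', status: 200,
    url: 'https://www.proquest.com:443/intermediateredirectforezproxy/advanced' },
];
const resolved = resolveRedirectedAccesses(sample);
```

Keeping one pending ID per session is the simplest possible state; it would mis-attribute events if a user opened two documents at once in the same session.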

tporquet commented 1 year ago

Hello, and sorry for the delay in our response. Those lines are indeed ignored by ezPAARSE by default. You could set up ezPAARSE not to ignore lines with a 302 status, but that is a global parameter; see: https://ezpaarse-project.github.io/ezpaarse/configuration/parametres.html#ezpaarse-filter-status We are thinking about allowing this on a per-parser basis (instead of as a global parameter) to keep the processing load as low as possible (in a typical log file, we filter out 90-95% of the log lines).

tporquet commented 1 year ago

As for your second question about combining multiple lines to determine an access event, which is obviously linked to the 302 situation, we are also thinking about either:

NB: The only use case where we keep a memory of a previous access event is the COUNTER deduplication algorithm, where we filter out access events if the same resource is accessed by the same user session or user ID within a short timespan (10 to 30 seconds, depending on the resource format).
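The kind of time-window deduplication described in the NB can be sketched as follows. This is not ezPAARSE's implementation: the function name `deduplicate`, the event shape `{ user, resource, timestamp }` and the single `windowMs` parameter are assumptions made for the example, and the sketch keeps the first event of each burst.

```javascript
// Sketch of time-window deduplication; not ezPAARSE's actual algorithm.
// Events are assumed sorted by timestamp (in milliseconds) and shaped as
// { user, resource, timestamp }. windowMs is a hypothetical parameter.
function deduplicate(events, windowMs) {
  // "user|resource" -> timestamp of the most recent access seen
  const lastSeen = new Map();
  const kept = [];

  for (const event of events) {
    const key = `${event.user}|${event.resource}`;
    const previous = lastSeen.get(key);

    // Keep the event only if this user has not accessed the same
    // resource within the window.
    if (previous === undefined || event.timestamp - previous > windowMs) {
      kept.push(event);
    }
    // Always record the latest access, so repeated clicks extend the window.
    lastSeen.set(key, event.timestamp);
  }
  return kept;
}

const accesses = [
  { user: 'sessionA', resource: '1475116173', timestamp: 0 },
  { user: 'sessionA', resource: '1475116173', timestamp: 5000 },  // filtered
  { user: 'sessionB', resource: '1475116173', timestamp: 6000 },  // other user
  { user: 'sessionA', resource: '1475116173', timestamp: 40000 }, // outside window
];
const unique = deduplicate(accesses, 10000);
```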