ezpaarse-project / ezpaarse-platforms

Platforms parsers, scrapers and PKBs for ezPAARSE

OCLC FirstSearch parser unexpectedly populates result.login #381

Closed ctgraham closed 3 years ago

ctgraham commented 3 years ago

The OCLC FirstSearch parser sets result.login in two places:

https://github.com/ezpaarse-project/ezpaarse-platforms/blob/1b1b4f018fe55b793de6127620d75f652113eb5f/oclc-fs/parser.js#L30
https://github.com/ezpaarse-project/ezpaarse-platforms/blob/1b1b4f018fe55b793de6127620d75f652113eb5f/oclc-fs/parser.js#L144

No other parser was observed to do this.

Assigning result.login overwrites the user's actual login whenever the log already supplies one.
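The overwrite can be illustrated with a minimal sketch. It assumes that the fields a platform parser returns are copied onto the event built from the log line; `mergeParserResult` is a hypothetical stand-in for that step, not ezPAARSE's actual code, and the two parser results are invented for illustration using values from the log below.

```javascript
// Hypothetical stand-in for the merge step (an assumption, not ezPAARSE code):
// fields returned by a platform parser are copied onto the log-derived event.
function mergeParserResult(logEvent, parserResult) {
  return Object.assign({}, logEvent, parserResult);
}

// Fields taken from the log line via the %h, %u and %{ezproxy-session}i tokens.
const logEvent = {
  host: '192.168.0.1',
  login: 'USERNAME@pitt.edu',
  'ezproxy-session': 'wZOoMiUxUCi1pBa'
};

// Illustrative parser result that stores the FirstSearch session ID under "login".
const overwritingResult = {
  rtype: 'SEARCH',
  mime: 'HTML',
  login: 'fsapp1-42029-k8dgm19o-rahj6v'
};

// Illustrative parser result that leaves "login" untouched.
const preservingResult = { rtype: 'SEARCH', mime: 'HTML' };

const overwritten = mergeParserResult(logEvent, overwritingResult);
const preserved = mergeParserResult(logEvent, preservingResult);

console.log(overwritten.login); // the session ID replaces the user's login
console.log(preserved.login);   // the login from the log line survives
```

Under this assumption, any key the parser sets that collides with a log-derived field wins, which is why a parser should avoid writing to names such as login.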

Consider test.log as:

192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:42 -0400] "GET https://pitt.idm.oclc.org:8443/connect?session=swZOoMiUxUCi1pBa&qurl=http%3a%2f%2ffirstsearch.oclc.org%2ffsip%3f%26dbname%3dWorldCat%26done%3dreferer HTTP/1.1" 302 0
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:42 -0400] "GET http://firstsearch.oclc.org:80/fsip?&dbname=WorldCat&done=referer HTTP/1.1" 301 0
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/fsip?&dbname=WorldCat&done=referer HTTP/1.1" 302 611
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/html/webscript.html:%3Asessionid=fsapp1-42029-k8dgm19o-rahj6v:sessionid=fsapp1-42029-k8dgm19o-rahj6v: HTTP/1.1" 200 27944
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/html/print.css HTTP/1.1" 200 217
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/css/common.css HTTP/1.1" 200 1215
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/javascript/misc.js HTTP/1.1" 200 1030
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/javascript/calendar.js HTTP/1.1" 200 30447
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://fonts.googleapis.com:443/css?family=Roboto HTTP/1.1" 200 2962
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/images/fs2x2.gif HTTP/1.1" 200 187
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/images/fs_info.gif HTTP/1.1" 200 189
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:43 -0400] "GET https://firstsearch.oclc.org:443/WebZ/FSPrefs?entityjsdetect=:javascript=true:screensize=large:sessionid=fsapp1-42029-k8dgm19o-rahj6v:entitypagenum=1:0 HTTP/1.1" 200 38920
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/fs2x2.gif HTTP/1.1" 200 230
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/fs_info.gif HTTP/1.1" 200 1535
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/nfs_news.gif HTTP/1.1" 200 495
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/ar16x40.gif HTTP/1.1" 200 1094
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/fs_helpsmall.gif HTTP/1.1" 200 2231
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/nfs_help.gif HTTP/1.1" 200 2231
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/fs_infosmall.gif HTTP/1.1" 200 1535
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/worldcat_72x22.gif HTTP/1.1" 200 1740
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/ja16x45.gif HTTP/1.1" 200 1129
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/ko16x47.gif HTTP/1.1" 200 361
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/zh16x79.gif HTTP/1.1" 200 1208
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/zs16x79.gif HTTP/1.1" 200 1199
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/images/oclc_logo67x36.gif HTTP/1.1" 200 1788
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:44 -0400] "GET https://firstsearch.oclc.org:443/favicon.ico HTTP/1.1" 404 332
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:53 -0400] "POST https://firstsearch.oclc.org:443/WebZ/FSQUERY?format=BI:next=html/records.html:bad=html/records.html:numrecs=10:sessionid=fsapp1-42029-k8dgm19o-rahj6v:entitypagenum=2:0:searchtype=basic HTTP/1.1" 200 66621
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_sort.gif HTTP/1.1" 200 817
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_relatedsubjects.gif HTTP/1.1" 200 694
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_relatedauthors.gif HTTP/1.1" 200 1931
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_narrow.gif HTTP/1.1" 200 891
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/nfs_email.gif HTTP/1.1" 200 1378
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_print.gif HTTP/1.1" 200 954
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_export.gif HTTP/1.1" 200 1194
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/nfs_prev.gif HTTP/1.1" 200 1112
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/nfs_next.gif HTTP/1.1" 200 1094
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/icon-bks24.gif HTTP/1.1" 200 1673
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_getit.gif HTTP/1.1" 200 353
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/fs_libowns.gif HTTP/1.1" 200 857
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/icon-url24.gif HTTP/1.1" 200 1591
192.168.0.1 USERNAME@pitt.edu wZOoMiUxUCi1pBa [29/Mar/2020:15:51:54 -0400] "GET https://firstsearch.oclc.org:443/images/icon-com.gif HTTP/1.1" 200 985

processed via:

curl -s -X POST --no-buffer -H 'Reject-Files: all' -H 'Crypted-Fields: none' -H 'Log-Format-ezproxy: %h %u %{ezproxy-session}i %t "%r" %s %b' -H 'Date-Format: DD/MMM/YYYY:HH:mm:ss Z' -H 'Connection: keep-alive' --data-binary @/tmp/test.log http://localhost:59599 -o /tmp/test.results -D /tmp/test.headers

processes to test.results with a login of "fsapp1-42029-k8dgm19o-rahj6v":

datetime;date;login;platform;platform_name;publisher_name;rtype;mime;print_identifier;online_identifier;title_id;doi;publication_title;publication_date;unitid;domain;on_campus;log_id;ezpaarse_version;ezpaarse_date;middlewares_version;middlewares_date;platforms_version;platforms_date;middlewares;title;type;subject;geoip-country;geoip-latitude;geoip-longitude;host;ezproxy-session;url;status;size
2020-03-29T19:51:53+00:00;2020-03-29;fsapp1-42029-k8dgm19o-rahj6v;oclc-fs;OCLC Firstsearch;;SEARCH;HTML;;;;;;;;firstsearch.oclc.org;Y;00d54a84c9b665a46704570b1f8d366afc77f001;;;6e43d8b;2021-04-13;909d570;2021-04-14;filter, parser, deduplicator, istex, crossref, sudoc, hal, enhancer, geolocalizer, cut, on-campus-counter, qualifier, anonymizer;;;;;;;192.168.0.1;wZOoMiUxUCi1pBa;https://firstsearch.oclc.org:443/WebZ/FSQUERY?format=BI:next=html/records.html:bad=html/records.html:numrecs=10:sessionid=fsapp1-42029-k8dgm19o-rahj6v:entitypagenum=2:0:searchtype=basic;200;66621

when the expected login would be "USERNAME@pitt.edu":

datetime;date;login;platform;platform_name;publisher_name;rtype;mime;print_identifier;online_identifier;title_id;doi;publication_title;publication_date;unitid;domain;on_campus;log_id;ezpaarse_version;ezpaarse_date;middlewares_version;middlewares_date;platforms_version;platforms_date;middlewares;title;type;subject;geoip-country;geoip-latitude;geoip-longitude;host;ezproxy-session;url;status;size
2020-03-29T19:51:53+00:00;2020-03-29;USERNAME@pitt.edu;oclc-fs;OCLC Firstsearch;;SEARCH;HTML;;;;;;;;firstsearch.oclc.org;Y;00d54a84c9b665a46704570b1f8d366afc77f001;;;6e43d8b;2021-04-13;909d570;2021-04-14;filter, parser, deduplicator, istex, crossref, sudoc, hal, enhancer, geolocalizer, cut, on-campus-counter, qualifier, anonymizer;;;;;;;192.168.0.1;wZOoMiUxUCi1pBa;https://firstsearch.oclc.org:443/WebZ/FSQUERY?format=BI:next=html/records.html:bad=html/records.html:numrecs=10:sessionid=fsapp1-42029-k8dgm19o-rahj6v:entitypagenum=2:0:searchtype=basic;200;66621
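
The two outputs differ only in the login column. A small sketch for checking that column in a results file follows; `loginColumn` is a hypothetical helper, the sample rows are abbreviated to the first four columns of the output above, and the split on ";" assumes no quoted field contains a semicolon.

```javascript
// Hypothetical helper: return the "login" value of every data row in a
// semicolon-separated ezPAARSE result. Assumes no quoted field contains ";".
function loginColumn(csvText) {
  const [headerLine, ...rows] = csvText.trim().split('\n');
  const index = headerLine.split(';').indexOf('login');
  if (index === -1) {
    throw new Error('no login column in header');
  }
  return rows.map((row) => row.split(';')[index]);
}

// Abbreviated copies of the actual and expected results (first four columns only).
const actualResults = [
  'datetime;date;login;platform',
  '2020-03-29T19:51:53+00:00;2020-03-29;fsapp1-42029-k8dgm19o-rahj6v;oclc-fs'
].join('\n');

const expectedResults = [
  'datetime;date;login;platform',
  '2020-03-29T19:51:53+00:00;2020-03-29;USERNAME@pitt.edu;oclc-fs'
].join('\n');

const actualLogins = loginColumn(actualResults);
const expectedLogins = loginColumn(expectedResults);

console.log(actualLogins);
console.log(expectedLogins);
```

Comparing the two arrays row by row shows whether the parser has replaced the login supplied by the log.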

librarywebchic commented 3 years ago

Fixed in https://github.com/ezpaarse-project/ezpaarse-platforms/pull/387

ctgraham commented 3 years ago

Verified resolved in 3.6.5.