CottageLabs / LanternPM

Lantern meta repository for product management

Upcoming changes to EPMC search, may affect us #124

Closed emanuil-tolev closed 7 years ago

emanuil-tolev commented 7 years ago

Dear Europe PMC Web Service Users,

We have deployed the new web services (SOAP and RESTful) version 4.5.3 to our test environment. Version 4.5.2 will remain on our production server until the 14th October, when we will deploy 4.5.3 there. Please send us any concerns, or let us know if you need more time to adapt to the changes. Usually, just switching to the version-specific service should give you all the time you need. The relevant URLs for the test and production versions are below:

SOAP WSDL:
Current production: http://www.ebi.ac.uk/europepmc/webservices/soap?wsdl
Test version: http://www.ebi.ac.uk/europepmc/webservices/test/soap?wsdl
Version-specific reference (to current production or new test version as appropriate): http://www.ebi.ac.uk/europepmc/webservices/ver4.5.3/soap?wsdl or http://www.ebi.ac.uk/europepmc/webservices/ver4.5.2/soap?wsdl

REST URL Syntax Examples:
Current production: http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=0000-0002-1767-9318
Test version: http://www.ebi.ac.uk/europepmc/webservices/test/rest/search?query=0000-0002-1767-9318
Version-specific reference (to current production or new test version as appropriate): http://www.ebi.ac.uk/europepmc/webservices/ver4.5.3/rest/search?query=0000-0002-1767-9318 or http://www.ebi.ac.uk/europepmc/webservices/ver4.5.2/rest/search?query=0000-0002-1767-9318

The major change is the replacement of the request parameter “page” or “offset” with “cursorMark” for the SOAP method “searchPublications” and the RESTful method “search”. Since the switch from Lucene to Solr, the Europe PMC index is updated continuously, which can cause inconsistent ordering while paging through the result list. Additionally, using the “offset” / “page” request parameter for deep paging can cause performance issues. The background is explained in this very good article: https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
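To illustrate why offset paging misbehaves when the index changes between requests, here is a toy model in Python. It only demonstrates the general effect; it is not how Solr or Europe PMC implement paging, and the integer "index" and both function names are invented for this sketch.

```python
def page_by_offset(items, offset, size):
    """Offset paging: skip `offset` rows of the index as it is sorted right now."""
    return sorted(items)[offset:offset + size]


def page_by_cursor(items, after, size):
    """Cursor paging: return rows that sort strictly after the last row seen."""
    remaining = [x for x in sorted(items) if after is None or x > after]
    return remaining[:size]


index = [10, 20, 30, 40]
first_offset = page_by_offset(index, 0, 2)      # [10, 20]
first_cursor = page_by_cursor(index, None, 2)   # [10, 20]

# The index is updated between the two requests; the new record sorts first.
index.append(5)

second_offset = page_by_offset(index, 2, 2)                 # [20, 30]: 20 is repeated
second_cursor = page_by_cursor(index, first_cursor[-1], 2)  # [30, 40]: no repeat
```

In this model the offset-based second page repeats a record that was already delivered, while the cursor-based page continues from the last record seen.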

For the SOAP “searchPublications” method, the parameter “cursorMark” is deliberately mandatory. All clients have to be changed to adapt to this new way of paginating requests.

Here is an example of the new request format for the SOAP web service, generated with SoapUI:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservice.cdb.ebi.ac.uk/">
   <soapenv:Header/>
   <soapenv:Body>
      <web:searchPublications>
         <!--Optional:-->
         <queryString>p51</queryString>
         <!--Optional:-->
         <resultType>lite</resultType>
         <!--Optional:-->
         <cursorMark>*</cursorMark>
         <!--Optional:-->
         <pageSize>20</pageSize>
         <sort>CITED asc</sort>
         <!--Optional:-->
         <synonym>false</synonym>
         <!--Optional:-->
         <email>my_email@mail.com</email>
      </web:searchPublications>
   </soapenv:Body>
</soapenv:Envelope>
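A client could build the same envelope programmatically. The following is a minimal Python sketch that fills in the element names shown above; the function name, its defaults and the template constant are assumptions made for illustration, and sending the envelope to the service is left out.

```python
from xml.sax.saxutils import escape

# Template mirroring the SoapUI request above; element names come from that example.
ENVELOPE_TEMPLATE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:web="http://webservice.cdb.ebi.ac.uk/">
   <soapenv:Header/>
   <soapenv:Body>
      <web:searchPublications>
         <queryString>{query}</queryString>
         <resultType>{result_type}</resultType>
         <cursorMark>{cursor_mark}</cursorMark>
         <pageSize>{page_size}</pageSize>
         <sort>{sort}</sort>
         <synonym>{synonym}</synonym>
         <email>{email}</email>
      </web:searchPublications>
   </soapenv:Body>
</soapenv:Envelope>"""


def build_search_envelope(query, email, cursor_mark="*", page_size=20,
                          sort="CITED asc", result_type="lite", synonym=False):
    """Return a searchPublications envelope; hypothetical helper, not an official client."""
    return ENVELOPE_TEMPLATE.format(
        query=escape(query),              # escape &, < and > in user-supplied text
        result_type=escape(result_type),
        cursor_mark=escape(cursor_mark),  # "*" requests the first page
        page_size=int(page_size),
        sort=escape(sort),
        synonym="true" if synonym else "false",
        email=escape(email),
    )
```

Calling `build_search_envelope("p51", "my_email@mail.com")` yields a first-page request equivalent to the one above; for later pages, pass the previous response's “nextCursorMark” value as `cursor_mark`.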

The response looks like this:

<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
   <S:Header/>
   <S:Body>
      <ns2:searchPublicationsResponse xmlns:ns2="http://webservice.cdb.ebi.ac.uk/">
         <return>
            <version>4.5.3</version>
            <hitCount>6157</hitCount>
            <nextCursorMark>AoI/GzEwLjEwMDIvKFNJQ0kpMTUyMC02Nzc3KDE5OTYpMTU6MzwyMDM6OkFJRC1OQVU1PjMuMC5DTzsyLUooMTEyNDExMTk=</nextCursorMark>
            <request>
               <queryString>p51</queryString>
               <resultType>LITE</resultType>
               <cursorMark>*</cursorMark>
               <pageSize>20</pageSize>
               <sort>CITED asc</sort>
               <synonym>false</synonym>
               <email>my_email@mail.com</email>
            </request>
            <resultList>
               <result>
                  <id>21119085</id>
…
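The two values a paging client needs from this response are “hitCount” and “nextCursorMark”. A minimal Python sketch for reading them follows; the function name is hypothetical, and it matches elements by local name so that it does not depend on the namespace prefixes used in the response.

```python
import xml.etree.ElementTree as ET


def read_paging_info(xml_text):
    """Return the first hitCount and nextCursorMark found in a response document."""
    info = {"hitCount": None, "nextCursorMark": None}
    for element in ET.fromstring(xml_text).iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if name in info and info[name] is None:
            info[name] = (element.text or "").strip()
    if info["hitCount"] is not None:
        info["hitCount"] = int(info["hitCount"])
    return info
```

The echoed request contains a “cursorMark” element as well, but it is not picked up because only the exact names “hitCount” and “nextCursorMark” are matched.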

The first page can be requested with <cursorMark>*</cursorMark> (with the asterisk sign: *). In the response there is the element “nextCursorMark”, whose value has to be used for the next page as the request parameter “cursorMark”. This is easier to demonstrate with the RESTful service:

http://www.ebi.ac.uk/europepmc/webservices/test/rest/search/query=(p55)%20AND%20OPEN_ACCESS:Y&synonym=true&pageSize=100&sort=CITED%20desc&cursorMark=*

The value from <nextCursorMark>AoI4MTAuNDEwMy8xNjczLTUzNzQuMTI4MjQwKDMyOTUzMTA4</nextCursorMark> has to be copied into the next page request, as shown:

http://www.ebi.ac.uk/europepmc/webservices/test/rest/search/query=(p55)%20AND%20OPEN_ACCESS:Y&synonym=true&pageSize=100&sort=CITED%20desc&cursorMark=AoI4MTAuNDEwMy8xNjczLTUzNzQuMTI4MjQwKDMyOTUzMTA4
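The copy-the-cursor procedure can be sketched as a loop in Python. This is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable supplied by the caller that takes a cursorMark and returns the response XML as text, and the stop condition (no “nextCursorMark”, or one equal to the cursor just sent) follows Solr's documented cursor convention; whether Europe PMC signals the last page in exactly this way is an assumption.

```python
import xml.etree.ElementTree as ET


def next_cursor_mark(xml_text):
    """Return the text of the first nextCursorMark element, or None if absent or empty."""
    for element in ET.fromstring(xml_text).iter():
        if element.tag.rsplit("}", 1)[-1] == "nextCursorMark":
            return (element.text or "").strip() or None
    return None


def iterate_pages(fetch_page, max_pages=1000):
    """Yield each page of XML, starting at cursorMark "*" and following nextCursorMark."""
    cursor = "*"
    for _ in range(max_pages):  # safety cap so a misbehaving server cannot loop forever
        xml_text = fetch_page(cursor)
        yield xml_text
        following = next_cursor_mark(xml_text)
        if following is None or following == cursor:
            break  # assumed end-of-results signal
        cursor = following
```

Passing the fetch step in as a callable keeps the loop independent of the transport, so the same code could sit on top of either the REST or the SOAP service.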

There is a second request parameter, “sort”, and any single-valued field can be used here, e.g. P_PDATE, AUTH_FIRST, CITED etc., in “asc” or “desc” order. A RESTful example:

http://www.ebi.ac.uk/europepmc/webservices/test/rest/search/query=(p55)%20AND%20OPEN_ACCESS:Y&synonym=true&pageSize=100&sort=CITED%20desc&cursorMark=*
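A request string like the one in this example could be assembled in Python as follows. The function name and its defaults are assumptions for illustration; only the parameter names (query, synonym, pageSize, sort, cursorMark) come from the URLs in this announcement, and the base URL is deliberately left to the caller.

```python
from urllib.parse import quote, urlencode


def build_search_params(query, sort_field, order="desc", cursor_mark="*",
                        page_size=100, synonym=True):
    """Return the encoded parameter string for a sorted, cursor-based search."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    params = {
        "query": query,
        "synonym": "true" if synonym else "false",
        "pageSize": page_size,
        "sort": f"{sort_field} {order}",
        "cursorMark": cursor_mark,
    }
    # quote (not quote_plus) encodes spaces as %20; "():*" are left readable as in the examples.
    return urlencode(params, safe="():*", quote_via=quote)
```

Cursor values returned by the service can contain characters such as “/” and “=”, which this sketch percent-encodes so that they survive as a single parameter value.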

Be aware that some fields are multivalued, even if they do not look like it, e.g. TITLE:

http://www.ebi.ac.uk/europepmc/webservices/test/rest/search/query=(p55)%20AND%20OPEN_ACCESS:Y&synonym=true&pageSize=100&sort=TITLE%20desc

The response then looks like:

<errorBean>
  <errCode>404</errCode>
  <errMsg>
    org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at __SERVER__: can not sort on multivalued field: TITLE
  </errMsg>
</errorBean>
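A client may want to detect this error document before trying to read results from it. Here is a minimal Python sketch, assuming the error always arrives as a root “errorBean” element with “errCode” and “errMsg” children, as in the example above; the exception class and function name are hypothetical.

```python
import xml.etree.ElementTree as ET


class SearchError(Exception):
    """Raised when the service answers with an errorBean document."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_or_raise(xml_text):
    """Parse a response and raise SearchError if its root element is an errorBean."""
    root = ET.fromstring(xml_text)
    if root.tag.rsplit("}", 1)[-1] != "errorBean":
        return root  # a normal response; the caller can go on to read the results
    fields = {child.tag.rsplit("}", 1)[-1]: (child.text or "").strip() for child in root}
    raise SearchError(fields.get("errCode"), fields.get("errMsg"))
```

With the response above, `parse_or_raise` raises a `SearchError` whose `code` is `"404"` and whose `message` names the multivalued field.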

Kind regards, Europe PMC team

markmacgillivray commented 7 years ago

We should switch our epmc user to use cursorMark for pagination then. It wouldn't actually affect lantern, but needs doing for other potential uses.

On 6 Oct 2016 16:27, "Emanuil Tolev" notifications@github.com wrote:

> Dear Europe PMC Web Service Users, [...]

richard-jones commented 7 years ago

I've tagged this high priority just to make sure that we've reviewed it and know whether we need to do something or not. If we can get ahead of breaking changes to the EPMC connection, that would be good. If this is done, or not relevant, feel free to close.

richard-jones commented 7 years ago

@markmacgillivray can this be closed?

markmacgillivray commented 7 years ago

It has not been done, but does not affect lantern, so can close.