Authentication Changes--Problem with Related Documents

nrshapiro commented 2 years ago

As noted in https://github.com/Psychoanalytic-Electronic-Publishing/OpenPubArchive-Content-Server/issues/134, the client has a problem with the new protocol when there are related documents.

As part of this change, the server doesn't return access data except for a single abstract request or document request. So what's happening here:

The client asks for the document selected: http://development.org:9100/v2/Documents/Document/IJP.083.1133A/?return_format=XML&search=%27%3Ffacetlimit%3D15%26facetmincount%3D1%26smarttext%3Dsuicidal%2Bthoughts%26highlightlimit%3D6%26synonyms%3Dfalse%27

The server responds, and provides correct accessLimited: false information. So the document or abstract can be displayed.

The then client asks for the related document information. The server returns a documentList with three records, WITHOUT checking for access. There's a new flag for that: accessChecked is returned with False in that case.

So the client should now look for accessChecked before using the accessLimited value. However, I set the accessLimited value to None, now by default. It shouldn't be showing up as long as it's never set, just default...though I still see it in the debugger. But now, despite that, the client appears to be working!

bakerac4 commented 2 years ago

However, I set the accessLimited value to None, now by default.

@nrshapiro it looks like for the front end, accessLimited is a boolean (or was). Is that not the case anymore?

Regarding accessChecked - You would want us to check and see if accessChecked is true, and if that is true, check to see if accessLimited is true. Otherwise just full access?

nrshapiro commented 2 years ago

However, I set the accessLimited value to None, now by default.

@nrshapiro it looks like for the front end, accessLimited is a boolean (or was). Is that not the case anymore?

Regarding accessChecked - You would want us to check and see if accessChecked is true, and if that is true, check to see if accessLimited is true. Otherwise just full access?

None basically means "not set". The reason for me using None is because it should indicate that access is neither true nor false...it's simply not determined. If I said it was True, then you would limit access; if I said it was False, you would allow access. I don't want you to do either. By design, you shouldn't see the field accessLimited in the record if it's None. But the FastAPI/Pydantic set can sometimes let it through. So to be more straightforward and deterministic, I added accessChecked as an overriding rule:

If accessChecked is False, then you can't rely on the accessLimited value of True or False. It's undetermined.
If accessChecked is True, then you can rely on the value of accessLimited to indicate whether access is permitted.

Simple as that. I set the accessLimited default back to True so it should always be there. But don't use the value if accessChecked is False. Clear?

bakerac4 commented 2 years ago

@nrshapiro in the case that access is undetermined, what should happen in all the places we were using accessLimited?

nrshapiro commented 2 years ago

The problem is (for whatever reason, perhaps good ones), the client app (APP) does multiple OPAS fetches/queries in a retrieval.

The only fetches/query that OPAS will check now, is 1) a document retrieval, or 2) a request for an abstract with a single result 3) a search that only produces a single result.

The purpose of the above change is to limit how many messages it has to exchange with PaDS.

But the APP seems to do multiple requests to the server for every document you view. And while the Document (or Abstract) request is checked by OPAS/PADS to provide the APP with information, the subsequent requests it makes to fill in the infocards etc. are not checked. Right now, if you try the example at the top of this issue on Stage, you will see the result: the document is shown in it's entirety, and then it displays a message that shows that you have access to the document.

This only happens when there are related documents, due to the extra query the APP makes to OPAS.

You need to abide by/keep the accessLimited value for the document or abstract being displayed, even if in that batch of queries you do at once, you subsequently do a query, where accessChecked is False. Since I'm not holding back the accessLimited flag by None, accessChecked=False means accessLimited=True (the default) is meaningless.

Whenever you actually need to display a document or abstract, which by definition now is when you ask the server for a single document or single abstract, you will ALWAYS get accessChecked=True and the valid value of accessLimited.

nrshapiro commented 2 years ago

@bakerac4

This demonstrates the problem caused in the client by the server accessLimited change. Regular documents work. Documents with RelatedDocuments (see the icon) cause the problem. Due to the query following the document retrieval, the client is showing that message.

https://user-images.githubusercontent.com/4837554/137528184-0eb192df-4872-4357-b47c-71d818444e9a.mp4

It looks like it should be a simple fix.

bakerac4 commented 2 years ago

@nrshapiro while working on this, I had a question. Why are we returning the document we are viewing in the related document call. i.e. how can the document IJP.083.1133A be related to itself?

Would it not make more sense to not include that in the related call? Then A) this problem would go away and B) the user wouldn't see the same document they are viewing in that card which is a bit weird

nrshapiro commented 2 years ago

@nrshapiro while working on this, I had a question. Why are we returning the document we are viewing in the related document call. i.e. how can the document IJP.083.1133A be related to itself?

Would it not make more sense to not include that in the related call? Then A) this problem would go away and B) the user wouldn't see the same document they are viewing in that card which is a bit weird

@bakerac4 Yes, you are right--that wouldn't make sense, and that's a great question, because it actually pinpoints where the problem is: The problem is, that if the client is showing the "requested" document in MoreLikeThis , the client is getting data from the wrong part of the record!

You should be getting the results for MoreLikeThis from the similarityMatch>similarDocs portion of the return data. The document requested (in this case CPS.056.0120A), is simply listed at the outside of the set, because you could in other cases search for more than one document and get similar results for each (so each "set", in [], would be labeled with the articleID that was used to match).

In the set below, you can see the set of similarDocs for CPS.056.0120A includes 5 documents, and doesn't include CPS.056.0120A (I cut a lot fields out of the subrecords to make it easier to see, but you can try this easily in the interactive API, just plug the CPS value into the morelikethis parameter).

https://stage-api.pep-web.org/docs#/Database/database_morelikethis_v2_Database_MoreLikeThis__get

Now true, in the response set above the similar set, it does give all the information for the document you are matching. But that's not part of the data you should be putting in the MoreLikeThis infocard. You should just use the data under similarDocs.

        "similarityMatch": {
          "similarDocs": {
            "CPS.056.0120A": [
              {
                "documentID": "IJPSP.009.0222A",
                "docType": "ART",
                "documentRef": "Summers, F. (2014) The Bonds of Empathy: Beyond the Selfobject Concept. International Journal of Psychoanalytic Self Psychology 9:222-236",
              },
              {
                "documentID": "PPERSP.016.0316A",
                "docType": "ART",
                "documentRef": "Grossmark, R. (2019) The Anguish of Fatherhood. Psychoanalytic Perspectives 16:316-325",
              },
              {
                "documentID": "IJPSPPSC.012.0101A",
                "docType": "ART",
                "documentRef": "Rizzolo, G. S. (2017) Alterity, Masochism, and Ethical Desire: A Kohutian Perspective on Levinas’ Ethics of Responsibility for the Other. Psychoanalysis, Self, and Context 12:101-115",
              },
              {
                "documentID": "PSC.051.0562A",
                "docType": "ART",
                "documentRef": "Tang, N. M. & Smith, B. L. (1996) The Eternal Triangle across Cultures: Oedipus, Hsueh, and Ganesa. Psychoanalytic Study of the Child 51:562-579",
              },
              {
                "documentID": "IFP.028.0140A",
                "docType": "ART",
                "documentRef": "Gonzalez-Torres, M. A. & Fernandez-Rivas, A. (2019) The disturbing presence of the father: Paternal function and its initial developments. International Forum of Psychoanalysis 28:140-146",
              }
            ]
          },
          "similarMaxScore": 30.62795,
          "similarNumFound": 108478
        },

If it's hard to see there, here's the whole record in my debugger, which allows folding of the subfields for similarityMatch, making it easy to see where you should be getting the data for that infocard from:

Getting your data from that section seems to me like that should solve any issue like that. I would say ignore anything that's not in the similarityMatch part of the record, because you don't need it (and in fact, you can make the call and tell it not to return the abstract. I don't see it returning any access data either, so there's nothing to confuse the client. Note that if you are using v2/Database/Search instead of /v2/Database/MoreLikeThis you get the same compartmentalization, but certainly calling MoreLikeThis is better because it is 1) clearer in intent and 2) doesn't skew the search statistics by recording two search endpoint calls in the log, it would record the MoreLikeThis call instead.

cc: @rickardnyman

bakerac4 commented 2 years ago

@nrshapiro Im a bit confused due to the fact that were talking about related documents, not more like this. Which are two different cards. The morelikethis card does indeed use that endpoint. But the related documents card uses the search endpoint to hit the following

https://stage-api.pep-web.org/v2/Database/Search/?facetquery=art_qual%3A(%22IJP.083.1133A%22)&abstract=true&synonyms=false&formatrequested=XML

Are you saying that related documents should also be using the /v2/Database/MoreLikeThis endpoint as well? if thats the case, what are the differences between the cards?

nrshapiro commented 2 years ago

@bakerac4 @rickardnyman

Ah, I see. The problem there is that you are doing a search of a database, using one of the fields. The server doesn't have an endpoint like moreLikeThis for special processing for related fields. So like any database, you are going to find all matching documents to artqual.

And artqual, which is the field that relates documents, does not always match the document you are looking at. It matches one of them. So for example, if you select document IJP.078.0335A (Britton, R.), the art_id for that is IJP.078.0335A, but the art_qual is IJP.076.0019A. The search endpoint, when searching for "IJP.076.0019A", is just finding all documents that have art_qual="IJP.076.0019A". Why would it remove IJP.078.0335A? And of course, per the search, you are asking for ALL documents matching art_qual="IJP.076.0019A".

My first question (before getting to the problems below), is while it may not make sense to include the document you are looking at, you are calling the query to fill the related documents list...why are you looking at the accessLimited and similar fields anyway? You don't need them to fill the list. In fact, you don't need to look at them for any search, do you? They are unnecessary in any search based on the way the client works...permission only matters for abstract or document calls.

However, considering you are doing a generic Search, one easy way to remove the main document from the results would be for the client to change the query to ask the server to exclude the main document:

art_qual:IJP.076.0019A && art_id:(NOT IJP.078.0335A)

But this simple solution actually turns out to be problematic in a few ways. That query does not work directly in Solr Q, the query field. I don't understand why. It looks like a bug. It returns 0 results. It should return 2.

It works in Solr if you separate the two in the q (query) and fq (filter query) fields of the search:

q: art_qual:IJP.076.0019A fq: art_id:(NOT IJP.078.0335A)

Gives 2 results (3 if you don't use the fq, which means it removes the one article).

However, there's another complication. The solr library I'm using doesn't work when you pass it this same query. It returns 0 results.

If you use the endpoint /v2/Database/ExtendedSearch which does a direct Solr query, it works. You can do that by a post in the API using the request body:

{
  "solrcore": "pepwebdocs",
  "solrquery": "art_qual:(IJP.076.0019A)", 
   "solrargs": {
      "fq": "art_id:(NOT IJP.078.0335A)",
      "fl": "art_id, art_title, art_title_xml, art_subtitle_xml, art_author_id, art_authors, art_citeas_xml, art_info_xml, art_sourcecode, art_sourcetitleabbr, art_sourcetitlefull, art_sourcetype, art_level, para_art_id, parent_tag, para, art_vol, art_type, art_vol_title, art_year, art_iss, art_iss_title, art_newsecnm, art_pgrg, art_lang, art_doi, art_issn, art_isbn, art_origrx, art_qual, art_kwds, art_cited_all, art_cited_5, art_cited_10, art_cited_20, art_views_lastcalyear, art_views_last1mos, art_views_last6mos, art_views_last12mos, art_views_lastweek, reference_count, art_fig_count, art_tbl_count, art_kwds_count, art_words_count, art_citations_count, art_ftns_count, art_notes_count, art_dreams_count, file_last_modified, file_classification, timestamp, score, id"
   }
}

and that works, because it goes to Solr directly. You can limit the 'fl' list to just the fields you want.

So:

you could do that.
Or you could simply have the client remove/ignore the article with the art_id you're looking at in the results.
Lastly, it bothers me that the art_id(NOT id) query doesn't work, regardless of it's use, so I'm looking at that.
Since this is a useful function, I could also build it in as an endpoint, and work around any solr or solr library issues.

At this point, the method doesn't matter. Unfortunately, I don't think a solution arriving late today or tomorrow is going to do me any good. It's now too late to push the latest server (with any fix) to production before I leave for vacation. I wouldn't have time to test it thoroughly. So I think it will have to wait until I get back in the beginning of November.

Meanwhile, I'm looking at why those NOT queries don't work as they should.

nrshapiro commented 2 years ago

@bakerac4 @rickardnyman @adistasio

I went ahead and implemented #4, it was easier and a lot quicker than #3 figuring out why the NOT query wasn't working...will have to do that when I get back.

It's tested here, and I'm in the process of loading a new version to Stage. The new endpoint is

2021-10-20_19h54_32

Note that you just need to supply the MAIN ARTICLE ID (the one being viewed). It looks up the data in the relatedrx field (for us art_qual, but that's configurable) that holds related IDs.

The other benefit of this approach is it's another case where you won't be calling the search endpoint, for what is a standard infocard everytime (or most of the time) when you look at an article. That skews the user search data statistics. This search isn't recorded there. https://stage-api.pep-web.org/docs#/Database/database_related_to_this_v2_Database_RelatedToThis__get

Data Example (results not formatted here though): https://stage-api.pep-web.org/v2/Database/RelatedToThis/?relatedToThis=IJP.078.0335A

This is up on stage. Please go ahead and modify the client to fix it THIS WEEK and push it to stage so that when I return I can run all necessary testing and push to production within a day. If I get a chance, I'll do some user testing while away if it's all on Stage.

nrshapiro commented 2 years ago

Latest client uses the new endpoint. Closing.

nrshapiro commented 2 years ago

@adistasio @bakerac4 @bdami-gavant

We have a problem here. The "new" endpoint I added in November was designed to return a list of the related papers, but not the paper that the user is viewing (users complained about that before).

As you can see in the description above,

relatedToThis - Enter a document ID to find all related documents per a common schema value

You should send it the document that the user is viewing in the relatedToThis field. Not the art_qual value in the document. The API considers that you are asking for documents related to the one the user is viewing.

So it works like this:

1) The server searches for the Solr document record that matches the ID. The one the user is looking at. 2) The server then gets the art_qual id from that record.
3) Then the server does a second search, for all documents that have a matching art_qual. As part of this search, the server filters out the original art_id (the relatedToThis parameter), so the document the user is viewing is not in the list.

This works as I would expect if the user were calling the API directly. The user would not know the art_qual value...they would only know the ID of the document they were interested in finding related documents.

Since the client does know what documents have an art_qual, though, you should continue to be "smart" and only call the API if you know the document has related documents via art_qual. But you should send the viewed article id, not the value from art_qual. Clear?

Can we please fix this for this release? It shouldn't be much to do. I'm going to mark it as such in the issues.

Psychoanalytic-Electronic-Publishing / PEP-Web-User-Interface

Authentication Changes--Problem with Related Documents #563