PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License
27 stars 7 forks source link

AuthorProfiles: Provide error handling, fallback #923

Closed jvwong closed 3 years ago

jvwong commented 3 years ago

Background: Creation of a new document triggers, among may others, an attempt to retrieve OCRID information.

Issue: If any error occurs in the ORCID retrieval, the remaining aspects of doc creation are crippled. On the client side UI, this appears like an fatal error, with 'try again later' message

Propose: Provide a fallback for errors in fillDocAuthorProfiles / fillDoc

jvwong commented 3 years ago

Example PMID that triggers error: 25064009

Error: Unexpected close tag
Line: 5
Column: 7
Char: >
    at error (...factoid/node_modules/sax/lib/sax.js:651:10)
...
    at async fillDocAuthorProfiles (...factoid/src/server/routes/api/document/index.js:365:24)
    at async fillDoc (....factoid/src/server/routes/api/document/index.js:372:3)

Another example that throws the same error: "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing"

maxkfranz commented 3 years ago

Probably best to just catch and leave the problematic fields empty. Either the author can recover with the 'Set your title' UI, or it can be sorted out manually on a case-by-case basis in the admin UI.

metincansiper commented 3 years ago

Probably best to just catch and leave the problematic fields empty. Either the author can recover with the 'Set your title' UI, or it can be sorted out manually on a case-by-case basis in the admin UI.

@maxkfranz I created #924 which fixes the issue for ORCID. I see that if an invalid email is entered the document is still created as we wish but still the invalid email is stored as is and so when "set article title" button is clicked email editing text box does not appear. Should I play around the logic there so that the email address is stored as null when it is invalid?

maxkfranz commented 3 years ago

Probably best to just catch and leave the problematic fields empty. Either the author can recover with the 'Set your title' UI, or it can be sorted out manually on a case-by-case basis in the admin UI.

@maxkfranz I created #924 which fixes the issue for ORCID. I see that if an invalid email is entered the document is still created as we wish but still the invalid email is stored as is and so when "set article title" button is clicked email editing text box does not appear. Should I play around the logic there so that the email address is stored as null when it is invalid?

The email issue you've brought up may not be worth pursuing at this time, since it wouldn't affect any of our user workflows:

(1) User signs up on the homepage. This workflow isn't affected, and the email is always provided by the user.

(2) A new document is sent to a prospective user in an email. In this case, the document is empty. However, it's already associated with the author's email address, the same email address we used to send them the related articles email.

In both of these workflows, the email is valid. Do you have a different case in mind for invalid emails?

metincansiper commented 3 years ago

The email issue you've brought up may not be worth pursuing at this time, since it wouldn't affect any of our user workflows:

I was just considering the case where user provided the email through home page but the provided email address was ended up being invalid. I guess that may happen because of a typo. However, in that case we may consider it similar to the case where the user enters a wrong but valid email address. Therefore, it may not be a big deal as you mentioned.

maxkfranz commented 3 years ago

I was just considering the case where user provided the email through home page but the provided email address was ended up being invalid. I guess that may happen because of a typo. However, in that case we may consider it similar to the case where the user enters a wrong but valid email address. Therefore, it may not be a big deal as you mentioned.

That's a good point, but as you mentioned, it's not a big problem or a high priority. Better to focus on other things for now

jvwong commented 3 years ago

I'm noticing with my failing cases (https://pubmed.ncbi.nlm.nih.gov/25064009/) that the HTML and HTTP code is 503 (503 Service Temporarily Unavailable). I'm guessing this might be happening because there's a lot of authors, consequently sending off a whole bunch of requests to their web service? Hence the Unexpected closed tag error in the logs.

maxkfranz commented 3 years ago

That's a good motivation to revisit a queueing system for external requests. p-limit or p-queue should address that easily

metincansiper commented 3 years ago

I'm noticing with my failing cases (https://pubmed.ncbi.nlm.nih.gov/25064009/) that the HTML and HTTP code is 503 (503 Service Temporarily Unavailable). I'm guessing this might be happening because there's a lot of authors, consequently sending off a whole bunch of requests to their web service? Hence the Unexpected closed tag error in the logs.

That's a good motivation to revisit a queueing system for external requests. p-limit or p-queue should address that easily

@jvwong the same case worked when I tried and also I could not catch any error message for 25064009 this time. I think it may support your guess.

@maxkfranz I guess @jvwong probably means that the pubmed query is sometimes failing independent of our calls from biofactoid (Am I right?). In that case would queueing help or should we try fetch retry similar to we are doing for indra queries (I guess it would work if the service is unavailable for a short time).

jvwong commented 3 years ago

@jvwong the same case worked when I tried and also I could not catch any error message for 25064009 this time. I think it may support your guess.

I think its the ORCID web service that is sending back errors - is there a specific rate limit? Do they offer api keys?

maxkfranz commented 3 years ago

Another approach to consider is whether we need to make so many queries in the first place. Currently, I believe we're making one ORCID query per author. Would it be possible to make a single ORCID query for the paper itself, which would return the list of all authors of the paper? This approach would be simple and sidestep a lot of issues.

metincansiper commented 3 years ago

I think its the ORCID web service that is sending back errors - is there a specific rate limit? Do they offer api keys?

Another approach to consider is whether we need to make so many queries in the first place. Currently, I believe we're making one ORCID query per author. Would it be possible to make a single ORCID query for the paper itself, which would return the list of all authors of the paper? This approach would be simple and sidestep a lot of issues.

I found out this discussion from the past (https://groups.google.com/g/orcid-api-users/c/zlIeDAr2GxI). The response of technical operations director at Orcid.org from 2016 is that:

Definitions:
Request a second - Number of request that can be made a second.
Burst - Number of request we will allow to be queued before rejecting. The request in the queue are slowed down at the request a second rate.
v1.2
Request a second - 8
Burst - 40
2.0_rc+
Request a second - 24
Burst - 40
If you exceed the burst, you'll get a 503 responses. 

Based on that making the query for paper itself as Max suggested may help. It is possible to make such a query which will return csv result. Then I can parse the lines of query result to find the matches per author.

metincansiper commented 3 years ago

I created #941 that refs this issue.