geneontology / amigo

AmiGO is the public interface for the Gene Ontology.
http://amigo.geneontology.org
BSD 3-Clause "New" or "Revised" License
29 stars 17 forks source link

Reference page not correctly displaying article title. #655

Open malcolmfisher103 opened 1 year ago

malcolmfisher103 commented 1 year ago

The reference page http://amigo.geneontology.org/amigo/reference/PMID:30352852 displays the article title as '[object Object]' rather than 'Evolutionarily conserved Tbx5- Wnt2/2b pathway orchestrates cardiopulmonary development'.

kltm commented 1 year ago

@malcolmfisher103 Huh, nice catch. Let me look into that.

kltm commented 1 year ago

It seems like it might be due to the embedded markup in the title. Doesn't work: <ArticleTitle>Evolutionarily conserved <i>Tbx5</i>-<i>Wnt2/2b</i> pathway orchestrates cardiopulmonary development.</ArticleTitle> Works: <ArticleTitle>Adrenocortical function in 4-APP treated rats: a coupled stereological and biochemical study.</ArticleTitle>

Digging into the JS a little:

jxon = require('jxon');
str = '<PubmedArticleSet><PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">30352852</PMID><DateCompleted><Year>2019</Year><Month>05</Month><Day>22</Day></DateCompleted><DateRevised><Year>2019</Year><Month>05</Month><Day>22</Day></DateRevised><Article PubModel="Print-Electronic"><Journal><ISSN IssnType="Electronic">1091-6490</ISSN><JournalIssue CitedMedium="Internet"><Volume>115</Volume><Issue>45</Issue><PubDate><Year>2018</Year><Month>11</Month><Day>06</Day></PubDate></JournalIssue><Title>Proceedings of the National Academy of Sciences of the United States of America</Title><ISOAbbreviation>Proc Natl Acad Sci U S A</ISOAbbreviation></Journal><ArticleTitle>Evolutionarily conserved <i>Tbx5</i>-<i>Wnt2/2b</i> pathway orchestrates cardiopulmonary development.</ArticleTitle></MedlineCitation></PubmedArticle></PubmedArticleSet>';
j = jxon.stringToJs(str);
j.PubmedArticleSet.PubmedArticle.MedlineCitation.Article.ArticleTitle

Returns:

{ i: [ 'Tbx5', 'Wnt2/2b' ],
  _: 'Evolutionarily conserved-pathway orchestrates cardiopulmonary development.' }

from string: Evolutionarily conserved <i>Tbx5</i>-<i>Wnt2/2b</i> pathway orchestrates cardiopulmonary development. This means that the jxon transformation gives no way to recover the correct title. I wonder if the embedded HTML is intended or not and if this might be an upstream issue?

Either way, I suspect this is limited to titles with embedded features and if maybe just the _ object portion could be used as a "good enough" for situations like this. Alternatively, it may be possible to just string scan and grep out the title from the original string in the error case that an object is returned.

pgaudet commented 1 year ago

This seems intentional , see screenshot

image

Can we not skip these html elements?

kltm commented 1 year ago

@pgaudet Correct, from the examples above. They can be removed, but the JSON, as it stands, does not allow for a correct title to be extracted. We'd have to crack open the hood and have a fall-through case (or otherwise switch) to scanning the returned XML some other way. The main point here is that it is not completely trivial to fix, but like fairly easy. That said, I'm still curious why there is HTML embedded in the XML (which strikes me as a little odd).

AlexanderNull commented 1 year ago

Randomly dove down this rabbit hole and thought I should post what I found. Pubmed's database+api looks to be a little less feature rich than Pubmed Central's and so there's some interesting data storage formats here and some response types which aren't fully supported such as a JSON response that contains the AbstractText data (which is also affected in this particular article's case). There's not necessarily a quick fix available off the shelf but there are a couple of options:

1) Switch to requesting response in the rettype=medline format. This will provide the results in a text file which is formatted in a way where all of the currently used fields can be parsed and wherein there are no embedded html tags in the response. The response for this specific problematic article can be viewed here: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30352852&rettype=medline The major drawback I can see of this response type is the difference in author name formatting where this return type provides author names as "Last, First (Middle)" format as opposed to the two separate fields currently parsed from the XML. Not a huge issue on its own as the string can be split, results reversed, and rejoined but this does expose the result to potential unforeseen corner cases.

2) Pull title and abstract text from the medline format above, and the other formats from the xml format currently requested. This produces potentially the least amount of churn as far as results goes, but doubles up all API requests. From a responsible API user perspective it might be nice to avoid doubling up requests if possible for load on pubmed's servers as well as latency on the page load.

3) Switch from using JQuery to create the text nodes and instead first creating a vanilla JS textnode via document.createTextNode(title). This will parse the title as a plain text even when this textnode is then appended via JQuery. As a result the html tags will appear as plaintext in the resulting title. Not the prettiest result but the easiest to implement and presents the fewest potential bugs.

4) Iterate through the xml results for both title and abstract, searching for each closing tag and then removing each corresponding opening tag. Assuming there's no additional attributes on the html tags provided then this is not going to be super bug prone, but even in that best case scenario there's likely to eventually end up being some corner case that gets mangled. If there are attributes on the included tags then this becomes even less fool proof to work 100% of the time.

Got any opinion on one of these options or an alternative? Want assistance?

kltm commented 1 year ago

Thank you @AlexanderNull for taking a bit of a look at this. Druthers would not to have custom medline parsing (1 and 2). The same for 4.

I suspect the right answer is in the neighborhood of your "3": dig out the area around title by some method, maybe jxon still, and then convert it to a string without jxon's (frankly) incorrect method. I think the hack here was trying to do this with old in-browser libs (as jxon is rather out-of-date now).

AlexanderNull commented 1 year ago

Even without changing the rest client we could take the raw results and instead of passing them to jxon, pass them to the built in DOMParser (the native object that caused all JXON support to be dropped from official specs). Using something like this around https://github.com/geneontology/amigo/blob/386c405edabbdb23a77adf1932344128e15e733f/javascript/web/ReferenceDetails.js#L279:

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(xml);

// omitting various if checks for example brevity
var op = xmlDoc.querySelector('PubmedArticleSet PubmedArticle MedlineCitation Article');

// skipping over some more stuff
var title = op.querySelector('ArticleTitle')?.textContent || 'n/a';
var abstract = op.querySelector('Abstract')?.textContent || 'n/a';

document.querySelector has support all the way back to Chrome V1 so that shouldn't present any issues. This approach actually provides a double benefit as the document object contents can either be accessed via the innerHTML prop or as done in my example the textContent prop, the later of which automatically strips the embedded html tags from the results!

What do you think about this approach? I don't have all of AmiGO running locally yet but it seems like a pretty straightforward switch and the added sanitization is nice.

kltm commented 1 year ago

@AlexanderNull That sounds like a good strategy: bypassing the faulty jxon after the REST call. If you can give me a little more context for the code or a mock PR, I can get it in and test if NP from my end. AmiGO can be tricky to build and work with these days (although some fixes there imminent).