Lens reader for an NLM XML by URL

MikeTaylor commented 11 years ago

It would be great to be able to read ANY valid NLM XML file using Lens just by specifying the URL. Something like http://lens.elifesciences.org/xml-url/https://peerj.com/articles/36/

michael commented 11 years ago

Not sure if on the fly conversion performs well enough. I thought about making Lens aware of different content repositories though, which means that content providers could make their repository available by exposing a service conforming to the JSON format Lens can read.

E.g. http://lens.elifesciences.org#peerj would let you browse within the available articles of that repository.

More work needs to go into the specification of the Lens Document Format as well as documenting the conversion process and providing tools that help with the conversion process.

MikeTaylor commented 11 years ago

This seems like a good route to take.

Although on-the-fly conversion would surely be a useful tool for testing and debugging. Maybe you could consider making it available for beta partners so they can report to you where they find problems in the conversion or display?

On 6 June 2013 23:44, Michael Aufreiter notifications@github.com wrote:

Not sure if on the fly conversion performs well enough. I thought about making Lens aware of different content repositories though, which means that content providers could make their repository available by exposing a service conforming to the JSON format Lens can read.

E.g. http://lens.elifesciences.org#peerj would let you browse within the available articles of that repository.

More work needs to go into the specification of the Lens Document Format as well as documenting the conversion process and providing tools that help with the conversion process.


michael commented 11 years ago

Our converter will be open sourced soon, we just haven't had the time to review the codebase and get the documentation right. It would be great to have support from other Open Access publishers with regards to accessing fresh content. Lens could also be a useful tool outside of the science community, e.g. for viewing software documentation etc... But then again... one step at a time. :)

ivangrub commented 11 years ago

The converter is currently built to support the XML standard described at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html . Some edge-case errors may still come up, but they should be quick to debug.

I will look into building a more robust API for the converter so that it is easier to plug into different workflows. The main issue with on-the-fly client side conversions is that each publisher would have to provide definitions for their figure URLs that would either have to be included in the converter, or added as a post processing step in that publisher's workflow. I think a static repository supported by each publisher would probably be the ideal way to move forward as well.

MikeTaylor commented 11 years ago

Why does the NLM->JSON converter need to know the individual publishers' figure URL conventions? (I'm not saying it doesn't; I'm just curious as to what could impose such a requirement.)

ivangrub commented 11 years ago

Unfortunately, the XML tags do not provide a src attribute. Through a little hacking it is possible to figure out how to stitch together the URL, but that is a tedious and error-prone process.

The image and video nodes in the JSON contain a url property, which needs to point to the image or video to be displayed. If that property is empty, the article will render fine, except that you will be missing all of the media bits.

MikeTaylor commented 11 years ago

Wait ... NLM, the universally used canonical format for representing scholarly articles ... HAS NO WAY TO LINK TO THE FIGURES? Did I understand you right?

ivangrub commented 11 years ago

Yes. The figure tags contain a graphic-id attribute that gives the figure's name, which can be stitched together with an image-style extension. This depends, though, on having local storage of all the article's media (images, video, supplementary material, source code, etc.) in the same path as the article's XML. Universal access to the figures is not available at the moment. Or at least I have yet to see it. If you know of a way to do this, I would be more than happy to hear about it and quickly implement it in the converter.

MikeTaylor commented 11 years ago

Holy poo. Well, I am astonished at such designed-in brain-damage. Sorry I can't help -- I am no NLM expert, and certainly don't know a better way. So, yes, your converter script will need, or need access to, a set of per-publisher or per-journal recipes for turning graphic-IDs into URLs.

rdmpage commented 11 years ago

Mike, as an example, here's a fragment from ZooKeys:

To render this you need to figure out what the full path is to the image. You don't get a URL :(

rdmpage commented 11 years ago

Of course, if you have a DOI for the figure you may have better luck...

ivangrub commented 11 years ago

It is quite unfortunate. I think part of the reason for this is that the URL path for each publisher is subject to change over time, so instead of having to update the XML each time with the new URL, PubMed decided to just push for having local storage of all of the figures and associated files. That is why I think a local repository of the converted JSONs, per publisher, would be the best. We can work with the interested parties to add the converter to their workflows and then link these repositories with Lens.

Even having the figure DOI is not ideal though. In that case you would have to do a http.request to get the html of the DOI page, and then scrape it for the image src URL. In async workflows like node.js, you could do this, but it would take much longer than investing a little time up front to have an organized workflow that publishes all of the associated files of an article's XML to Amazon or another cloud service where the URLs will not change.

MikeTaylor commented 11 years ago

I see from the ZooKeys example that the attribute in question is "href", which is at least suggestive that it's the name of an HTTP-addressable resource relative to the address of the document that contains it. Doesn't it follow that if you download a ZooKeys XML from http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml , then the href refers to http://www.pensoft.net/J_FILES/1/articles/5334/ZooKeys-307-001-g001.jpg ?

Answer 1 (from the W3C's XLink spec at http://www.w3.org/TR/xlink/#link-locators ) is that, yes, the href " must be a URI reference as defined in [IETF RFC 2396]". Answer 2 (from simple testing) is that, no, this URL doesn't work. Darn.
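The rule being applied here is ordinary relative-reference resolution. A minimal sketch in Python, assuming the figure's href value is the bare file name `ZooKeys-307-001-g001.jpg` (the XML fragment itself is not reproduced in this thread, so that value is inferred from the URL that was tried); `resolve_href` is a hypothetical helper name:

```python
from urllib.parse import urljoin

# URL the ZooKeys XML was downloaded from (taken from the comment above).
DOCUMENT_URL = "http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml"

# Assumed href value: the bare file name, inferred from the URL that was tried.
FIGURE_HREF = "ZooKeys-307-001-g001.jpg"


def resolve_href(document_url: str, href: str) -> str:
    """Resolve a relative href against the URL of the document containing it."""
    return urljoin(document_url, href)


resolved = resolve_href(DOCUMENT_URL, FIGURE_HREF)
print(resolved)
```

This only computes where the figure ought to be according to the resolution rule; whether the publisher actually serves a file at that address is a separate question.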

rdmpage commented 11 years ago

Almost, http://www.pensoft.net/J_FILES/1/articles/5334/export.php_files/ZooKeys-307-001-g001.jpg

rdmpage commented 11 years ago

So, for each journal you need to figure out how they serve images, and everyone does it differently.
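One way to picture the per-journal knowledge this implies is a small table of "recipes", one per publisher. The sketch below is only an illustration: the registry, the publisher key `"zookeys"`, and the function names are all hypothetical, and the only publisher-specific fact it encodes is the `export.php_files/` folder noted above.

```python
from typing import Callable, Dict
from urllib.parse import urljoin

# A recipe turns (URL of the XML document, href found in the XML) into an image URL.
Recipe = Callable[[str, str], str]


def zookeys_recipe(document_url: str, href: str) -> str:
    # ZooKeys figures sit in an "export.php_files/" folder next to the XML.
    return urljoin(document_url, "export.php_files/" + href)


def default_recipe(document_url: str, href: str) -> str:
    # Fallback: plain relative resolution against the document URL.
    return urljoin(document_url, href)


# Hypothetical registry; each new publisher needs its own entry.
RECIPES: Dict[str, Recipe] = {"zookeys": zookeys_recipe}


def figure_url(publisher: str, document_url: str, href: str) -> str:
    """Look up the publisher's recipe and apply it, falling back to the default."""
    recipe = RECIPES.get(publisher, default_recipe)
    return recipe(document_url, href)


print(figure_url(
    "zookeys",
    "http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml",
    "ZooKeys-307-001-g001.jpg",
))
```

The cost of this approach is exactly the one complained about in the thread: the table has to be maintained by hand and breaks whenever a publisher reorganizes its URLs.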

MikeTaylor commented 11 years ago

Surely in this case the ZooKeys document is flatly invalid.

ivangrub commented 11 years ago

@MikeTaylor, ideally that is exactly how this would end up working. Realistically, aside from having to hack each publisher with tweaks like what @rdmpage just noticed by adding 'export.php_files' to the URL path, we would have publishers get on board with reorganizing their URLs in a manner that actually makes sense.

Anyone want to start leaning on the publishers to have a standard URL structure?

MikeTaylor commented 11 years ago

Let's try some other publishers ... PeerJ next, as they're my favourites ...

MikeTaylor commented 11 years ago

My PeerJ article is https://peerj.com/articles/36.xml and has no "xml:base" element.

The first figure is expressed as:

which means that the URL of the actual figure should be https://peerj.com/articles/fig-1.png

To my enormous disappointment, the figure is not there. Distressingly, the full-size version seems to be at https://dfzljdn9uc3pi.cloudfront.net/2013/36/1/fig-1-full.png

I expected better from PeerJ. We should probably get in touch with them.

ivangrub commented 11 years ago

Even if the URL path were something like this: https://peerj.com/articles/ARTICLE_ID/ that would be fine. Your figure would then be https://peerj.com/articles/36/fig-1-full.png and the XML would be at https://peerj.com/articles/36/36.xml.

If you manage to get in touch with PeerJ, you can point them to this issue thread and we can figure out how to best fix this problem.

gnott commented 11 years ago

Tagging Figures, Graphics http://dtd.nlm.nih.gov/publishing/tag-library/n-wq32.html

External Link http://dtd.nlm.nih.gov/publishing/tag-library/n-hya0.html In eLife XML it looks like the DOI URL is included in a <ext-link> tag, though following that URL will not necessarily bring you to the image file itself. Another wrinkle is providing the same graphic in multiple sizes / resolutions, and multiple file formats.

IanMulvany commented 11 years ago

It's all rather unfortunate. It would be interesting to cross-post the issue to the JATS list. In the interim, we now have content negotiation on eLife: http://www.elifesciences.org/elife-now-supports-content-negotiation/, but to mirror your comment, Mike - poo.

hubgit commented 11 years ago

In JATS XML, I think the image paths are designed to be relative to the XML file when they're bundled together into a ZIP file for archiving. I'm not sure if this is how it's implemented by all publishers, though.

For PeerJ, an article at https://peerj.com/articles/36/ has an XML file at https://peerj.com/articles/36.xml and figures at https://peerj.com/articles/36/fig-1.png (for example).

MikeTaylor commented 11 years ago

Excellent. Then it seems that just adding the relevant xml:base attribute to PeerJ's XML will fix this.

hubgit commented 11 years ago

In theory, yes, but it looks like articles with an xml:base attribute don't validate against the JATS DTD.

I've only dealt with images in cross-publisher articles when served from PMC, where they're all at predictable URLs.

ivangrub commented 11 years ago

The PMC URLs are easier to predict because they use the xlink:href in the graphics tag to point to the figure. The only issue there is getting the PMC ID for the article which is not in each publisher's XML. The best way to do it in that case is to probably do on-the-fly conversions by requesting the XML from PMC and pulling the PMC ID from their version of the XML.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3539393/bin/elife00170f001.jpg

which would translate to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC_ID/bin/fig_xlink:href.jpg
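As a rough sketch of that pattern in Python: the template below is copied from the URL shape given above, while the function name `pmc_figure_url` and the assumption that every figure is served with a `.jpg` extension are illustrative only.

```python
# URL shape taken from the example above; the ".jpg" suffix is assumed to hold generally.
PMC_FIGURE_TEMPLATE = "http://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/bin/{href}.jpg"


def pmc_figure_url(pmc_id: str, href: str) -> str:
    """Build a PMC figure URL from the article's PMC ID and the graphic's xlink:href."""
    return PMC_FIGURE_TEMPLATE.format(pmc_id=pmc_id, href=href)


print(pmc_figure_url("PMC3539393", "elife00170f001"))
```

The PMC ID still has to come from somewhere, which is the sticking point described above when starting from a publisher's own XML.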

Daniel-Mietchen commented 11 years ago

Two pointers to related projects:

https://github.com/konrad/JATS-to-Mediawiki converts JATS from PMC into MediaWiki markup. We are planning to deploy it on Wikisource this summer.
https://github.com/erlehmann/open-access-media-importer uploads audio and video files from PMC to Wikimedia Commons. We have thought of taking the publishers' XML directly (cf. https://github.com/erlehmann/open-access-media-importer/issues/45 ) rather than through PMC (in order to bridge the gap of weeks to months between the XML being delivered to and publicly exposed at PMC) but not pursued this further, precisely because it is not obvious where to locate the media files (in PLOS XML, for instance, they do not even have file names).

Apart from that, there are a number of problems in the XML delivered to PMC with regard to the signaling of licensing and MIME types (cf. http://chrismaloney.org/notes/OAMI%20JatsCon%20Submission,%202013 and http://outreach.wikimedia.org/wiki/GLAM/Newsletter/November_2012/Contents/Open_Access_report ), and these problems will also be mentioned in a breakout session on the reuse of OA materials in Wikimedia contexts at OAI8 (cf. https://indico.cern.ch/contributionDisplay.py?sessionId=10&contribId=67&confId=211600 ).

Klortho commented 11 years ago

@MikeTaylor wrote, "Surely in this case the ZooKeys document is flatly invalid." No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link.

I think the disconnect comes from two facts: JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want; and that the XML, if it is served at all, is usually served independent of the figures and other media (and, as Alf pointed out, it doesn't even allow an xml:base attribute). It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website.

kaveh1000 commented 11 years ago

JATS is full of surprising tags, e.g. , allowing the creation of valid documents with zero structure in the references, e.g. here.

A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue...

MikeTaylor commented 11 years ago

@Klortho wrote:

"'Surely in this case the ZooKeys document is flatly invalid.' No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link."

Yes, that is a much more precise way to articulate the problem. I shouldn't have used the word "invalid" which of course in the XML world means "not conforming to the schema".

"JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want."

Really? Seems like flagrant misuse of the XLink attributes to me. "xlink:href" has a specific meaning, detailed at http://www.w3.org/TR/xlink/#link-locators and dependent on the "xml:base" attribute for its interpretation. As the spec. says, "If the URI reference is relative, its absolute version MUST be computed by the method of [XML Base] before use".

"It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website."

... which is precisely why no-one does that.

@kaveh1000 writes:

"A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue..."

Except, bizarrely, the "xml:base" attribute that's needed to make this stuff work in a sane way. As this conversation progresses, it looks increasingly as though that is the underlying bug here, no?

At present, we have a worldwide standard representation of academic articles which does not contain the information of how to obtain figures. That is just crazy.

hubgit commented 11 years ago

If an xml:base attribute is not present, the base URL of an XML document is the URL from which it's served, and links are built relative to that. When the XML and image files are all in the same folder (or zip file), which is how it's mostly been used until now, it works fine and is perfectly valid.
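The fallback rule described here can be sketched in Python as follows. The sample markup is a made-up stand-in for a figure element (the real PeerJ fragment is not shown in this thread), `graphic_urls` is a hypothetical helper, and for brevity only an xml:base on the root element is honored.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical sample; not copied from any publisher's XML.
SAMPLE_XML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <fig id="fig-1"><graphic xlink:href="fig-1.png"/></fig>
</article>"""


def graphic_urls(xml_text: str, document_url: str) -> List[str]:
    """Resolve every graphic's xlink:href to an absolute URL."""
    root = ET.fromstring(xml_text)
    # With no xml:base, the base is the URL the document was served from.
    # Simplification: only a root-level xml:base is considered.
    base = urljoin(document_url, root.get(XML_BASE, ""))
    return [
        urljoin(base, graphic.get(XLINK_HREF))
        for graphic in root.iter("graphic")
        if graphic.get(XLINK_HREF)
    ]


print(graphic_urls(SAMPLE_XML, "https://peerj.com/articles/36.xml"))
```

Served from https://peerj.com/articles/36.xml with no xml:base, the relative name resolves under /articles/ rather than /articles/36/, which matches the mismatch reported earlier in the thread.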

When the XML files are online, I think we probably need to start adding more elements with absolute URLs that point to all the different formats and sizes of a figure that are available (when displaying online, you don't actually want the full-resolution PNG that's linked to in the XML file, you probably want one of the JPEG versions). It should be straightforward enough to do, but no-one's had to address it so far, so there's no standard yet. HTML is addressing a similar problem with srcset and image-set.

MikeTaylor commented 11 years ago

It's 2013. I can't believe we're still having this conversation. The NLM DTD is at least ten years old (see the old releases at http://dtd.nlm.nih.gov/#id48886) and this still isn't covered? What, did no-one ever want the figures from an XML paper before?

hubgit commented 11 years ago

What, did no-one ever want the figures from an XML paper before?

They did, but anyone working with a single publisher's files has been able to build the URLs, and anyone working cross-publisher has been able to get the figures via PMC.

It would be useful, still, to be able to get a list of all the components of an article with live URLs (via HTML this currently works fine, but the XML has historically been intended for archiving, and the chances of full URLs breaking over the next 10,000 years are quite high). I think it's the kind of thing that OAI-ORE was designed for, but that's possibly over-complex for this use case.

ivangrub commented 11 years ago

@hubgit I agree that it is not difficult to hack together a single publisher's URLs. Getting the figures via PMC is only easy, though, if your starting XML files are from PMC too (the URL depends on the PMC ID). In my opinion, there is really no need to have a unique identifier for each article other than the DOI.

Leveraging the 600,000+ open access articles on PubMed would be great, but it is difficult due to their terms of use, and I do not see much purpose in reinventing the wheel a few times over to provide the exact same service. We will be open sourcing the converter soon. At that point every publisher that is interested in using Lens can make their appropriate magic soup to pull out the figure URLs.

I spent a little time playing around with the XML to figure hacks for PLOS and others, but it is not as simple as @hubgit wrote:

"If an xml:base attribute is not present, the base URL of an XML document is the URL from which it's served, and links are built relative to that."

hubgit commented 11 years ago

@ivangrub I agree with you entirely - I was just pointing out how xml:base works :-)

Klortho commented 11 years ago

I added this comment to the NISO JATS spec comments list. Sadly, it's not a forum -- there's no discussion or any kind of back-and-forth. If I got anything wrong, or somebody wants to add their own point of view, the thing to do would be to add your own comment.

hubgit commented 11 years ago

I've fixed PeerJ's article XML files (e.g. http://peerj.com/articles/36.xml) so they now use absolute URLs, rather than relative URLs, for the xlink:href attributes (i.e. they link to the actual image URLs).

These are the full-size PNG files though, which can be rather large; still looking for best-practices for marking up alternate formats.

MikeTaylor commented 11 years ago

Nice, thanks!

I have nearly learned better than to ask this kind of question, but not quite, so here goes: SURELY the NLM/JATS schema has a way to express this obvious and important concept?

hubgit commented 11 years ago

Yes, wrapping the different formats in <alternatives> already allows alternative formats for the same element (e.g. MathML or graphic versions of a formula).

What I'm not sure about is if the attributes on the <graphic> element are expressive enough to allow a client to know which one to choose: the mimetype and mime-subtype attributes will allow a client to distinguish between formats, but there isn't a width, height or file size attribute, as far as I can tell.
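To make the limitation concrete, here is a sketch in Python of a client choosing among alternatives using only the mimetype and mime-subtype attributes. The sample markup, the file names in it, and the `pick_graphic` helper are hypothetical; they are not taken from any publisher's XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Sequence

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample: one figure offered in two formats.
SAMPLE = """<alternatives xmlns:xlink="http://www.w3.org/1999/xlink">
  <graphic mimetype="image" mime-subtype="png" xlink:href="fig-1-full.png"/>
  <graphic mimetype="image" mime-subtype="jpeg" xlink:href="fig-1.jpg"/>
</alternatives>"""


def pick_graphic(
    xml_text: str,
    preferred: Sequence[str] = ("image/jpeg", "image/png"),
) -> Optional[str]:
    """Return the href of the first graphic whose MIME type is preferred."""
    root = ET.fromstring(xml_text)
    by_type: Dict[str, str] = {}
    for graphic in root.iter("graphic"):
        mime = f'{graphic.get("mimetype")}/{graphic.get("mime-subtype")}'
        # Keep the first graphic seen for each MIME type.
        by_type.setdefault(mime, graphic.get(XLINK_HREF))
    for mime in preferred:
        if mime in by_type:
            return by_type[mime]
    return None


print(pick_graphic(SAMPLE))
```

Note what the client cannot do here: two PNGs of different sizes would be indistinguishable, since nothing in these attributes carries width, height, or file size.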

MikeTaylor commented 11 years ago

So you're saying that only MIME type serves to distinguish the variants? Hmm. Can you hack the MIME types, so you use image/png+full image/png+medium image/png+full ?

Better still, has someone already standardised such extended MIME types?

hubgit commented 11 years ago

I don't think that particular standard exists, as "full" is ambiguous (a client still doesn't know what size that is; if it knew the sizes it could assume that the largest is the "full" image).

There's a proposal for an HTML <picture> element, but that assumes that the client has control over the HTML and knows what sizes they want to display the image at, rather than defining the actual sizes of the source files.

MikeTaylor commented 11 years ago

So we need TWO extensions to the NLM schema: xml:base attribute, and height/width attributes on image links.

Klortho commented 11 years ago

This NISO announcement might be of interest to people on this thread; http://www.niso.org/news/pr/view?item_key=095ead17653aacf2db53445611417084f1d052dc

MikeTaylor commented 11 years ago

It's of interest, yes; but not necessarily in a good way. I would much rather they just fixed the obvious bugs in NLM than throw it all out and start again. Poor NISO -- I suppose they feel they have to have SOMETHING to do. See also http://svpow.com/2013/06/22/why-a-niso-effort-to-standardise-altmetrics/

Klortho commented 11 years ago

Packaging was never in the scope of JATS, and for better or worse, I think, never will be. I can't see how this effort is throwing anything out. And, unlike altmetrics, there's a lot of prior work regarding packaging that could be drawn upon. I'm not the biggest fan of NISO, but maybe this effort will help.

ivangrub commented 11 years ago

Hey everyone,

We have open sourced refract (the NLM XML to Lens JSON converter).

https://github.com/elifesciences/refract

Please have a look and start playing around with it. To make it easiest to help out with issues, make development branches for each publisher type and I can help with the necessary tweaks.

Thanks!

IanMulvany commented 11 years ago

This is not a mailing list, this is a feature request. If people wish to keep discussing, I would recommend that we port over to a mailing list.

michael commented 11 years ago

You can now drag+drop any NLM file into Lens. You can self-host a Lens article by checking out

https://github.com/elifesciences/lens/tree/0.2.x/dist

and adjusting the index.html to your needs.

MikeTaylor commented 11 years ago

"You can now drag+drop any NLM file into Lens."

That sounds awesome. But I can't figure out how to do it. I went to http://lens.elifesciences.org/ and dragged a link to https://peerj.com/articles/36.xml from another window into the Lens one, but it just loaded the XML itself in place of Lens. How do I make this work?

michael commented 11 years ago

Use http://lens.substance.io for now.

On Wed, Sep 25, 2013 at 9:14 PM, Mike Taylor notifications@github.com wrote:

"You can now drag+drop any NLM file into Lens."

That sounds awesome. But I can't figure out how to do it. I went to http://lens.elifesciences.org/ and dragged a link to https://peerj.com/articles/36.xml from another window into the Lens one, but it just loaded the XML itself in place of Lens. How do I make this work?
