Figure extraction

CAYdenberg commented 8 years ago

The default Lens converter is unable to extract figures from the following document:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=4443854

because the figures are outside the <body> tag. The NlmToLensConverter.extractFigures method searches only within the body.

I solved this in a custom converter by calling super and then parsing further for what I was looking for, but this involved a lot of duplicated code. I was wondering if you'd be interested in generalizing this method. The obvious trade-off would be performance.
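To illustrate the difference between the two search scopes, here is a minimal sketch in Python using the standard library's ElementTree rather than Lens's own JavaScript converter. The sample XML is made up, and placing the figure inside <floats-group> is an assumption about where such figures might live; the point is only that a body-scoped query returns nothing while a document-wide query finds the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the figure sits in <floats-group>, outside <body>.
SAMPLE = """<article>
  <front><article-meta/></front>
  <body><sec><p>See Figure 1.</p></sec></body>
  <floats-group>
    <fig id="F1"><label>Figure 1</label></fig>
  </floats-group>
</article>"""

root = ET.fromstring(SAMPLE)

# Body-scoped search: misses figures stored outside <body>.
body = root.find("body")
in_body = body.findall(".//fig") if body is not None else []

# Document-wide search: finds every <fig> regardless of its parent.
anywhere = root.findall(".//fig")

print(len(in_body), len(anywhere))
```

The document-wide query has to walk the whole tree, which is where the performance trade-off would come from.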

CAYdenberg commented 8 years ago

Additionally, this article has no <body> tag at all, which causes Lens to die ungracefully. Maybe we could separate the selection of the figure elements and their assignment to nodes into different methods.
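A rough sketch of that split, again in Python with ElementTree and not the actual Lens API: the function names select_figures and build_figure_nodes are hypothetical, and the "node" is just a plain dict. A subclass would then only need to override the selection step, and a document with no <body> simply yields whatever figures exist elsewhere (or an empty list).

```python
import xml.etree.ElementTree as ET

def select_figures(article):
    """Selection step (hypothetical name): every <fig>, wherever it lives."""
    return list(article.iter("fig"))

def build_figure_nodes(fig_elements):
    """Assignment step (hypothetical name): turn <fig> elements into dict nodes."""
    nodes = []
    for index, fig in enumerate(fig_elements, start=1):
        nodes.append({
            # Fall back to a generated id when the element has none.
            "id": fig.get("id") or f"figure_{index}",
            "label": (fig.findtext("label") or "").strip(),
        })
    return nodes

def extract_figures(xml_text):
    """Compose the two steps; never assumes a <body> element exists."""
    article = ET.fromstring(xml_text)
    return build_figure_nodes(select_figures(article))

# Made-up document with no <body> at all.
NO_BODY = (
    "<article><front/><floats-group>"
    "<fig id='F1'><label>Figure 1</label></fig>"
    "</floats-group></article>"
)
print(extract_figures(NO_BODY))
```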

CAYdenberg commented 8 years ago

Actually, I think what's needed is a dedicated parser for PubMed Central. Is this something worth committing to the main project, or is it too specialized a use case?

The test method should check that the file came from PMC and that it's actually an open-access paper, since PMC contains a mix.
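As a sketch of what those two checks might look like (Python with ElementTree, not the Lens converter API), under two loudly stated assumptions: that a PMC record carries an <article-id> whose pub-id-type is "pmc" or "pmcid", and that an open-access record carries a <license> element inside <permissions>. Neither heuristic has been verified against real PMC output here, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def looks_like_pmc(article):
    """Assumption: PMC records have an <article-id> typed "pmc" or "pmcid"."""
    return any(
        el.get("pub-id-type") in ("pmc", "pmcid")
        for el in article.iter("article-id")
    )

def looks_open_access(article):
    """Assumption: open-access records declare a <license> in <permissions>."""
    return any(
        perms.find("license") is not None
        for perms in article.iter("permissions")
    )

def accepts(xml_text):
    """Hypothetical stand-in for a converter's test method."""
    article = ET.fromstring(xml_text)
    return looks_like_pmc(article) and looks_open_access(article)

# Made-up records for illustration only.
OPEN = """<article><front><article-meta>
  <article-id pub-id-type="pmc">0000000</article-id>
  <permissions><license><license-p>Example license text.</license-p></license></permissions>
</article-meta></front></article>"""

CLOSED = """<article><front><article-meta>
  <article-id pub-id-type="pmc">0000000</article-id>
</article-meta></front></article>"""

print(accepts(OPEN), accepts(CLOSED))
```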

elifesciences / lens

Figure extraction #140