Open CAYdenberg opened 8 years ago
Additionally, this article has no <body> tag at all, which causes Lens to die ungracefully. Maybe we could separate the selection of the figure elements and their assignment to nodes into different methods.
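To make the split concrete, here is a rough sketch of what I have in mind. The names `FIGURE_SELECTOR`, `selectFigureElements`, `assignFigureNodes` and the `bodyOnly` flag are all hypothetical, not existing Lens API, and the selector is only my guess at what counts as a figure-like element. It assumes `xmlDoc` is a DOM document that supports `querySelector` and `querySelectorAll`.

```javascript
// Hypothetical helpers, NOT part of the current Lens API.
// Assumed selector for figure-like elements in NLM XML.
var FIGURE_SELECTOR = "fig, table-wrap, supplementary-material";

// Step 1: selection only.
// With bodyOnly=true the search stays inside <body> when one exists (cheaper).
// Otherwise, or when <body> is missing, the whole document is searched, so
// figures outside <body> are found and a missing <body> does not crash.
function selectFigureElements(xmlDoc, bodyOnly) {
  var body = xmlDoc.querySelector("body");
  var scope = (bodyOnly && body) ? body : xmlDoc;
  return Array.from(scope.querySelectorAll(FIGURE_SELECTOR));
}

// Step 2: assignment only.
// `toNode` is a placeholder for whatever turns one element into a node.
function assignFigureNodes(elements, toNode) {
  return elements.map(function (el) {
    return toNode(el);
  });
}
```

A custom converter would then only need to override the selection step and could reuse the assignment step unchanged, which should remove most of the duplicated code.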
Actually I think what's needed is a dedicated parser for PubMed Central. Is this something worth committing to the main project or is it too specialized a use?
The test method should check that the file came from PMC and that it is actually an open-access paper, since PMC contains a mix of open-access and non-open-access articles.
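Something along these lines is what I am picturing for that check. It is only a sketch: `isOpenAccessPmcArticle` is a hypothetical name, and both heuristics are assumptions that would need checking against real PMC output (that the document URL is an efetch-style URL with `db=pmc`, and that an open-access article carries a `<license>` element inside `<permissions>`).

```javascript
// Hypothetical check, NOT part of the current Lens API.
function isOpenAccessPmcArticle(xmlDoc, documentUrl) {
  // Assumption: PMC documents are requested with a db=pmc query parameter.
  var fromPmc = /[?&]db=pmc(&|$)/.test(documentUrl || "");

  // Assumption: open-access articles include <permissions><license>...</license>.
  var license = xmlDoc.querySelector("permissions license");

  return fromPmc && Boolean(license);
}
```

The converter's test method could return the result of this function so that non-PMC or non-open-access documents fall through to another converter.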
The default Lens converter is unable to extract figures from the following document:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=4443854
because the figures are outside the <body> tag. The
NlmToLensConverter.extractFigures
method searches only within the body. I solved this in a custom converter by calling super and then separately parsing the elements I was looking for, but this involved a lot of duplicate code. Would you be interested in generalizing this by changing the method? The obvious trade-off would be performance.