dbmi-pitt / dbmi-annotator

Based on annotator.js: an annotation framework that supports user accounts, annotation permission management, and a templating annotation plugin for the biomedical domain.
Apache License 2.0

Review pdf2xml, crossref and CCC RightsLink approaches to providing full text for text mining #202

Open rkboyce opened 7 years ago

rkboyce commented 7 years ago

There has been considerable progress in the publishing community toward supporting text mining of full-text articles. We need to consider how these developments are relevant to the future of the current NLM R01 and to further enhancements to AnnotationPress. Here are some things to pay attention to:

1) Crossref provides an API (https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md) that is oriented towards helping identify the rights for full text and even the location of PDF or XML documents: https://www.youtube.com/watch?v=LBYgq6jPoyk&feature=youtu.be. There is some important background info on Crossref here: https://www.youtube.com/watch?v=YPCRfNFJgj8

2) RightFind is the Copyright Clearance Center's new solution for helping researchers find XML versions of full text for text mining purposes, along with knowledge of the rights they have to work with those documents: https://www.youtube.com/watch?v=-gUhAkwZbVQ

3) pdf2xml seems to be an approach highly preferred by the text mining community for working with PDF content. We need to think about how annotations created in AnnotationPress on PDF documents can be translated to the equivalent XML versions of those documents, because that will be very useful for text miners.
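
To make item 1 more concrete, here is a minimal sketch of how the Crossref REST API could be queried for rights and full-text locations. It assumes the public `https://api.crossref.org/works/{doi}` endpoint and the `license` and `link` fields of a work record. The function names `fetch_work` and `full_text_options` are hypothetical, and publishers do not always deposit this metadata, so either list may come back empty.

```python
import json
import urllib.parse
import urllib.request

# Public Crossref REST API endpoint for a single work (assumed base URL).
CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_work(doi):
    """Fetch the Crossref metadata record for a DOI (needs network access)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        # The record itself sits under the "message" key of the response.
        return json.load(resp)["message"]


def full_text_options(work):
    """Summarize license URLs and text-mining links found in a work record."""
    licenses = [lic.get("URL") for lic in work.get("license", [])]
    links = [
        {
            "url": link.get("URL"),
            "content_type": link.get("content-type", "unspecified"),
        }
        for link in work.get("link", [])
        # Keep only links the publisher flagged for text mining.
        if link.get("intended-application") == "text-mining"
    ]
    return {"licenses": licenses, "links": links}
```

Calling `full_text_options(fetch_work(doi))` for a DOI of interest would show whether a license and an XML or PDF location have been deposited. Whether we are actually entitled to download and mine that content still depends on the license terms and our subscriptions.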