Paper publishing - Githubissues

wanghaisheng commented 6 years ago

Keywords: scientific publications

Scientific publications employ a conservative color palette and a limited amount of visual gimmickry. The publications are, for the most part, carefully written, and their layouts are designed by professionals working for different publishers. The typical tables in such publications have multiple header rows that are associated with the entire body of the table.
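A minimal sketch of how multiple header rows can be attached to every body column, assuming the table has already been read into a grid of strings. The function name `flatten_headers`, the sample headers, and the rule that an empty header cell continues the spanning header to its left are all assumptions made for illustration, not something taken from the quoted text.

```python
from typing import List


def flatten_headers(header_rows: List[List[str]]) -> List[str]:
    """Combine several header rows into one label per column.

    Assumption: an empty cell continues the spanning header to its left.
    """
    filled = []
    for row in header_rows:
        current = ""
        filled_row = []
        for cell in row:
            if cell.strip():
                current = cell.strip()
            filled_row.append(current)
        filled.append(filled_row)

    labels = []
    for column in zip(*filled):
        parts = []
        for part in column:
            # Skip empty parts and immediate repeats of the same header text.
            if part and (not parts or parts[-1] != part):
                parts.append(part)
        labels.append(" / ".join(parts))
    return labels


# Made-up two-row header of the kind found in clinical tables.
header_rows = [
    ["Characteristic", "Treatment", "", "Control", ""],
    ["", "Mean", "SD", "Mean", "SD"],
]
print(flatten_headers(header_rows))
```

Each body cell can then be looked up by its flattened column label, which keeps the link between a value and all of the header rows above it.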

https://github.com/elifesciences/sciencebeam-gym

wanghaisheng commented 6 years ago

Overview of related projects.

Meta Projects

Meta projects use multiple tools to produce their output (as does ScienceBeam itself).

PKP OTS

Input: Doc (Word)
Output: JATS XML
Scope: References
Activity: Active
License: GPL 3.0

Links:

Pandoc

Input: (m)any markup
Output: (m)any markup

Links:

project

Semantic Extraction Projects

GROBID (GeneRation Of BIbliographical Data)

Input: PDF
Output: TEI XML
Model: Hierarchical CRF (ML)
Scope: Full Text
Language: Java
Activity: Active
License: Apache 2.0

Links:

CERMINE (Content ExtRactor and MINEr)

Input: PDF
Output: TEI XML
Model: SVM (ML)
Scope: Full Text
Language: Java
Activity: Active
License: AGPL 3.0

Links:

GitHub

ContentMine pdf2xml

Input: SVG (PDF converted using https://bitbucket.org/petermr/pdf2svg)
Output: XML
Model: Rule based?
Scope: Full Text (focus on Tables?), a number of sub-projects
Language: Java
Activity: Active
License: Apache 2.0

Links:

Bitbucket

OCR++

Input: PDF
Output: TEI XML
Model: CRF (ML)
Scope: Full Text
Language: Python
Activity: ~2017
License:

Links:

PdfAct

Input: PDF
Output: XML
Model:
Scope: Paragraph?
Language: Java
Activity: Active
License: Apache 2.0

Links:

meTypeset

Input: Word .docx
Output: JATS XML (TEI as intermediate format)
Model: Rule based?
Scope: Full Text?
Language: XSLT
Activity: Active
License: GPL 2.0

Links:

GitHub

im2markup

Input: PDF
Output: LaTeX
Model: Computer Vision
Scope: Formulas
Language: Python
Activity: 2016
License: Apache 2.0

Links:

LA-PDFText

Input: PDF
Output: JATS? XML
Model: Rule based
Scope: Full Text
Language: Java
Activity:
License: GPL 3.0

Links:

PDFX

Input: PDF
Output: JATS XML
Model:
Scope: Full Text
Activity:
License:

Links:

"PDFX: Fully-automated PDF-to-XML Conversion of Scientific Literature" (2013)

ParsCit

Input: PDF
Output: JATS XML
Model: CRF
Scope: References
Activity: ~2013
License: LGPL 3.0

Links:

Crossref pdf-extract

Input: PDF
Output: XML
Model:
Scope: References, Regions
Activity: ~2015 (retired)
License: MIT

Links:

Low-level PDF Extraction

OCR

Resources

Semantic Extraction Resources

Reading Order Resources

Object Detection / Image Segmentation

More

wanghaisheng commented 6 years ago

Healthcare and medical literature

https://www.research.manchester.ac.uk/portal/files/34231319/FULL-TEXT.PDF
https://www.slideshare.net/nikolamilosevic86/extracting-patient-data-from-tables-in-clinical-literature

Extracting patient data from tables in clinical literature: Case study on extraction of BMI, weight and number of patients

Current biomedical text mining efforts are mostly focused on extracting information from the body of research articles. However, tables contain important information such as key characteristics of clinical trials. Here, we examine the feasibility of information extraction from tables. We focus on extracting data about clinical trial participants. We propose a rule-based method that decomposes tables into cell level structures and then extracts information from these structures. Our method performed with an F-measure of 83.3% for extraction of number of patients, 83.7% for extraction of patient’s body mass index and 57.75% for patient’s weight. These results are promising and show that information extraction from tables in biomedical literature is feasible.
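The two steps named in the abstract (decomposing a table into cell-level structures, then applying rules to those structures) can be sketched roughly as follows. This is not the authors' implementation: the `Cell` fields, the regular expressions, and the sample table are hypothetical and only show the general shape of such a method.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    value: str   # raw text of the data cell
    header: str  # column header above the cell
    stub: str    # row label taken from the first column


def decompose(table: List[List[str]]) -> List[Cell]:
    """Turn a simple grid (first row = headers, first column = stubs) into cells."""
    headers = table[0]
    cells = []
    for row in table[1:]:
        stub = row[0]
        for header, value in zip(headers[1:], row[1:]):
            cells.append(Cell(value=value, header=header, stub=stub))
    return cells


# Hypothetical rule: a row label that mentions patients, participants or
# subjects, or that is just "n", marks a row of counts.
COUNT_STUB = re.compile(r"\b(patients|participants|subjects)\b|^\s*n\s*$", re.IGNORECASE)
INTEGER = re.compile(r"^\s*(\d+)\b")


def number_of_patients(cell: Cell) -> Optional[int]:
    """Return the count held in a cell, or None if the rule does not apply."""
    if not COUNT_STUB.search(cell.stub):
        return None
    match = INTEGER.match(cell.value)
    return int(match.group(1)) if match else None


# Made-up table for illustration only.
table = [
    ["Characteristic", "Intervention", "Placebo"],
    ["Number of patients", "52", "49"],
    ["BMI (kg/m2)", "27.4 (3.1)", "26.9 (2.8)"],
]
for cell in decompose(table):
    count = number_of_patients(cell)
    if count is not None:
        print(cell.header, count)
```

Keeping the header and stub on every cell is what lets a rule be written against one cell at a time instead of against the whole table layout.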

Hybrid methodology for information extraction from tables in the biomedical literature
paper: https://s3.amazonaws.com/academia.edu.documents/46796285/BElBiPaper.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1529098358&Signature=n1FdoBZtamndufAQchtkUwC2DCk%3D&response-content-disposition=inline%3B%20filename%3DHybrid_methodology_for_information_extra.pdf

Abstract. Scientific literature, especially in the biomedical domain, is growing exponentially. Text mining can provide methods and tools that can help professionals to handle large amounts of literature. However, most of the current approaches focus on the textual body of the article, usually ignoring tables and figures. In this paper, we present a hybrid methodology that utilizes machine learning and a set of heuristic rules for information extraction from tables in literature. In a case study, the method achieved an F1-score of 83.94% for extracting the number of patients with the names of participant groups from clinical trial publications.
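For reference, the F1-score quoted in these abstracts is the harmonic mean of precision and recall. A small sketch of the arithmetic, using made-up counts that are not taken from either paper:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was matched."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 80 correct extractions, 10 spurious, 20 missed.
print(round(f1_score(true_positives=80, false_positives=10, false_negatives=20), 4))
```

How an extracted value is counted as correct (exact match, match including the participant group name, and so on) is defined by each paper's evaluation setup and is not modelled here.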

Towards Computational Extraction of Potential Drug-Drug Interaction Information from Drug Product Labeling Tables

http://faculty.dbmi.pitt.edu/cosbbi/cosbbi2016/AMIA%20Final%20Draft%20-%20031016-SD-JS-RDB.pdf

Structured Product Labels (SPLs) are mandated to provide information about potential drug-drug interactions (PDDIs). A major limitation of SPLs is that the information is provided as unstructured text and tables. Extracting, storing, processing, and annotating this information into an indexed knowledge base will allow increased accessibility to important prescribing information. In this paper we report on an analysis of the feasibility of automatically extracting PDDI information from tables found within the Drug Interactions section of SPLs. 1,161 SPLs (3.9% of prescription labeling) had a total of 1,530 tables. These tables had 340 headers that we grouped into 8 categories. Both functional and structural analyses, and MetaMap annotation, were completed for 50% of the 1,530 tables. We observed that most tables were diversely structured, and that the most frequent semantic types aligned well with the 8 table header categories. The results provide a starting point for developing heuristics to extract PDDIs.

Clinical information extraction applications: A literature review

https://www.sciencedirect.com/science/article/pii/S1532046417302563

Abstract
Background
With the rapid adoption of electronic health records (EHRs), it is desirable to harvest information and knowledge from EHRs to support automated systems at the point of care and to enable secondary use of EHRs for clinical and translational research. One critical component used to facilitate the secondary use of EHR data is the information extraction (IE) task, which automatically extracts and encodes clinical information from text.

Objectives
In this literature review, we present a review of recent published research on clinical information extraction (IE) applications.

Methods
A literature search was conducted for articles published from January 2009 to September 2016 based on Ovid MEDLINE In-Process & Other Non-Indexed Citations, Ovid MEDLINE, Ovid EMBASE, Scopus, Web of Science, and ACM Digital Library.

Results
A total of 1917 publications were identified for title and abstract screening. Of these publications, 263 articles were selected and discussed in this review in terms of publication venues and data sources, clinical IE tools, methods, and applications in the areas of disease- and drug-related studies, and clinical workflow optimizations.

Conclusions
Clinical IE has been used for a wide range of applications, however, there is a considerable gap between clinical studies using EHR data and studies using clinical IE. This study enabled us to gain a more concrete understanding of the gap and to provide potential solutions to bridge this gap.

wanghaisheng commented 6 years ago

paper: https://arxiv.org/pdf/1609.06423.pdf
code: https://github.com/ocrplusplus/ocrplusplus

OCR++: A Robust Framework For Information Extraction from Scholarly Articles

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text, table and figure headings, URLs and footnotes) and bibliography (citation instances and references). We analyze a diverse set of scientific articles written in English language to understand generic writing patterns and formulate rules to develop this hybrid framework. Extensive evaluations show that the proposed framework outperforms the existing state-of-the-art tools with huge margin in structural information extraction along with improved performance in metadata and bibliography extraction tasks, both in terms of accuracy (around 50% improvement) and processing time (around 52% improvement). A user experience study conducted with the help of 30 researchers reveals that the researchers found this system to be very helpful. As an additional objective, we discuss two novel use cases including automatically extracting links to public datasets from the proceedings, which would further accelerate the advancement in digital libraries. The result of the framework can be exported as a whole into structured TEI-encoded documents. Our framework is accessible online at http://www.cnergres.iitkgp.ac.in/OCR++/home
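Several tools in this thread emit TEI XML, so a downstream consumer mostly needs to walk that XML. The sketch below uses only the Python standard library. The TEI namespace is the standard one, but the sample document is hand-written and the exact element layout produced by OCR++ (or GROBID, CERMINE) is an assumption here and may differ.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the element layout below is an assumed, simplified one.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Article</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def summarize_tei(xml_text: str) -> dict:
    """Pull the article title and section headings out of a TEI document."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    heads = root.findall(".//tei:body/tei:div/tei:head", TEI_NS)
    return {"title": title, "section_headings": [h.text or "" for h in heads]}


print(summarize_tei(SAMPLE_TEI))
```

Real output from any of these tools should be inspected first, since nesting depth and the elements used for metadata vary between tools and versions.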

wanghaisheng commented 6 years ago

ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf

Abstract. This paper presents an objective comparative evaluation of layout analysis methods for scanned historical newspapers. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013), presenting the results of the evaluation of five submitted methods. Two state-of-the-art systems, one commercial and one open-source, are also evaluated for comparison. Two scenarios are reported in this paper, one evaluating the ability of methods to accurately segment regions and the other evaluating the whole pipeline of segmentation and region classification (with a text extraction goal). The results indicate that there is a convergence to a certain methodology with some variations in the approach. However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical newspapers.
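The competition defines its own evaluation methodology, which is not reproduced here. As a much simpler stand-in, region segmentation is often sanity-checked with intersection over union (IoU) between detected and ground-truth boxes. The sketch below assumes axis-aligned rectangles and a 0.5 threshold, both of which are illustrative choices rather than anything taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def matched_fraction(truth: List[Box], detected: List[Box], threshold: float = 0.5) -> float:
    """Fraction of ground-truth regions overlapped by some detection at IoU >= threshold."""
    if not truth:
        return 0.0
    hits = sum(1 for t in truth if any(iou(t, d) >= threshold for d in detected))
    return hits / len(truth)


# Made-up regions: the first detection fits well, the second covers half a column.
truth = [(0, 0, 100, 50), (0, 60, 100, 200)]
detected = [(0, 0, 100, 55), (0, 58, 50, 200)]
print(matched_fraction(truth, detected))
```

A single IoU threshold does not distinguish merged, split, or missed regions, which is one reason dedicated layout evaluation schemes exist.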

wanghaisheng commented 6 years ago

https://github.com/mcs07/ChemDataExtractor

Swain & Cole (2016) presented ChemDataExtractor, a method for extracting information about chemical entities from the literature that can process both text and tables. It handles only tables in which the data about a chemical entity sits in a single row, and it uses a rule-based parsing grammar tailored to extracting certain properties. Extracted data is mapped into a predefined data model. The overall results range from 85% to 92% F1-score across the various sub-tasks. However, no results have been reported for information extraction from tables alone. The methodology is also limited to this simple type of table, although it can extract information from XML, HTML and PDF documents.
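To make the "one entity per row, rule-based grammar, predefined data model" idea concrete, here is a rough standard-library sketch. It does not use ChemDataExtractor's API. The `MeltingPoint` model, the regular expression, and the sample rows are hypothetical and stand in for a much richer grammar.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MeltingPoint:
    """Hypothetical predefined data model for one extracted property."""
    compound: str
    value_celsius: float


# Hypothetical rule: a number followed by a degree-Celsius unit.
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*C")


def parse_row(headers: List[str], row: List[str]) -> Optional[MeltingPoint]:
    """Map one table row (first cell = compound name) onto the data model."""
    for header, cell in zip(headers[1:], row[1:]):
        if "melting point" in header.lower():
            match = TEMPERATURE.search(cell)
            if match:
                return MeltingPoint(compound=row[0].strip(),
                                    value_celsius=float(match.group(1)))
    return None


# Made-up rows with placeholder compound names.
headers = ["Compound", "Melting point"]
rows = [["Compound A", "122.5 °C"], ["Compound B", "n/a"]]
for row in rows:
    print(parse_row(headers, row))
```

A row-wise parser like this breaks as soon as an entity's data spans several rows or the unit lives in the header, which matches the limitation noted above.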

data-liberation / data-liberation-resources

Paper publishing #3