IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

The error in paper structure #19

Open Ma-Yongqiang opened 1 year ago

Ma-Yongqiang commented 1 year ago

This dataset is very helpful for NLP research in the scientific domain.

When I checked the parsed paper structure, I found some errors. For the paper "2212.00253" in this dataset, the subsection "Deep Reinforcement Learning" actually belongs to Section 2. However, the parsed result places it in Section 1.

image

The section information in the PDF file: image

The reason might be that the Section 2 heading, "BACKGROUND", has no paragraph of its own directly beneath it, so the heading is lost when the TeX file is processed.
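To make this hypothesis concrete, here is a minimal sketch of one way a parser could keep a heading even when no paragraph sits directly under it. This is an illustration only, not unarXive's actual code: the event tuples and the `attach_heading_paths` function are hypothetical names made up for this example.

```python
def attach_heading_paths(events):
    """Attach the full heading path to every paragraph.

    `events` is a hypothetical stream of tuples:
      ("heading", level, title)  or  ("para", text)
    A heading stays on the stack until a heading of the same or a
    higher level replaces it, so a heading with no paragraph of its
    own still appears in the path of its subsections.
    """
    stack = []   # currently open headings as (level, title)
    result = []
    for event in events:
        if event[0] == "heading":
            _, level, title = event
            # Close headings at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        else:
            result.append({"text": event[1],
                           "path": [title for _, title in stack]})
    return result


events = [
    ("heading", 1, "Introduction"),
    ("para", "intro text"),
    ("heading", 1, "Background"),  # no paragraph directly below
    ("heading", 2, "Deep Reinforcement Learning"),
    ("para", "subsection text"),
]
paragraphs = attach_heading_paths(events)
```

With this approach the second paragraph carries the path "Background" followed by "Deep Reinforcement Learning", rather than being attributed to the previous section.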

IllDepence commented 1 year ago

Thank you for the input.

I took a look at the LaTeX source of the paper and saw that section 1 is created using a template-specific setup: \IEEEraisesectionheading{\section{Introduction}}

I could imagine that this trips up the LaTeX parsing.
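One possible workaround, sketched here under the assumption that the wrapper is what confuses the parser, is to strip \IEEEraisesectionheading{...} before parsing so that only the plain \section command remains. The helper name `unwrap_raised_headings` is hypothetical and not part of unarXive. The brace matching is deliberately simple and does not handle escaped braces such as \{ inside a title.

```python
import re

# Matches the opening of the IEEE template wrapper, up to its "{".
_WRAPPER = re.compile(r"\\IEEEraisesectionheading\s*\{")


def unwrap_raised_headings(tex: str) -> str:
    r"""Replace each \IEEEraisesectionheading{X} in `tex` with X."""
    pieces = []
    pos = 0
    for match in _WRAPPER.finditer(tex):
        if match.start() < pos:
            continue  # already consumed as part of an earlier wrapper
        # Walk forward to the brace that closes the wrapper.
        depth = 1
        i = match.end()
        while i < len(tex) and depth > 0:
            if tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: leave the rest untouched
        pieces.append(tex[pos:match.start()])
        pieces.append(tex[match.end():i - 1])  # wrapper argument only
        pos = i
    pieces.append(tex[pos:])
    return "".join(pieces)


source = r"\IEEEraisesectionheading{\section{Introduction}} Some text."
cleaned = unwrap_raised_headings(source)
```

Here `cleaned` starts with the plain \section{Introduction} command, which a generic LaTeX parser is more likely to recognise as a section heading.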

Are the paragraphs that follow continuously numbered (1, 2, 3, ...) or does it stay 1 throughout the paper?

Ma-Yongqiang commented 1 year ago

The following paragraphs are continuously numbered (1, 2, 3, ...).

2212.00253.json.txt