IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

Is there any efficient way to retrieve the OpenAlex label in the IMRaD set? #18

Closed SVLwoof closed 1 year ago

SVLwoof commented 1 year ago

Hi there, I'm interested in classifying the IMRaD dataset and using the OpenAlex ID for each of the entries, like the ones in citrec. Is there any efficient way to retrieve the label for each of them?

Thanks 🙂

IllDepence commented 1 year ago

Hi, thanks for the question. :)

I’m not sure I understand what information you’re looking for. Maybe it helps if I lay out my understanding of things, and then you can comment on that.

IDs of cited documents
In the citrec data, the OpenAlex IDs given are IDs of cited documents. More specifically, each ID belongs to the cited document that corresponds to the single marker specified in that sample (see data set card). Example: if paper A contains the paragraph “We compare our model to B [1] and C [2]”, then the citrec data would contain two samples with that paragraph: one for marker “[1]” with the OpenAlex ID of paper B, and one for marker “[2]” with the OpenAlex ID of paper C. In the IMRaD data, a sample does not have a specified citation marker (many paragraphs will contain no citation markers at all), so there is no equivalent to the OpenAlex IDs in the citrec data.
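
To make the example above concrete, here is a minimal Python sketch of how one paragraph turns into one sample per citation marker. It is only an illustration: the function `make_citrec_samples`, the field names `text`, `marker`, and `cited_openalex_id`, and the placeholder IDs are all assumptions made for this sketch, not the actual citrec schema (see the data set card for that).

```python
# Illustrative sketch only. Field names and IDs below are hypothetical
# placeholders and may not match the real citrec data.

def make_citrec_samples(paragraph, marker_to_openalex):
    """Return one sample per citation marker that occurs in the paragraph."""
    samples = []
    for marker, openalex_id in marker_to_openalex.items():
        if marker in paragraph:
            samples.append(
                {
                    "text": paragraph,                 # same paragraph in every sample
                    "marker": marker,                  # the single marker this sample is about
                    "cited_openalex_id": openalex_id,  # ID of the *cited* paper, not of paper A
                }
            )
    return samples


paragraph = "We compare our model to B [1] and C [2]"
cited = {"[1]": "OPENALEX_ID_OF_B", "[2]": "OPENALEX_ID_OF_C"}  # placeholder IDs

samples = make_citrec_samples(paragraph, cited)

# A paragraph without citation markers (common in IMRaD) yields no samples,
# which is why IMRaD has no equivalent ID field.
no_marker_samples = make_citrec_samples("We describe our method.", cited)
```

Here `samples` holds two entries that share the same text but differ in marker and cited ID, while `no_marker_samples` is empty.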

IDs of “source documents”
If you’re interested in where the paragraphs originate from, both the citrec and the IMRaD data come with a file license_info.jsonl (see here and here), which shows, for each paragraph, the arXiv document it was extracted from. These “source documents” are specified by their arXiv ID.
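
If it helps, a lookup from paragraph to source document could be built from license_info.jsonl roughly as sketched below. This assumes the file holds one JSON object per line; the key names `paragraph_id` and `arxiv_id` and the sample values are placeholders chosen for this sketch, so please check the actual keys in the file before using it.

```python
import json


def parse_license_info(lines, key_field="paragraph_id", source_field="arxiv_id"):
    """Map each paragraph key to the arXiv ID of its source document.

    `lines` is any iterable of JSONL lines (an open file works too).
    `key_field` and `source_field` are assumed names, not confirmed ones.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        mapping[record[key_field]] = record[source_field]
    return mapping


# In-memory stand-in for the file, with made-up placeholder values.
example_lines = [
    '{"paragraph_id": "para-0001", "arxiv_id": "0000.00001"}',
    "",
    '{"paragraph_id": "para-0002", "arxiv_id": "0000.00002"}',
]
source_of = parse_license_info(example_lines)
```

With the real file, the same function can be called as `parse_license_info(open("license_info.jsonl", encoding="utf-8"))`, passing the actual key names as `key_field` and `source_field`.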

SVLwoof commented 1 year ago

Thanks for the reply 🙃 It means I got it wrong. What I'd initially thought is that the label in citrec is tied to the source text, and not to the cited documents.