IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

How to separate the context sentences and the main citation sentence? #8

Closed fishiu closed 2 years ago

fishiu commented 2 years ago

Hi,

This is an issue about the structure of the context.csv:

It seems that context.csv stores the context sentences and the main citation sentence together without any delimiter, but I want to run some experiments that require separating the sentences and encoding each one individually.

By the way, do all context strings include three sentences? What if the main citation sentence is the first or last sentence?

IllDepence commented 2 years ago

Hi,

yes, the sentences are saved without a separator in the context.csv shipped with the data set.

In general, you can use the extract_contexts.py script that comes with the data set to create your own custom context.csv exports (and e.g. specify the window size in terms of the number of preceding and succeeding sentences or words).

To extract contexts with separated sentences, you can make a small modification to line 161 of extract_contexts.py.
Replacing
return ' '.join(sentences)
with e.g.
return '<SEP>'.join(sentences)
or with any other separator token you want to specify, should do the job.
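
For illustration, here is a minimal sketch of how contexts joined this way could be split back into individual sentences on the consumer side. The helper names `join_context` and `split_context` are hypothetical and not part of extract_contexts.py, and the sketch assumes the chosen separator token never occurs inside a sentence.

```python
SEP = "<SEP>"  # assumed separator token; pick one that cannot occur in the text


def join_context(sentences, sep=SEP):
    """Join context sentences with an explicit separator (hypothetical helper)."""
    return sep.join(sentences)


def split_context(context, sep=SEP):
    """Recover the individual sentences from a joined context string."""
    return context.split(sep)


sentences = [
    "Earlier work studied this problem.",
    "We follow the approach of CIT.",
    "Our results differ in two ways.",
]
context = join_context(sentences)
recovered = split_context(context)
```

Each element of `recovered` can then be encoded on its own. Note that after splitting, the position of the citing sentence depends on how many preceding sentences were available.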

As for sentences at the beginning or end of a document, window sizes are smaller because there is no preceding/succeeding sentence. E.g. for a window of <sentence><citing_sentence><sentence> you would then get <citing_sentence><sentence> or <sentence><citing_sentence> respectively.
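
The boundary behaviour described above can be sketched roughly as follows. This is not the actual code from extract_contexts.py; the function `context_window` and its parameters `n_before` and `n_after` are hypothetical names used only to illustrate how a window shrinks at the start or end of a document.

```python
def context_window(sentences, cit_idx, n_before=1, n_after=1):
    """Return the citing sentence plus up to n_before/n_after neighbours.

    Hypothetical illustration: the window is clipped at the document
    boundaries, so fewer sentences are returned near the start or end.
    """
    start = max(0, cit_idx - n_before)
    end = min(len(sentences), cit_idx + n_after + 1)
    return sentences[start:end]


doc = ["First sentence.", "Citing sentence CIT.", "Last sentence."]

middle = context_window(doc, 1)  # full window of three sentences
first = context_window(doc, 0)   # no preceding sentence available
last = context_window(doc, 2)    # no succeeding sentence available
```

Because of this clipping, the citing sentence is not always the second element of a context, which is worth keeping in mind when splitting contexts by separator.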