Open davanstrien opened 2 years ago
Some additional information on the annotation format is available here: https://fedora.clarin-d.uni-saarland.de/rsc/annotation.html
@davanstrien, there seems to be a problem with the data:
</page>
<page no="page_0002" id="1070">
hands NNS hand hands 2.09 0.65 1.12 1.50
. SENT . . 2.39 0.29 3.09 4.43
</s>
The `<s>` tags are supposed to be inside `<page>` tags, but as you can see above, the last line is a closing `</s>` tag without an opening one inside the same page. As a result, both `xml` and `bs4` fail to parse it properly.
I changed it to look like this:
our PP$ our our 6.09 0.40 5.53 4.47
hands NNS hand hands 2.09 0.65 1.12 1.50
. SENT . . 2.39 0.29 3.09 4.43
</s>
</page>
<page no="page_0002" id="1070">
With that change, it parses that part correctly.
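To make the failure concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The wrapping `<text>` root, the opening `<s>` tag, and the first page's attributes (`page_0001`, `1069`) are assumptions added so that the excerpts form complete documents; they are not taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Sentence opens on one page and closes on the next (mis-nested tags).
# The <text> root and the first <page> attributes are hypothetical.
BROKEN = """<text>
<page no="page_0001" id="1069">
<s>
our PP$ our our 6.09 0.40 5.53 4.47
</page>
<page no="page_0002" id="1070">
hands NNS hand hands 2.09 0.65 1.12 1.50
. SENT . . 2.39 0.29 3.09 4.43
</s>
</page>
</text>"""

# Same tokens, but the page break is moved to after the closing </s>.
FIXED = """<text>
<page no="page_0001" id="1069">
<s>
our PP$ our our 6.09 0.40 5.53 4.47
hands NNS hand hands 2.09 0.65 1.12 1.50
. SENT . . 2.39 0.29 3.09 4.43
</s>
</page>
<page no="page_0002" id="1070">
</page>
</text>"""


def parses(doc: str) -> bool:
    """Return True if the string is well-formed XML, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print("broken parses:", parses(BROKEN))
print("fixed parses:", parses(FIXED))
```

A strict XML parser rejects the first variant because `</page>` arrives while `<s>` is still open, which is why moving the page break makes the excerpt parse.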
I'll take a quick look at this today. One option might be to use a slightly cruder approach to parsing. I'll play around a bit and let you know how I get on with that.
I was considering using a stack, but in the case of malformed data, the sentences would be wrong.
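For what it's worth, a rough sketch of what a cruder, line-based approach could look like is below. It is only one possible reading of the idea, not what was eventually implemented: it ignores nesting entirely, treats `<s>` and `</s>` as sentence boundaries, and assigns each sentence to the page on which its first token appears. The function name `iter_sentences`, the output dictionary keys, and the first page's attributes in the sample are all hypothetical.

```python
import re
from typing import Iterable, Iterator

# Matches an opening <page ...> tag and captures its id attribute.
PAGE_OPEN = re.compile(r'<page\b[^>]*\bid="([^"]*)"')


def iter_sentences(lines: Iterable[str]) -> Iterator[dict]:
    """Yield sentences from tag-per-line text without requiring valid nesting."""
    page_id = None      # id of the most recently opened page
    start_page = None   # page on which the current sentence began
    tokens = []         # annotation rows collected for the current sentence
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = PAGE_OPEN.match(line)
        if match:
            # A page break does not end the sentence; just track the page.
            page_id = match.group(1)
        elif line in ("<s>", "</s>"):
            # Either tag is a sentence boundary: flush whatever was collected.
            if tokens:
                yield {"page_id": start_page, "tokens": tokens}
            tokens = []
        elif line.startswith("<"):
            # Any other tag (e.g. </page>) is ignored.
            continue
        else:
            if not tokens:
                start_page = page_id
            tokens.append(line.split())
    if tokens:
        # Flush a trailing sentence that was never closed.
        yield {"page_id": start_page, "tokens": tokens}


# Sample with a sentence spanning a page break; first page is hypothetical.
SAMPLE = """<page no="page_0001" id="1069">
<s>
our PP$ our our 6.09 0.40 5.53 4.47
</page>
<page no="page_0002" id="1070">
hands NNS hand hands 2.09 0.65 1.12 1.50
. SENT . . 2.39 0.29 3.09 4.43
</s>
</page>"""

for sentence in iter_sentences(SAMPLE.splitlines()):
    print(sentence["page_id"], [row[0] for row in sentence["tokens"]])
```

Because no stack is kept, a sentence that straddles a page break stays in one piece. The trade-off is that a genuinely missing `<s>` or `</s>` would silently merge or split sentences rather than raise an error.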
A URL for this dataset
https://fedora.clarin-d.uni-saarland.de/rsc/
Dataset description
This is an interesting dataset of text from the scientific domain spanning a long time period (1665-1869). Additionally, the dataset contains a range of annotations:
Dataset modality
Text
Dataset licence
Creative Commons Attribution Non Commercial Share Alike 4.0 International
Other licence
No response
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
Contact details for data custodian
j.knappen@mx.uni-saarland.de