MorphDiv / TeDDi_sample

Text Data Diversity Sample (TeDDi Sample)

UDHR formatting #1

Closed by bambooforest 4 years ago

bambooforest commented 4 years ago

Relevant Texts: All UDHR (Universal Declaration of Human Rights) translations

ToDo: Bring all of these into the same format. The UDHR is organized by articles, and this structure should be kept by using the line tag followed by a tab and then the text of the article. The whole of the preamble can follow a line tag . This formatting of the UDHR translations should probably be done automatically, since there are several dozen languages.
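
A rough sketch of what the automatic step could look like, in Python. Everything named here is an assumption for illustration: the tag strings `<preamble>` and `<article>` are hypothetical placeholders (the actual tag names are not fixed in this issue), `format_udhr` is not part of the existing parser, and the heading pattern only matches English-style "Article N" headings, so each language would need its own pattern.

```python
import re

# Hypothetical placeholder tags; the real tag names are not decided in this issue.
PREAMBLE_TAG = "<preamble>"
ARTICLE_TAG = "<article>"

# Assumption: article headings look like "Article 1" (English only).
HEADING = re.compile(r"^Article\s+\d+\.?$")


def format_udhr(raw_text, heading=HEADING,
                preamble_tag=PREAMBLE_TAG, article_tag=ARTICLE_TAG):
    """Return one 'tag<TAB>text' line for the preamble and one per article."""
    sections = [[]]  # sections[0] collects everything before the first article
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if heading.match(line):
            sections.append([])  # a heading starts a new article; the heading itself is dropped
        else:
            sections[-1].append(line)

    formatted = []
    for index, parts in enumerate(sections):
        if not parts:
            continue  # e.g. a translation without a preamble
        tag = preamble_tag if index == 0 else article_tag
        formatted.append(f"{tag}\t{' '.join(parts)}")
    return formatted


sample = (
    "Preamble\n"
    "Whereas recognition of the inherent dignity ...\n"
    "\n"
    "Article 1\n"
    "All human beings are born free and equal.\n"
    "They are endowed with reason and conscience.\n"
    "\n"
    "Article 2\n"
    "Everyone is entitled to all the rights and freedoms.\n"
)

for formatted_line in format_udhr(sample):
    print(formatted_line)
```

Each article ends up on a single line after its tag and a tab, so a parser that reads the file line by line still works without changes.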

bambooforest commented 4 years ago

@tsamardzic @christianbentz -- tagging this "for discussion". my parser treats UDHR as plain text and inserts the lines into the lines table. if we decide to mark this up with tags, then i will have to update the parser. i'd like to discuss whether or not we need these labels in the input data -- e.g. what will be their purpose? will they be used for searching? or for dropping data? at the moment i would keep it as is, and if we want tags, insert them in the XML output or the text output later.

bambooforest commented 4 years ago

closed by #180