MorphDiv / TeDDi_sample

Text Data Diversity Sample (TeDDi Sample)

Split piraha file myp_nfi_1 into sep. files and changed folder structure #233

Closed by christianbentz 4 years ago

christianbentz commented 4 years ago

Separated myp_nfi_1 into separate files and added headers. Folder structure inside Piraha_myp also changed.

ximenina commented 4 years ago

I cannot merge this request since there was an error while running the DB: ValueError: Line contains tab, but doesn't match PBC or Grammar format

I think there is some problem in files from myp_con_2.txt to myp_con_9.txt. The other ones seem fine. @christianbentz

bambooforest commented 4 years ago

I think these issues will be a bit difficult to explain in detail until you've encountered them several times, @ximenina, which will make it hard for the data entry people to solve them.

The error is raised here:

https://github.com/uzling/100LC/blob/master/Database/clc/bodies.py#L20-L23

@tsamardzic -- we made a decision that "Hand-added grammars contain some number of annotation levels beginning with ", as noted in the comment in the code.

Because @christianbentz split a single file into many, the second file starts with "", the third with "", etc. Perhaps this is because your code reads in the files and puts them into one file? Each one looks like it's from a different source, so each file should start with "line_1", or "line_1.1" if you prefer, as long as it starts with "line_1".

This will fix the problem.
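A minimal sketch of a pre-merge check for this, not the actual loader code. It assumes each annotated line is a label, a tab, then the text, and that headers contain no tab; the function name `starts_at_line_1` is made up for illustration.

```python
def starts_at_line_1(lines):
    """Return True if the first tab-separated line is labelled line_1 (or line_1.x).

    Assumption: annotated lines look like "<label>\\t<text>" and header
    lines contain no tab. This is a plain prefix test, as described above,
    so a label such as "line_10" would also pass.
    """
    for line in lines:
        if "\t" not in line:
            continue  # skip headers and blank lines
        label = line.split("\t", 1)[0].strip()
        return label.startswith("line_1")
    return False  # no annotated line found at all


if __name__ == "__main__":
    good = ["# source: example header", "line_1\tsome text"]
    bad = ["# source: example header", "line_57\tsome text"]
    print(starts_at_line_1(good), starts_at_line_1(bad))
```

Running something like this over every file in the folder before opening the pull request would flag the split files whose numbering continues from the previous one.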

The alternative is to change the database loading code, but then that 1) goes against what we decided for this file format, and 2) may introduce even more bugs, since you allow more possibilities.

christianbentz commented 4 years ago

OK, this makes sense. I could actually change the line numbering manually; it is only a couple hundred lines overall. I will do this and give it another try.
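For reference, the renumbering could also be scripted. This is only a sketch under assumptions: labels are taken to have the form `line_N` or `line_N.M` followed by a tab, and `renumber` is a hypothetical helper, not part of the repository.

```python
import re

# Assumed label format: "line_<number>" with an optional ".<sub>" part,
# followed by a tab. Adjust the pattern if the real files differ.
LABEL = re.compile(r"^line_(\d+)(\.\d+)?\t")


def renumber(lines):
    """Shift labels so that the first labelled line becomes line_1.

    Lines without a matching label (headers, blanks) are passed through
    unchanged. Sub-numbers such as ".1" are kept as they are.
    """
    offset = None
    out = []
    for line in lines:
        m = LABEL.match(line)
        if m is None:
            out.append(line)
            continue
        n = int(m.group(1))
        if offset is None:
            offset = n - 1  # distance between the old first number and 1
        sub = m.group(2) or ""
        out.append(f"line_{n - offset}{sub}\t" + line[m.end():])
    return out


if __name__ == "__main__":
    for row in renumber(["# header", "line_57\tfoo", "line_58\tbar"]):
        print(row)
```

Keeping the computed offset per file would also make it easy to map the new numbers back to the numbering of the original single file.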

christianbentz commented 4 years ago

@ximenina Does it now work?

tsamardzic commented 4 years ago

@bambooforest @christianbentz The original corpus (input to my script) is a single file. I actually keep the same numbering as in the original. Changing the numbering will remove direct reference to the source, but this is not a problem, as the mappings can be reconstructed relatively easily. Also, we said that direct reference to the original is not our priority.

Ximena is probably on her way to the office. She should be available again after 15h.

bambooforest commented 4 years ago

It's one file on GitHub, but the data are from different sources, e.g. the first recording is from the early 1970s and was later transcribed by Everett. The second was recorded by Everett in 2000. The third is taken from a documentary, etc. I agree with @christianbentz that it makes sense to split them. The file and its numbering were probably generated by a script that read in the different sources and put them into the same format anyway. :)

ximenina commented 4 years ago

I ran the DB, no problem with these new files.