christianbentz closed this 4 years ago
I cannot merge this request since there was an error while running the DB:
ValueError: Line contains tab, but doesn't match PBC or Grammar format
I think there is some problem in files from myp_con_2.txt to myp_con_9.txt. The other ones seem fine. @christianbentz
I think these issues will be a bit difficult to explain in detail until you've encountered them several times, @ximenina , which will make it hard on the data entry people to solve the issues.
The error is raised here:
https://github.com/uzling/100LC/blob/master/Database/clc/bodies.py#L20-L23
@tsamardzic -- we made a decision that "Hand-added grammars contain some number of annotation levels beginning with
Because @christianbentz split a single file into many, the second file starts with "
This will fix the problem.
The alternative is to change the database loading code, but that 1) goes against what we decided for this file format, and 2) may introduce even more bugs, since it allows more possibilities.
Ok, this makes sense. I could actually change the line numbering manually, it is only a couple hundred lines overall. I will do this and give it another try.
@ximenina Does it now work?
@bambooforest @christianbentz The original corpus (input to my script) is a single file. I actually keep the same numbering as in the original. Changing the numbering will remove direct reference to the source, but this is not a problem, as the mappings can be relatively easily reconstructed. Also, we said that direct reference to the original is not our priority.
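For illustration, here is a minimal sketch of how a split file could be renumbered while keeping a mapping back to the original numbers. It rests on an assumption that is not confirmed anywhere in this thread: that each numbered line starts with an integer ID followed by a tab. The actual PBC and Grammar formats may differ, and `renumber_lines` is a hypothetical helper, not part of the clc loading code.

```python
import re

# Hypothetical line shape: "<integer ID><TAB><rest>". This is an assumption
# made for the sketch, not the documented PBC or Grammar format.
ID_LINE = re.compile(r"^(\d+)\t(.*)$", re.DOTALL)


def renumber_lines(lines, start=1):
    """Renumber leading integer IDs consecutively, beginning at `start`.

    Returns (new_lines, mapping), where mapping is new ID -> original ID.
    Lines that do not begin with "<digits><TAB>" pass through unchanged.
    Lines that share an original ID keep sharing a single new ID.
    """
    old_to_new = {}
    new_lines = []
    for line in lines:
        match = ID_LINE.match(line)
        if match is None:
            # Headers, blank lines, and anything else without an ID.
            new_lines.append(line)
            continue
        old_id, rest = match.groups()
        if old_id not in old_to_new:
            old_to_new[old_id] = start + len(old_to_new)
        new_lines.append(f"{old_to_new[old_id]}\t{rest}")
    # Invert so the original numbering can be looked up from the new one.
    mapping = {new: int(old) for old, new in old_to_new.items()}
    return new_lines, mapping


sample = ["# header\n", "101\tfirst\n", "101\tgloss\n", "102\tsecond\n"]
new_lines, mapping = renumber_lines(sample)
```

Storing `mapping` next to each split file would keep the reference to the source numbering recoverable without touching the database loading code.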
Ximena is probably on her way to the office. She should be available again after 15h.
It's one file on GitHub, but the data are from different sources, e.g. the first recording is from the early 1970s and was later transcribed by Everett. The second was recorded by Everett in 2000. The third was taken from a documentary, etc. I agree with @christianbentz that it makes sense to split them. The file and its numbering were probably generated by a script that read in the different sources and put them into the same format anyway. :)
I ran the DB, no problem with these new files.
Separated myp_nfi_1 into separate files and added headers. Folder structure inside Piraha_myp also changed.