Closed FredericBlum closed 2 years ago
I will probably need to add glottocode and filename to each ph_ID and wd_ID in order to avoid any ambiguity. Should be easy through concatenation while iterating through the rows.
Or would a numeric ID be preferred? E.g. ph_+=1 and wd_+=1
I think numeric would be OK, but we should take care to create these numbers reproducibly, i.e. always sort things explicitly before looping.
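A minimal sketch of what "sort explicitly before looping" could look like. The function name, the tuple layout, and the sample glottocodes and filenames are made up for illustration; this is not the code in create_raw.py.

```python
def assign_numeric_ids(keys, prefix):
    """Map each key to a numeric ID that does not depend on input order.

    `keys` are hypothetical (glottocode, filename, local_id) tuples.
    Sorting before enumerating makes the numbering reproducible.
    """
    ids = {}
    for counter, key in enumerate(sorted(set(keys)), start=1):
        ids[key] = f"{prefix}{counter}"
    return ids


# Placeholder rows, deliberately given in unsorted order
rows = [
    ("efgh5678", "file_b", "p2"),
    ("abcd1234", "file_a", "p1"),
    ("efgh5678", "file_b", "p1"),
]
ph_ids = assign_numeric_ids(rows, "ph_")
```

Running it on the same rows in any order should give the same mapping, which is the property we care about here.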
I have now added a sorted import of the files. However, I think it's probably still better to have the glottocode within the ID, so we don't have to relabel everything from scratch. Could you review both the create_raw.py and the cldfbench script? I would also be curious about some general feedback on the code in the former, if you have the time.
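For reference, the concatenation variant could look roughly like this. It is only a sketch: the helper name, the separator, and the sample values are assumptions, and the actual column names in the raw data may differ.

```python
def make_unique_id(glottocode, filename, local_id, sep="_"):
    """Concatenate glottocode, filename and the per-file ID into one string."""
    return sep.join([glottocode, filename, local_id])


# Placeholder rows: the same local ph_ID occurs in two different files
rows = [
    {"glottocode": "abcd1234", "filename": "file_a", "ph_ID": "p1"},
    {"glottocode": "abcd1234", "filename": "file_b", "ph_ID": "p1"},
]
for row in rows:
    row["ph_ID"] = make_unique_id(row["glottocode"], row["filename"], row["ph_ID"])
```

The IDs stay readable this way, and a row keeps its ID even if other files are added or removed later.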
Yep, will have a look later today.
@Tarotis I'll push a pull request later today, basically a bit of refactoring of your code.
Generally, what I'd like to do is
- add an ExampleTable, i.e. aggregating the DoReCo data on sentence level into (ideally glossed) IGT sentences
- merge phones.csv into words.csv, by adding a list-valued phones column to words (thereby getting rid of phones.csv, which is rather large in terms of rows and makes processing slow)

Oh, and I would like to add a MediaTable linking the lexical data to the audio files.
If we were to merge phones.csv into words.csv, the following columns would each contain a list: start, end, ph, ph_ID, duration
Also, most analyses with DoReCo so far are phone-based. While extracting those lists is easy if you know how it is done, I fear that it might not be easy for users with less computational experience.
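To make the concern concrete, this is roughly what a user would have to do to get back to one row per phone from list-valued columns. The space separator, the wd_ID column name, and the sample row are assumptions for illustration only.

```python
import csv
import io

LIST_COLUMNS = ["start", "end", "ph", "ph_ID", "duration"]


def explode_phones(word_row, sep=" "):
    """Yield one dict per phone from a word row with list-valued columns."""
    lists = [word_row[col].split(sep) for col in LIST_COLUMNS]
    # All list columns must line up, otherwise the row is inconsistent
    if len({len(values) for values in lists}) != 1:
        raise ValueError(f"list columns differ in length for {word_row['wd_ID']}")
    for values in zip(*lists):
        phone = dict(zip(LIST_COLUMNS, values))
        phone["wd_ID"] = word_row["wd_ID"]
        yield phone


# Hypothetical merged words.csv with one word consisting of two phones
merged = io.StringIO(
    "wd_ID,start,end,ph,ph_ID,duration\n"
    "w1,0.00 0.10,0.10 0.25,k a,p1 p2,0.10 0.15\n"
)
phones = [p for row in csv.DictReader(merged) for p in explode_phones(row)]
```

Not hard, but every user would have to keep five parallel lists aligned, which is the part I would expect to go wrong.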
MediaTable and ExampleTable sound interesting, but I am not too sure how they would look. Is that something you intend to implement in the upcoming weeks?
Just checked: words.csv.zip is just 24MB, so I guess we can keep words.csv and phones.csv mostly as they are now. And I agree that lists in multiple columns that relate to each other make things hard to understand and use.
Many of the IDs (filenames, speakers) have one or both of the following problems:
a) They are not unique
b) They are referenced with different names in different tables (e.g. filenames with or without prefixes)
I need to go through the data and make sure that the IDs are unique and referenced with identical names.
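A small sketch of how both problems could be checked. The function names, the column names (ID, File_ID), and the sample rows are hypothetical; the real tables would be read from the CSV files.

```python
from collections import Counter


def duplicate_ids(rows, id_column="ID"):
    """Return the ID values that occur more than once (problem a)."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_references(rows, ref_column, valid_ids):
    """Return referenced values that do not exist in the target table (problem b)."""
    return sorted({row[ref_column] for row in rows} - set(valid_ids))


# Placeholder data: one duplicated file ID, one reference without prefix
files = [
    {"ID": "doreco_abcd1234_f1"},
    {"ID": "doreco_abcd1234_f1"},
    {"ID": "doreco_abcd1234_f2"},
]
words = [
    {"ID": "w1", "File_ID": "doreco_abcd1234_f1"},
    {"ID": "w2", "File_ID": "abcd1234_f2"},
]
dupes = duplicate_ids(files)
dangling = dangling_references(words, "File_ID", [f["ID"] for f in files])
```

Both lists should be empty once the IDs are cleaned up.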
@xrotwang Did I understand correctly that cldf.add_foreign_key adds a lookup for column A of table 1 against column B of table 2?
So this code maps the Language of ValueTable against the ID of LanguageTable, making the information from this file available for retrieval when loading the CLDF metadata?
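To spell out how I picture it, here is a plain-Python sketch of what such a lookup would amount to. This only illustrates the idea of a foreign key; it does not call pycldf and makes no claim about what add_foreign_key does internally. The column names and rows are placeholders.

```python
def resolve_foreign_key(rows, fk_column, target_rows, pk_column="ID"):
    """Pair each row with the target row its foreign key points to (or None)."""
    index = {row[pk_column]: row for row in target_rows}
    return [(row, index.get(row[fk_column])) for row in rows]


# Placeholder tables standing in for LanguageTable and ValueTable
languages = [{"ID": "abcd1234", "Name": "Example language"}]
values = [
    {"ID": "v1", "Language_ID": "abcd1234"},
    {"ID": "v2", "Language_ID": "zzzz0000"},  # no matching language
]
resolved = resolve_foreign_key(values, "Language_ID", languages)
```

If that picture is right, a value whose language reference has no match (like v2 here) is exactly what a validation step should flag.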