cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

Check uniqueness of IDs #8

Closed FredericBlum closed 2 years ago

FredericBlum commented 2 years ago

Many of the IDs (filenames, speakers) have one or both of the following problems:

a) Their IDs are not unique
b) They are referenced under different names in different tables (e.g. filenames with or without prefixes)

I need to go through the data and make sure that the IDs are unique and referenced with identical names.

@xrotwang Did I understand correctly that cldf.add_foreign_key adds a lookup for column A of table 1 against column B of table 2?

So this code maps the Language column of ValueTable against the ID column of LanguageTable, making that information available for retrieval when loading the CLDF metadata?

        cldf.add_foreign_key('ValueTable', 'Language', 'LanguageTable', 'ID')
        cldf.add_foreign_key('ValueTable', 'Filename', 'metadata.csv', 'Filename')
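If so, the declared keys should make lookups like the following possible when reading the dataset (a minimal sketch assuming the pycldf API; the metadata filename is a placeholder):

    from pycldf import Dataset

    # Load the dataset; the metadata filename is a placeholder.
    cldf = Dataset.from_metadata('cldf/Generic-metadata.json')

    # The first foreign key lets us resolve ValueTable.Language
    # against LanguageTable.ID.
    languages = {row['ID']: row for row in cldf.iter_rows('LanguageTable')}
    for value in cldf.iter_rows('ValueTable'):
        language = languages[value['Language']]
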
FredericBlum commented 2 years ago

I will probably need to add the glottocode and filename to each ph_ID and wd_ID in order to avoid any ambiguity. This should be easy through concatenation while iterating over the rows.
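Something like this (a rough sketch; the column names are assumptions):

    import csv

    # Rough sketch: make ph_IDs globally unique by prefixing glottocode
    # and filename. The column names are assumptions.
    with open('phones.csv', newline='', encoding='utf8') as f:
        rows = list(csv.DictReader(f))

    for row in rows:
        row['ph_ID'] = '{}_{}_{}'.format(
            row['Glottocode'], row['Filename'], row['ph_ID'])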

FredericBlum commented 2 years ago

Or would a numeric ID be preferred? E.g. incrementing counters ph += 1 and wd += 1.

xrotwang commented 2 years ago

I think numeric would be ok, but we should take care to create these numbers replicably, i.e. always sort things explicitly before looping.
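For example (a sketch; the sort keys are assumptions):

    import csv

    # Replicable numeric IDs: sort explicitly before looping, so that
    # repeated runs assign the same numbers to the same rows.
    with open('phones.csv', newline='', encoding='utf8') as f:
        rows = list(csv.DictReader(f))

    rows.sort(key=lambda r: (r['Filename'], float(r['start'])))
    for number, row in enumerate(rows, start=1):
        row['ph_ID'] = 'ph_{}'.format(number)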

FredericBlum commented 2 years ago

I have now added a sorted import of the files. However, I think it's probably still better to have the glottocode within the ID, so we don't relabel things from the ground up. Could you revise both the create_raw.py and the cldfbench script? I would also be curious about some general feedback on the code in the former, if you have the time.

xrotwang commented 2 years ago

Yep, will have a look later today.

xrotwang commented 2 years ago

@Tarotis I'll push a pull request later today, basically a bit of refactoring of your code.

Generally, what I'd like to do is

xrotwang commented 2 years ago

Oh, and I would like to add a MediaTable linking the lexical data to the audio files.
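In outline, that could look like this (a sketch assuming pycldf's component machinery; the Media_ID column name and the URL are placeholders):

    # Declare a MediaTable and link a custom words table to it.
    cldf.add_component('MediaTable')
    cldf.add_columns(
        'words.csv',
        {'name': 'Media_ID',
         'propertyUrl': 'http://cldf.clld.org/v1.0/terms.rdf#mediaReference'})
    cldf.add_foreign_key('words.csv', 'Media_ID', 'MediaTable', 'ID')

    cldf.write(MediaTable=[{
        'ID': 'file1',
        'Name': 'file1.wav',
        'Media_Type': 'audio/x-wav',
        'Download_URL': 'https://example.org/file1.wav',
    }])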

FredericBlum commented 2 years ago

If we were to merge phones.csv into words.csv, the following columns would each contain a list: start, end, ph, ph_ID, duration

Also, most analyses with DoReCo so far are phone-based. While extracting those lists is easy if you know how it is done, I fear that it might not be easy for users with less computational experience.
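For reference, the extraction could rely on a CSVW separator declared for the merged columns, so that pycldf parses each cell into a Python list (a sketch; the separator is an assumption):

    # If phones were merged into words.csv, list-valued columns could
    # declare a separator so that cells parse into Python lists.
    cldf.add_columns('words.csv', {'name': 'ph', 'separator': ' '})
    # A cell written as "a b c" is then read back as ['a', 'b', 'c'].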

MediaTable and ExampleTable sound interesting, but I am not too sure what they would look like. Is that something you intend to implement in the upcoming weeks?

xrotwang commented 2 years ago

Just checked: words.csv.zip is only 24 MB, so I guess we can keep words.csv and phones.csv mostly as they are now. And I agree that lists in multiple columns that relate to each other make things hard to understand and use.