MorphDiv / TeDDi_sample

Text Data Diversity Sample (TeDDi Sample)

Issue with progress report #135

Closed: bambooforest closed this issue 3 years ago

bambooforest commented 4 years ago

Doh, I accidentally closed the browser without saving the detailed report in this issue. Here's what I can reconstruct from memory.

Two issues:

  1. It's not clear to me why some of the languages like Abkhaz_abk:

https://github.com/uzling/100LC/blob/master/Corpus/Abkhaz_abk/professional/abk_pro_1.txt

which has one file in a whitespace-delimited writing system (Cyrillic), have different counts (nearly a 10% difference) between the simple and advanced tokenizations. Does advanced deal with punctuation somehow? Does it drop numerals (this is a UDHR file)? Whatever the reason(s), unless it's a bug, this should be made more explicit in the readme:

https://github.com/uzling/100LC/tree/master/Reports/progress/README.md

which does say that "the numbers are less reliable for Burmese, Chinese, Japanese, Korean and Thai" -- which is totally understandable -- those are the writing systems for which I would expect a high delta between simple (i.e. whitespace tokens, as I understand it) and advanced (NLP-based) tokenization.

In the same vein, I'm not sure what the token counts mean in this report. For example, the report gives counts of 2650 and 2860, respectively, for Abkhaz_abk:

But as far as I can tell, there's no test of whether or not the progress.py file is counting correctly. For example, if I take the text of:

https://github.com/uzling/100LC/blob/master/Corpus/Abkhaz_abk/professional/abk_pro_1.txt

and replace \s+ with \n, I get 1325 rows, i.e. tokens (including numerals and words with punctuation attached). This is exactly half of what is reported (I just noticed that).

In the database, I also get 1325 rows:

```sql
SELECT * from word where line_id < 92
```

I suggest always having some tests in the report code to make sure the output is correct, e.g. counting tokens in a few files by hand and checking that the code produces the same counts. Note also: #127.
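
[Editorial illustration, not code from the repo: a minimal pytest-style check along these lines, assuming the whitespace-token definition above and the hand-verified count of 1325 for abk_pro_1.txt mentioned in this comment.]

```python
# Hypothetical sanity check, not part of progress.py: verify that the
# script's notion of "simple" tokens matches a hand-verified count for
# one file (1325 whitespace-delimited tokens in abk_pro_1.txt, per the
# \s+ -> \n experiment described above).
import re

def simple_token_count(path):
    """Count whitespace-delimited tokens, punctuation and numerals included."""
    with open(path, encoding="utf-8") as f:
        return len(re.findall(r"\S+", f.read()))

def test_abkhaz_udhr_token_count():
    assert simple_token_count(
        "Corpus/Abkhaz_abk/professional/abk_pro_1.txt") == 1325
```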

  2. progress.py is generating token counts for languages that have multiple writing systems and collapsing them into one count, e.g.:

These are two UDHR texts (in the same folder) for Vietnamese, but one is in the Latin script and the other in Han characters (a different issue is that there are two relevant versions in the ISO writing system codes, i.e. Kore for mixed script and Hani for Han: https://en.wikipedia.org/wiki/ISO_15924).

My gut feeling is that we don't want token counts per language root folder, then, but rather per corpus type / writing system. Or do we only care about tokens across writing systems of the same "language"?

Languages with multiple writing systems in the corpus include (so far):

```sql
SELECT corpus_id, ISO_6393, writing_system from file group by ISO_6393, writing_system order by corpus_id
```

```
cmn | Hans
cmn | Hant
hin | Deva
hin | Latn
khk | Latn
khk | Cyrl
kor | Hang
kor | Kore
vie | Latn
vie | Hani
```
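
[Editorial sketch of how per-writing-system token counts could be pulled straight from the database. The word.line_id, file.ISO_6393 and file.writing_system columns appear in the queries above; the line table, its file_id column, and the database file name are assumptions about the schema, not confirmed by this thread.]

```python
# Hypothetical query for token counts per language AND writing system,
# rather than per language root folder. Schema details beyond the columns
# quoted in this issue are guesses; adjust names to the actual database.
import sqlite3

conn = sqlite3.connect("100LC.sqlite")  # assumed database file name
query = """
    SELECT f.ISO_6393, f.writing_system, COUNT(*) AS n_tokens
    FROM word w
    JOIN line l ON w.line_id = l.id   -- assumed link: word -> line
    JOIN file f ON l.file_id = f.id   -- assumed link: line -> file
    GROUP BY f.ISO_6393, f.writing_system
    ORDER BY f.ISO_6393
"""
for iso, script, n_tokens in conn.execute(query):
    print(f"{iso} | {script} | {n_tokens}")
```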

bambooforest commented 4 years ago

@tsamardzic @avonizos -- i'd like to clean up my remaining issues in the next few weeks. :)

now that the database is relatively stable, do we want to keep the reports that work via python on the raw data files or the reports that work directly on the database (i.e. via R)?

i'm ok with either, but i think the latter are easier to maintain in the future. if we go with the latter, i would remove the python scripts from the repo and go with the database/R code. and close this issue.

tsamardzic commented 4 years ago

I would keep a Python script that can screen the text files in the repo. This script should NOT be used for extracting any counts for analyses, but only as a means of tracking progress at the pre-database stage. It should output for each language:

* the number of non-empty folders

* average number of files per non-empty folder

* average number of lines per file

I guess Olga can update the script (probably next week):

* leave out (for later) all the tokenisation

* calculate the two average numbers

bambooforest commented 4 years ago

ok, i'll leave this up to @avonizos to work on. please send me a PR for review when you're finished.

olgapelloni commented 4 years ago

> I would keep a Python script that can screen the text files in the repo. This script should NOT be used for extracting any counts for analyses, but only as a means of tracking progress at the pre-database stage. It should output for each language:
>
> * the number of non-empty folders
>
> * average number of files per non-empty folder
>
> * average number of lines per file
>
> I guess Olga can update the script (probably next week):
>
> * leave out (for later) all the tokenisation
>
> * calculate the two average numbers

Sounds good to me. I'll leave the tokenisation counts out of the repo and will try to improve the line_counts.py from issue #127 a bit.
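
[Editorial sketch of what such a screening script could look like; this is an illustration, not the actual line_counts.py. It assumes the Corpus/<Language>/<subfolder>/*.txt layout visible in the links above.]

```python
# Hypothetical pre-database progress screen: per language, report the
# number of non-empty folders, the average number of files per non-empty
# folder, and the average number of lines per file, as specified above.
from pathlib import Path

def screen_corpus(corpus_root="Corpus"):
    for lang_dir in sorted(Path(corpus_root).iterdir()):
        if not lang_dir.is_dir():
            continue
        # A subfolder counts as non-empty if it holds at least one .txt file.
        folders = [d for d in lang_dir.iterdir()
                   if d.is_dir() and any(d.glob("*.txt"))]
        if not folders:
            print(f"{lang_dir.name}: no non-empty folders")
            continue
        files = [f for d in folders for f in d.glob("*.txt")]
        line_counts = [len(f.read_text(encoding="utf-8").splitlines())
                       for f in files]
        print(f"{lang_dir.name}: {len(folders)} non-empty folders, "
              f"{len(files) / len(folders):.1f} files per folder, "
              f"{sum(line_counts) / len(files):.1f} lines per file")

if __name__ == "__main__":
    screen_corpus()
```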

bambooforest commented 4 years ago

@avonizos -- are we ok after #196 to close this?

olgapelloni commented 4 years ago

> @avonizos -- are we ok after #196 to close this?

Wait a bit please, I want to check the "progress" folder and remove some files from there. I should be done with this by this evening.

bambooforest commented 4 years ago

@avonizos - I'd like to close this issue sooner rather than later. I will update the counts report and use the RData file instead of the SQLite object for consistency's sake across the reports. I will also remove the comparison report, since it doesn't seem to matter. Please let me know when you've finished your part, so we can close this issue.

olgapelloni commented 4 years ago

> @avonizos - I'd like to close this issue sooner rather than later. I will update the counts report and use the RData file instead of the SQLite object for consistency's sake across the reports. I will also remove the comparison report, since it doesn't seem to matter. Please let me know when you've finished your part, so we can close this issue.

Hi Steve! Are you not receiving emails at your UZH account? I was done with the issue two weeks ago and asked you for help with pushing the changes; I had problems creating a PR. Please tell me which email you use now, so that I can resend that email (including the changed files). I'm a bit afraid of creating a messy PR now, so I thought it would be better if you push the changes, and then I'll create a new fork.

bambooforest commented 4 years ago

Hmm, sorry, I must have missed it. This is another reason why I think it's better to discuss these issues/changes on the issue tracker in general, since it's the record and institutional memory of the repository. Usually you just have to pull in the upstream changes and push them to your branch before doing the PR; that can of course be problematic for these long-standing, months-long issues. I'll dig up the email.
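
[For reference, one common way to do that update. The remote name "upstream" and branch name "my-branch" are placeholders, not names from this repo.]

```sh
# Bring a fork's branch up to date with upstream before opening a PR.
# "upstream" and "my-branch" are placeholder names for this sketch.
git fetch upstream              # get the latest upstream commits
git checkout my-branch          # switch to the work branch
git merge upstream/master       # or: git rebase upstream/master
git push origin my-branch       # update the fork so the PR applies cleanly
```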