Closed by bambooforest 3 years ago
@tsamardzic @avonizos -- i'd like to clean up my remaining issues in the next few weeks. :)
now that the database is relatively stable, do we want to keep the reports that work via python on the raw data files or the reports that work directly on the database (i.e. via R)?
i'm ok with either, but i think the latter are easier to maintain in the future. if we go with the latter, i would remove the python scripts from the repo and go with the database/R code. and close this issue.
I would keep a Python script that can screen the text files in the repo. This script should NOT be used for extracting any counts for analyses, but only as a means of tracking progress at the pre-database stage. It should output for each language:

* the number of non-empty folders
* the average number of files per non-empty folder
* the average number of lines per file

I guess Olga can update the script (probably next week):

* leave out (for later) all the tokenisation
* calculate the two average numbers
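Roughly, the per-language stats described above (non-empty folders, average files per folder, average lines per file) could be sketched like this. This is a hypothetical sketch, not the repo's actual script; the `folder_stats` name and the one-folder-per-language, `*.txt`-files layout are assumptions:

```python
# Sketch only: assumes corpus_root contains one subfolder per language,
# each holding .txt files (possibly nested). Not the actual progress.py.
from pathlib import Path

def folder_stats(corpus_root):
    """Non-empty folders, average files per folder, average lines per file."""
    root = Path(corpus_root)
    folders = [d for d in root.iterdir() if d.is_dir() and any(d.rglob("*.txt"))]
    files = [f for d in folders for f in d.rglob("*.txt")]
    line_counts = [sum(1 for _ in f.open(encoding="utf-8")) for f in files]
    return {
        "non_empty_folders": len(folders),
        "avg_files_per_folder": len(files) / len(folders) if folders else 0.0,
        "avg_lines_per_file": sum(line_counts) / len(files) if files else 0.0,
    }
```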
ok, i'll leave this up to @avonizos to work on. please send me a PR for review when you're finished.
> I would keep a Python script that can screen the text files in the repo. This script should NOT be used for extracting any counts for analyses, but only as a means of tracking progress at the pre-database stage. It should output for each language:
>
> * the number of non-empty folders
> * average number of files per non-empty folder
> * average number of lines per file
>
> I guess Olga can update the script (probably next week):
>
> * leave out (for later) all the tokenisation
> * calculate the two average numbers
Sounds good to me. I'll leave the tokenisation counts out of the repo and will try to improve line_counts.py from issue #127 a bit.
@avonizos -- are we ok after #196 to close this?
Wait a bit please, I want to check the folder "progress" and remove some files from there. I should be done with this by this evening.
@avonizos - I'd like to close this issue sooner rather than later. I will update the counts report and use the RData file instead of the SQLite object, for consistency's sake across the reports. I will also remove the comparison report, since it doesn't seem to matter. Please let me know when you've finished your part, so we can close this issue.
Hi Steve! Don't you receive emails to your UZH account? I was done with the issue 2 weeks ago and asked you for help with pushing the changes, since I had problems with creating a PR. Please tell me which email you use now, so that I can resend you that email (including the changed files). I'm a bit afraid to create a messy PR now, so I thought it would be better if you push the changes, and then I'll create a new fork.
Hmm, sorry, I must have missed it. Another reason why I think it's better to discuss these issues/changes on the issue tracker in general, since it's the record and institutional memory of the repository. Usually you just have to pull in the upstream changes and push them to your branch before doing the PR. That can of course be problematic for these long-standing issues. I'll dig up the email.
Doh, I accidentally closed the browser without saving the detailed report in this issue. Here's what I can reconstruct from memory.
Two issues:
First, `Abkhaz_abk`: https://github.com/uzling/100LC/blob/master/Corpus/Abkhaz_abk/professional/abk_pro_1.txt has one file with a whitespace-delimited writing system (Cyrillic), yet the counts differ by nearly 10% between the simple and advanced tokenization. Does advanced deal with punctuation somehow? Does it drop numerals (this is a UDHR file)? I think whatever the reason(s), unless it's a bug, this should be made more explicit in the readme:
https://github.com/uzling/100LC/tree/master/Reports/progress/README.md
which does say that "the numbers are less reliable for Burmese, Chinese, Japanese, Korean and Thai" -- which is totally understandable: those are the writing systems where I would expect a high delta between simple tokenization (i.e. whitespace tokens, as I understand it) and advanced tokenization (using NLP methods).
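I don't know what "advanced" actually does in the script, but punctuation and numeral handling alone can produce differences of roughly this size even in a whitespace-delimited text. A toy illustration (made-up function names and sample line, not the actual progress.py logic):

```python
# Illustration only: two plausible token-counting conventions that diverge
# on the same whitespace-delimited line. Not the repo's actual code.
import re

def simple_count(text):
    # plain whitespace-delimited tokens
    return len(text.split())

def stripped_count(text):
    # strip leading/trailing punctuation, then drop standalone numerals
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in text.split()]
    return sum(1 for t in tokens if t and not t.isdigit())

line = "Статья 1. Все люди рождаются свободными."
print(simple_count(line), stripped_count(line))  # 6 vs 5: "1." is dropped
```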
In the same vein, I'm not sure what the token counts in this report mean. For example, the reports give counts of 2650 and 2860, respectively, for `Abkhaz_abk`. But as far as I can tell, there's no test of whether or not `progress.py` is counting correctly. For example, if I take the text of https://github.com/uzling/100LC/blob/master/Corpus/Abkhaz_abk/professional/abk_pro_1.txt and replace `\s+` with `\n`, I get 1325 rows, i.e. tokens (including numerals and words with punctuation attached). This is exactly half of what is reported (just noticed that). In the database, I also get 1325 rows:

`SELECT * from word where line_id < 92`
I suggest always including some tests in the report code to make sure the output is correct, e.g. counting some of the files by hand and checking that the code dumps the right counts. Note also: #127.
Second, `progress.py` is generating token counts for languages that have multiple writing systems and collapsing them into one count. For example, there are two UDHR texts (in the same folder) for Vietnamese, but one is in Latin and the other in Hangul (a different issue is that there are two versions of Hangul in the ISO writing-system codes, i.e. `Kore` for mixed and `Hani`: https://en.wikipedia.org/wiki/ISO_15924). My gut feeling is that we don't want token counts per language root folder then, but per corpus type / writing system? Or do we only care about tokens across writing systems of the same "language"?
Languages with multiple writing systems in the corpus include (so far):
SELECT corpus_id, ISO_6393, writing_system from file group by ISO_6393, writing_system order by corpus_id
cmn | Hans
cmn | Hant
hin | Deva
hin | Latn
khk | Latn
khk | Cyrl
kor | Hang
kor | Kore
vie | Latn
vie | Hani
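If we do want counts per (language, writing system) rather than per language folder, something like the following could run against the database. This is a sketch under assumptions: only the columns shown in the queries above are known, so the `word.file_id -> file.id` join is a guess at the real schema:

```python
# Sketch: token counts grouped by language and writing system.
# Assumes a `file` table (id, ISO_6393, writing_system) and a `word` table
# whose rows reference files via word.file_id -- adjust to the real schema.
import sqlite3

def tokens_per_writing_system(con):
    """Return (ISO_6393, writing_system, token_count) rows."""
    return con.execute(
        """SELECT f.ISO_6393, f.writing_system, COUNT(*) AS tokens
           FROM word AS w JOIN file AS f ON w.file_id = f.id
           GROUP BY f.ISO_6393, f.writing_system
           ORDER BY f.ISO_6393, f.writing_system"""
    ).fetchall()
```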