Import frequencies from Corpus NGT into frequency fields

ocrasborn commented 9 years ago

Frequency of annotation_idgloss in EAF files of Corpus NGT, in SVN repo at /scratch2/svnrepos/signlang/corpus-ngt/
Overall frequency and frequency per region, plus for each the number of different Participants for which the gloss is found
Overlapping glosses on the two hands (tiers GlossR and GlossL) for a Participant with the same value should only be counted once (i.e.: two-handed signs)
Frequencies should be regularly updated (overnight?)
See #56: overall frequency fields still need to be added to the model

ocrasborn commented 9 years ago

If this is not already done in the process: the table with frequencies should also be stored as a text file in a place where outsiders cannot see it, but where we can access it through the terminal.

Woseseltops commented 8 years ago

I think the tool svn cat can be used to read the contents of files in the svn repo. http://stackoverflow.com/questions/10366800/read-content-of-a-file-off-a-remote-svn-repo-without-checking-it-out-locally

@ocrasborn can you show me around this corpus-ngt repo some time, and show me what should be read from what file?

Woseseltops commented 8 years ago

Shortly after writing the above comment, I read the comments in #69 where you ask for a checkout of the repository instead of the repo itself, so forget above svn cat. For this issue, we'll assume the contents of the corpus-ngt repo is readable like a normal a directory.

ocrasborn commented 8 years ago

@Woseseltops , @vanlummelhuizen once wrote perl scripts to count signs based on glosses, maybe it would be quickest if he chips in here, as he can read and write EAF structure with his eyes closed by now.

vanlummelhuizen commented 8 years ago

@ocrasborn I remember I uploaded some scripts to the RU sign language wiki. Is the script you refer to there? If so, perhaps you could reinstate my account. Otherwise I will dive into my archive.

ocrasborn commented 8 years ago

We're kind of phasing out the wiki, so let's not bother. Here's the description from the wiki that @richardbank wrote. The perl script is attached below. Also attached: the table with metadata that specifies for each participant from which region s/he comes.

-0-0-0-0-0-0-0- In ELAN kun je alleen glossen tellen per L- of R-tier, en geen gebaren. Dit script loopt per gebaarder beide Glos-tiers af en kijkt of de inhoud van een glos op de ene tier gelijk is aan die van een glos op de andere tier (gelijktijdig wijzen naar twee locaties wordt helaas dus opgevat als één PT-gebaar).

Het script neemt 1 argument, en dat is een tekstbestand met een lijst van alle annotatiebestanden die je wilt tellen, inclusief het volledige pad. Dus: D:\Corpus-NGT\eaf\CNGT0000-CNGT0099\CNGT0001.eaf D:\Corpus-NGT\eaf\CNGT0000-CNGT0099\CNGT0002.eaf ...etc.

Het script zelf roep je dan als volgt aan: perl signCounter.pl eafs.txt > result.txt

waarbij eafs.txt de lijst met eaf-files is en result.txt de output. Deze output is een tabgescheiden bestand dat je verder in Excel kunt sorteren en optellen, en bevat de tellingen per glos. De output op het scherm geeft het totaal aantal voorkomens (#tokens) van het totaal aantal gebaren (#types), en hoeveel daarvan maar een keer voorkomen (#singletons). -0-0-0-0-0-0-0-

signCounter.pl.zip CNGT_MetadataEnglish_OtherResearchers.xlsx.zip

ocrasborn commented 8 years ago

...waarbij ik me nog afvraag hoeveel overlap nodig is om een gebaar als tweehandig te beschouwen (in plaats van eerst het gebaar op de rechterhand, daarna hetzelfde gebaar met de linkerhand -- wat ook voorkomt). D.w.z., ik weet niet meer hoe het huidige script rekent, en het staat ook niet in de beschrijving. Wat mij betreft: een of meer frames overlap (min. 40 millisec.) = 1 gebaar.

vanlummelhuizen commented 8 years ago

@ocrasborn The script now only detects overlap, not the amount of overlap. I will try to implement te 40 ms overlap.

vanlummelhuizen commented 8 years ago

I have made some improvements/adjustments to the script. It now outputs the frequencies per region and the amount of overlap is now a command line argument. The participant metadatafile is a tab separated values file I will attach here (see the attached zip).

signCounter_v2.pl <file with list of EAF files> <participant metadatafile> <minimal overlap in milliseconds>

Output is given in a tab separated values format: Gloss, overall frequency, region, frequency for that region, region, frequency for that region etc.

A summary, number of tokens, number of types and number of singletons, are outputted to stderr.

This script needs some testing. @ocrasborn, do you have some data to do that? signCounter_v2.zip

vanlummelhuizen commented 8 years ago

Rewrote the script in Python. @ocrasborn Heads up: annotation values in an sequence of subsequently overlapping (>= 40 ms) annotations are turned in a unique set of values. E.g. if there is one long PALM-UP on the left hand and there are 2 PALM-UPs on the right hand, only one PALM-UP ends up in the statistics. Note that this is also the case if e.g. there are two PALM-UPs on the left hand that are 'connected' by another annotation on the right hand by overlapping the two PALM-UPs. I remember discussing this, but since that was some time ago and you handed me the script you are using now (right?), I thought you should know. signCounter.py.zip

vanlummelhuizen commented 8 years ago

@Woseseltops This is the script I mentioned to you before in person: https://github.com/vanlummelhuizen/CNGT-scripts/blob/master/python/signCounter.py

It uses a metadata file I will send you by email (cause I am not sure this may be viewed by others).

Usage: signCounter.py -m <metadata file> -o <mimimum overlap> <file|directory ...>

The output is in JSON:

{
    "#L": {
        "frequenciesPerRegion": [
            {
                "Amsterdam": 1
            }
        ], 
        "frequency": 1, 
        "numberOfSigners": 1
    }, 
    "#O": {
        "frequenciesPerRegion": [
            {
                "Amsterdam": 1
            }
        ], 
        "frequency": 1, 
        "numberOfSigners": 1
    }, 
    ...
}

Woseseltops commented 8 years ago

Already added a number of prerequisites, but couldn't get the script imported... seems like an Apache caching problem. I believe the webservers are rebooted every night, so let's see if it will recognize the CNGT-scripts repo tomorrow.

Woseseltops commented 8 years ago

Okay, I lost a morning to the fact that the script by @vanlummelhuizen didn't want to be imported by Apache, but I think it works! The problem was a misleading error message: Python said the module was not there (which made me think it was looking at the wrong place, or was caching stuff), but in reality Apache had no permissions to look there. Fixing the permissions fixed the problem.

I then wrote a view that interprets the result of the script, and saves it into the Signbank database. This, takes a minute or two, but don't let this stop you from trying.

http://signbank.science.ru.nl/dictionary/update_cngt_counts/

Questions/remarks:

While working with it, I had to make two small changes to the script of @vanlummelhuizen. I put these in a pull request because pull requests are cool: https://github.com/vanlummelhuizen/CNGT-scripts/pull/1
@vanlummelhuizen, the result per region has this format [{'regionA':1},{'regionB':2}]. Why not {'regionA':1, 'regionB':2}? I translate from the former to the latter before processing anyway.
As can be seen in the link, quite a number of glosses are not in Signbank. Is this expected?
As EAF directory, i picked eaf/CNGT0000-CNGT0099. Which one should I really use? All of them?

To do:

I'm not sure whether the CNGT repo in use is kept up to date.
I think view should be visited automatically every night, right?

vanlummelhuizen commented 8 years ago

@Woseseltops Answers to your questions/remarks (in order of appearance):

I gave you a pull request back right away. Because they are apparently cool. (I messed up one comment message along the way, so if you stumble upon identical messages, blame me)
I changed the structure of the result per region.
I would say so. @ocrasborn will know for sure.
I cannot imagine you shouldn't use all of them.

ocrasborn commented 8 years ago

@Woseseltops @vanlummelhuizen : indeed, it is expected that not all glosses appear in the corpus. (incidentally, @vanlummelhuizen , this is why we changed cngt.ecv to ngt.ecv at some point). Although the Corpus NGT is our largest dataset, it's not the only one, and NGT Signbank should ideally include all the words of the language, whatever their source.

ocrasborn commented 8 years ago

Something goes wrong in the script, @Woseseltops / @vanlummelhuizen : the number of signers is counted overall, but not per region. The fields "Number of signers in [Amsterdam, Rotterdam, ...]" are always empty now.

Woseseltops commented 8 years ago

I changed the structure of the result per region.

Okay, I pulled your changes and updated my script

I cannot imagine you shouldn't use all of them.

Okay, the view now gets all EAF files. Because this takes quite a few minutes, I've added a functionality where you can select which folder you want to process. For example:

http://signbank.science.ru.nl/dictionary/update_cngt_counts/6

will only do the 6th folder. When no number is present, all folders will be processed.

indeed, it is expected that not all glosses appear in the corpus

I think you answered the opposite question, @ocrasborn :). My question is, not all glosses found by the script of @vanlummelhuizen are also in Signbank. The output lists these glosses; click the link above for example.

the number of signers is counted overall, but not per region

Note: previously, I only processed the first EAF folder. Now that I've processed all folders, more fields are filled. Still, only the regional 'Number of occurrences' are filled, not the regional 'Number of signers'. If I'm not mistaken, the script does not output these?

Questions for @ocrasborn:

How often should the CNGT repo be updated?
How often should the counts in Signbank be updated?

Question for @vanlummelhuizen

Is it correct that the script does not output the number of of regional signers (as opposed to occurrences?)
The script crashes at the file CNGT0300-CNGT0399/CNGT0311.eaf, and possibly more, so I now have added an error catching line to your script that skips problem files. Could you have a look?

ocrasborn commented 8 years ago

The answer to the actual question, @Woseseltops , is also affirmative: there are many gloss annotations in the corpus that are not chosen from the ECV. (Main reason: we manually add diacritics to ECV values to mark false starts, uncertainty about an annotation, etc. Further: fingerspelling (#WESSEL) is used a lot and is not lexicalised, but because it is manual activity, we annotate it on the gloss tier. The answers to the new questions to me:

Once a night suffices for updating the CNGT repo.
Idem.

Woseseltops commented 8 years ago

The answer to the actual question, @Woseseltops , is also affirmative

Okay, in that case I'll stop worrying about it :)

Once a night suffices for updating the CNGT repo.

Turns out that was already happening! For completeness: this is about the checkout of the CNGT repo in the writable area accessable by the Signbank code, to be used for Signbank-related purposes.

Idem.

Alright, Applejack will now visit https://signbank.science.ru.nl/dictionary/update_cngt_counts/ every night. Please leave this issue open so I remember to check back whether its working.

ocrasborn commented 8 years ago

Right @Woseseltops , the updating (from changes we manually committed to the repo) goes fine. But what was not yet working is the committing of changes made by the server script that @vanlummelhuizen wrote to the repo, so that we can see those when we update our local copies in the morning. Are you still following? :-) (Micha's script made sure that the ECV links are applied to annotations that don't yet have them, and that changes in the ECV are applied to the annotation files.)

vanlummelhuizen commented 8 years ago

@Woseseltops I added number of signers per regions:

{
    "AANBELLEN": {
        "frequenciesPerRegion": {
            "Amsterdam": {
                "frequency": 4, 
                "numberOfSigners": 4
            }, 
            "Groningen": {
                "frequency": 5, 
                "numberOfSigners": 5
            }, 
            "Mixed": {
                "frequency": 1, 
                "numberOfSigners": 1
            }, 
            "Other": {
                "frequency": 2, 
                "numberOfSigners": 1
            }
        }, 
        "frequency": 12, 
        "numberOfSigners": 11
    }, 
    "AANBIEDEN": {
        "frequenciesPerRegion": {
            "Amsterdam": {
                "frequency": 4, 
                "numberOfSigners": 2
            }, 
            "Groningen": {
                "frequency": 13, 
                "numberOfSigners": 4
            }, 
            "Mixed": {
                "frequency": 2, 
                "numberOfSigners": 2
            }
        }, 
        "frequency": 19, 
        "numberOfSigners": 8
    }, 
...
}

Woseseltops commented 8 years ago

Please leave this issue open so I remember to check back whether its working.

Hmm... logs were written to a non-existing location :(. I fixed it now, please keep open so I remember to check later.

Are you still following? :-)

I think I do. There is no action point for me in your message, right?

I added number of signers per regions:

Looks good @vanlummelhuizen ! Before I pull, have you also looked at my second question? The run function now looks like this, to skip problem eaf files:

 def run(self):
     if len(self.all_files) > 0:
         for f in self.all_files:
            try:
                self.process_file(f)
                self.generate_result()
            except KeyError:
                continue
     else:
         print("No EAF files to process.", file=sys.stderr)

vanlummelhuizen commented 8 years ago

@Woseseltops I forgot to tell you I tested it but the script did not crash at any point in my svn checkout of the corpus. And I did not see any changes you made to the script. So I incorporated your changes from your comment.

vanlummelhuizen commented 8 years ago

@Woseseltops So, I forgot I tested it and solved the problems and since then, no problems occurred. Now I have made the except clause a bit more informative so that if something goes wrong we (hopefully) know where to look.

Woseseltops commented 8 years ago

PS I also updated the view inside Signbank to now save the frequencies found by your script as part of the gloss.

ocrasborn commented 8 years ago

@Woseseltops , @vanlummelhuizen : I don't see the signer count yet for the regions, is that right? Or is there still an error somewhere?

ocrasborn commented 8 years ago

@Woseseltops : I get the following error every day by mail from the cron job: "/bin/sh: 1: Syntax error: Unterminated quoted string". Message subject: "curl https://signbank.science.ru.nl/dictionary/update_cngt_counts/ > "/scratch/signbank-logs/update_signbank_counts-"date + "data +"%d-%m-%Y"` 2>&1"

ocrasborn commented 7 years ago

@vanlummelhuizen seems to have solved it now; works. (Although I don't get a daily mail anymore, supposedly because there's nothing irregular/bad to report.)

Signbank / Global-signbank

Import frequencies from Corpus NGT into frequency fields #55