PhonologicalCorpusTools / SLPAA


side script: counting x-slots #238

Closed kchall closed 10 months ago

kchall commented 10 months ago

It would be fantastic if we could have a little side script that goes through a set of .corpus files and reports the number of x-slots per sign, e.g. outputting the gloss and the number of x-slots in tab-delimited or csv format. In an ideal world, it would also include a column that indicates whether the sign is fingerspelled. This could be designed in a way that feeds into similar functionality in the GUI, but it doesn't have to at the moment -- we just need the counts for a separate project. :)

kvesik commented 10 months ago

@kchall a rudimentary x-slot counter is available via the "analysis functions (beta)" menu, on branch 238. Not pretty, but I think it does what you need it to! Let me know if you like it and I'll push it to main as well.

kchall commented 10 months ago

@kvesik This is fantastic, thank you! Would it be easy to modify it to have an option to append all of the results into a single .txt file instead of separate ones for each corpus? (Obviously we can do this manually post hoc, but if it's easy to add that to the script, that would be more efficient.)

kvesik commented 10 months ago

Done! And whether you record the results separately or combined, the first column in the results will be the name of the corpus (.slpaa filename). I'm going to push to main, but let me know if any of the wording should be refined for clarity (or any other requests) and I'll tweak as needed.
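As a rough illustration of the separate-versus-combined output described above (this is not the code from branch 238), the sketch below writes tab-delimited rows with the corpus filename in the first column. The `CORPORA` data, the column names, the output filenames, and the helper names `to_tsv`, `corpus_rows`, and `build_reports` are all assumptions made for the example; how x-slot counts are actually read from a corpus file is not shown.

```python
import csv
import io

# Hypothetical stand-in for already-loaded corpora:
# filename -> list of (gloss, x-slot count, fingerspelled flag).
CORPORA = {
    "corpus_a.slpaa": [("HELLO", 2, False), ("B-O-B", 3, True)],
    "corpus_b.slpaa": [("THANKS", 1, False)],
}

# Assumed column names; the first column is the corpus filename.
HEADER = ["corpus", "gloss", "xslots", "fingerspelled"]


def to_tsv(rows):
    """Serialize a header plus rows of strings as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


def corpus_rows(name, signs):
    """Build one output row per sign, prefixed with the corpus filename."""
    return [
        [name, gloss, str(count), "yes" if fingerspelled else "no"]
        for gloss, count, fingerspelled in signs
    ]


def build_reports(corpora, combined):
    """Return {output filename: text}: one combined report or one per corpus."""
    if combined:
        rows = [
            row
            for name, signs in corpora.items()
            for row in corpus_rows(name, signs)
        ]
        return {"all_corpora_xslots.txt": to_tsv(rows)}
    return {
        name + "_xslots.txt": to_tsv(corpus_rows(name, signs))
        for name, signs in corpora.items()
    }
```

Because every row carries the corpus filename, the combined report stays unambiguous even when two corpora contain the same gloss.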

kchall commented 10 months ago

@kvesik This looks perfect -- thank you!

kvesik commented 5 months ago

@kchall this will need to be updated once all of the new #105 (entry id, multi-glosses, lemma, id-gloss) stuff gets implemented. Originally the output had one sign per row, referencing both the name of the corpus file and the (single, required) gloss.

With all of the new types of identifiers available (and with gloss no longer being a unique identifier for a sign), I was considering including all of the following in the updated version of the output:

Is that good? Too much? Let me know what your preferences are.

kchall commented 5 months ago

Thanks, @kvesik! I don't know that we necessarily need to update this right away; it was for our own internal use, and in our cases, we do in fact just have single glosses per sign. And now that Grace is working on the main search functions, those should already incorporate a more flexible x-slot counter, so I think it's okay to just let this sit. But in general, yes, I think the elements you propose here are all fine.

kvesik commented 5 months ago

@kchall in branch 105 I had already added an extra column for entryid, and made sure that if there are multiple glosses, all of them are included in the output. I'll leave the update like that for now, and any further details can be sorted out with the more advanced search functionality. Thanks.
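For readers curious what such a row might look like, here is a minimal sketch, assuming the entryid goes in the second column and multiple glosses are joined with "; " in a single cell. The column order, the separator, and the function name `format_row` are assumptions for illustration, not the layout used on branch 105.

```python
def format_row(corpus_name, entry_id, glosses, xslots, gloss_sep="; "):
    """Format one tab-delimited output row for a sign.

    All glosses are kept, joined into a single cell with gloss_sep
    (an assumed separator), so a sign still occupies exactly one row.
    """
    cells = [corpus_name, str(entry_id), gloss_sep.join(glosses), str(xslots)]
    return "\t".join(cells)
```

Keeping one row per sign means the entryid, rather than the gloss, serves as the unique key when gloss values repeat.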