Open zacateras opened 4 years ago
Very nice indeed! Just a small thing I noticed regarding sentence length. If I understand correctly, sentences are binned into 15 bins at equally sized intervals between 0 and the longest sentence in the treebank. This means that bin widths can vary widely between treebanks and, for treebanks that have a few extremely long sentences, most sentences fall into the first or second bin. I think it would be more useful here to use constant sizes, say, 1-5, 6-10, ..., 76+, given that most sentences in most languages are less than 50 words long.
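To make the difference concrete, here is a minimal Python sketch of the two binning schemes described above. The function names and the sample lengths are made up for illustration, and this is not how the dashboard itself computes its bins.

```python
from collections import Counter


def equal_width_bin(length, max_length, n_bins=15):
    """Index of the equal-width bin for `length` on the range 1..max_length."""
    width = max_length / n_bins
    # Cap at the last bin so the longest sentence does not overflow.
    return min(int((length - 1) / width), n_bins - 1)


def fixed_bin_label(length, width=5, last_start=76):
    """Label such as '1-5', '6-10', ..., '76+' for constant-size bins."""
    if length >= last_start:
        return f"{last_start}+"
    start = ((length - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"


# Made-up sentence lengths with one extreme outlier (not real UD data).
lengths = [3, 8, 12, 17, 21, 25, 33, 48, 682]

equal_width = Counter(equal_width_bin(n, max(lengths)) for n in lengths)
fixed = Counter(fixed_bin_label(n) for n in lengths)

print("equal-width bins:", dict(equal_width))
print("fixed-size bins: ", dict(fixed))
```

With a single very long sentence in the sample, the equal-width scheme puts almost everything into the first bin, while the fixed-size scheme keeps the short sentences spread out and collects the outlier in 76+.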
@jnivre: In the top right corner, you can restrict the sentence lengths to e.g. 1-76 words, which gives you almost what you suggest (15 linearly distributed bins 1-6, 7-11, 12-16,...,72-76), just the bin 76+ is missing (you need to set the limits to 77-682 to see the rest). Most of the Microsoft PowerBI graphs are similarly dynamic (you can usually restrict the statistics to bigger/selected treebanks only, click/hover on some parts of the graphs etc.).
Nice! Are you going to update it after each UD release? Or what would an automatic update involve?
Cool, thanks! As usual, I didn't have time to look properly. :)
@dan-zeman it does not update automatically yet. Currently the "manual-update" pipeline looks like this:
Currently, the PBI model file stores all the data and cannot easily be kept as code in a repo. The code and data could possibly be separated. If I figure out how to do it, I will put all the aforementioned components in a separate repo. Once they are in a repo, we will only need to figure out how to programmatically reload the model and deploy it to the PBI Service.
Great! As long as it can stay up-to-date without requiring a regular action from me, I welcome it :-D
Linking it from https://universaldependencies.org/tools.html sounds like a good idea. I am not sure what section it could go in, perhaps Visualization tools.
@dan-zeman Visualization tools seems to be the most appropriate.
I published the dashboard sources on GitHub here and simplified the dashboard's version management. However, at the current stage of PowerBI development, a fully automated pipeline is impossible. Updating will require a few manual actions from me (or from anyone else who sets up a PBI account).
I thought it might be worthwhile to mention here a tool that helped me develop my own UD parser (MST, transformer-based) and to suggest (after review) linking it on the external tools page of UD.
This is a PowerBI dashboard of a few pages, each presenting selected visual statistics calculated for the whole Universal Dependencies repository v2.4 (v2.3 at the time of development). Simple metrics, such as the average length of sentences and tokens or the distribution of UPOS over all treebanks put together, were useful for making decisions about the parser's architecture and about which treebanks to consider during the testing phase to achieve the desired results. These metrics can support other work that involves simple comparison of UD data, or help introduce UD to new people through a simple, nice-looking interface.
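For anyone curious what such metrics look like outside PowerBI, here is a rough Python sketch that computes the average sentence length and the UPOS distribution from CoNLL-U text. The sample sentences and the `treebank_stats` helper are made up for illustration; this is not the code behind the dashboard.

```python
from collections import Counter

# A tiny hand-written CoNLL-U sample (illustrative, not taken from a treebank).
SAMPLE = """\
# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# text = She sleeps.
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def treebank_stats(conllu_text):
    """Return (average sentence length in words, Counter of UPOS tags)."""
    sentence_lengths = []
    upos_counts = Counter()
    current = 0
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentence_lengths.append(current)
            current = 0
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        current += 1
        upos_counts[cols[3]] += 1  # UPOS is the fourth column
    if current:
        sentence_lengths.append(current)
    average = sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0.0
    return average, upos_counts


average_length, upos = treebank_stats(SAMPLE)
print(average_length, dict(upos))
```

Running the same function over every `.conllu` file of a release and collecting the results per treebank would give a table comparable to what the dashboard visualizes.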
The tool was developed together with the parser and then released publicly at https://zacateras.github.io/udstats/. How do you assess the usefulness of further support and development of the dashboard?