UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Visual statistics for UD #688

Open zacateras opened 4 years ago

zacateras commented 4 years ago

I thought it might be worthwhile to mention a tool that helped me develop my own UD parser (MST, transformer-based) and to suggest, after review, linking it from the external tools page of UD.

It is a PowerBI dashboard of a few pages, each presenting selected visual statistics calculated for the whole Universal Dependencies repository v2.4 (v2.3 at the time of development). Simple metrics, such as the average length of sentences and tokens or the distribution of UPOS tags across all treebanks combined, were useful for deciding on the architectural design of the parser and on which treebanks to consider during the testing phase to achieve the desired results. These metrics can also support other work that involves simple comparison of UD data, or help introduce UD to newcomers through a simple, pleasant interface.

The tool was developed together with the parser and then released publicly at https://zacateras.github.io/udstats/. How do you assess the usefulness of further support and development of the dashboard?

jnivre commented 4 years ago

Very nice indeed! Just a small thing I noticed regarding sentence length. If I understand correctly, sentences are binned into 15 bins at equally sized intervals between 0 and the longest sentence in the treebank. This means that bin widths can vary greatly between treebanks and, for treebanks that have a few extremely long sentences, most sentences fall into the first or second bin. I think it would be more useful here to use constant bin sizes, say, 1-5, 6-10, ..., 76+, given that most sentences in most languages are fewer than 50 words long.
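For illustration, the constant-width binning suggested above could be sketched as follows. This is only a minimal sketch, not how the dashboard computes its bins; the function names `fixed_bin_label` and `bin_sentence_lengths`, and the defaults of width 5 with an open-ended bin starting at 76, are assumptions taken from the example ranges in this comment.

```python
from collections import Counter


def fixed_bin_label(length, width=5, overflow_start=76):
    """Return the label of the fixed-width bin that a sentence length falls into.

    Hypothetical helper: bins are 1-5, 6-10, ... and everything from
    `overflow_start` upwards is collected in a single open-ended bin.
    """
    if length < 1:
        raise ValueError("sentence length must be at least 1")
    if length >= overflow_start:
        return f"{overflow_start}+"
    low = ((length - 1) // width) * width + 1
    # Clip the upper edge so the last regular bin never overlaps the overflow bin.
    high = min(low + width - 1, overflow_start - 1)
    return f"{low}-{high}"


def bin_sentence_lengths(lengths, width=5, overflow_start=76):
    """Count how many sentences fall into each fixed-width bin."""
    return Counter(fixed_bin_label(n, width, overflow_start) for n in lengths)
```

Because the bin edges do not depend on the longest sentence, a single 682-word outlier lands in the `76+` bin and leaves the other bins comparable across treebanks.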

martinpopel commented 4 years ago

@jnivre: In the top right corner, you can restrict the sentence lengths to e.g. 1-76 words, which gives you almost what you suggest (15 linearly distributed bins 1-6, 7-11, 12-16,...,72-76), just the bin 76+ is missing (you need to set the limits to 77-682 to see the rest). Most of the Microsoft PowerBI graphs are similarly dynamic (you can usually restrict the statistics to bigger/selected treebanks only, click/hover on some parts of the graphs etc.).

dan-zeman commented 4 years ago

Nice! Are you going to update it after each UD release? Or what would an automatic update involve?

jnivre commented 4 years ago

Cool, thanks! As usual, I didn't have time to look properly. :)

zacateras commented 4 years ago

@dan-zeman it does not update automatically yet. Currently the "manual-update" pipeline looks like this:

  1. execute bash script to download and unzip UD (ready, needs to be parametrized with UD version or link)
  2. execute python script to combine all treebanks into single CSV (ready)
  3. reload PBI model file from new CSV (manual click in PBI Desktop)
  4. upload PBI model file to PBI Service (manual configuration in PBI Service)
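Step 2 might look roughly like the sketch below. It is not the actual script from the pipeline: the function names, the CSV columns, and the assumption that the unzipped release contains `UD_*/*.conllu` files are all illustrative. It relies only on the CoNLL-U conventions that comment lines start with `#`, sentences are separated by blank lines, and each token line has 10 tab-separated columns.

```python
import csv
from pathlib import Path


def iter_conllu_words(path):
    """Yield (sentence_index, token_id, form, upos) for each syntactic word."""
    sent_idx = 0
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence.
                if in_sentence:
                    sent_idx += 1
                    in_sentence = False
                continue
            if line.startswith("#"):
                continue  # sentence-level comment
            cols = line.split("\t")
            if len(cols) != 10:
                continue  # not a well-formed token line
            token_id = cols[0]
            # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 3.1).
            if "-" in token_id or "." in token_id:
                continue
            in_sentence = True
            yield sent_idx, int(token_id), cols[1], cols[3]


def combine_treebanks(ud_root, out_csv):
    """Write one CSV row per word for every treebank under ud_root.

    Assumes a hypothetical layout ud_root/UD_<Name>/<file>.conllu.
    Returns the number of data rows written.
    """
    rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["treebank", "file", "sentence", "token_id", "form", "upos"])
        for path in sorted(Path(ud_root).glob("UD_*/*.conllu")):
            for sent, tok, form, upos in iter_conllu_words(path):
                writer.writerow([path.parent.name, path.name, sent, tok, form, upos])
                rows += 1
    return rows
```

A call such as `combine_treebanks("ud-treebanks", "ud_all.csv")` (both paths are placeholders) would then produce the single file that the PBI model is reloaded from in step 3.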

Currently the PBI model file stores all the data and cannot easily be stored as code in a repo. It may be possible to separate the code from the data. If I figure out how to do that, I will put all of the aforementioned components in a separate repo. Once they are in a repo, we will only need to figure out how to programmatically reload the model and deploy it to PBI Service.

dan-zeman commented 4 years ago

Great! As long as it can stay up-to-date without requiring a regular action from me, I welcome it :-D

Linking it from https://universaldependencies.org/tools.html sounds like a good idea. I am not sure what section it could go in, perhaps Visualization tools.

zacateras commented 4 years ago

@dan-zeman Visualization tools seems to be the most appropriate.

I published the dashboard sources on GitHub here and simplified version management of the dashboard. However, at the current stage of PowerBI development, a fully automated pipeline is impossible. Updating will require a few manual actions from me (or from anyone else who sets up a PBI account).