-
It should be possible to download larger datasets. To decide: when to download and when to direct to a GitHub repo of data or other pre-compiled datasets? Should there be an upper limit to how much d…
-
-
I've been working on a project at [this repo](https://github.com/Plaba/US-Congress-Corpora-Builder). This downloads the congressional transcripts from congress.gov and converts them to text.
Since…
Plaba updated
4 years ago
-
Hi. I am trying to understand your approach and I still don't quite see how alignments are done for unrelated text and speech corpora. Could you please explain that and point out the files in the code…
-
In the current test file there are already `@xml:id` attributes for characters `` in the ``. They are in Georgian script, which does not seem to be a problem for the well-formedness of the XML, though.
I…
-
Corpora word space cleanup for larger corpora (Child Directed Speech, Gutenberg Children Books).
Clean the Gutenberg Children corpus down to ~12,000 words to get PA/PQ in a reasonable time.
-
Question from Jo Guldi:
> What about including a recommended citation format (or series of formats) for each speech?
I like this a lot! Including citation information is probably relevant for no…
-
Allow discourses and spontaneous speech corpora in general to be exported from PCT
-
Both Henk and Darja indicated that the results in the VLO are often overshadowed by records located in the lower sections of the CMDI hierarchy (e.g. "sessions" in speech corpora). There is the _Only …
-
Access exact speech database entries from searches in Korp.