DARIAH-DE / TopicsExplorer

Explore your own text collection with a topic model – without prior knowledge.
https://dariah-de.github.io/TopicsExplorer
Apache License 2.0

Problem with uploading a corpus to DARIAH Topics Explorer #124

Closed: talvera98 closed this issue 3 years ago

talvera98 commented 3 years ago

Hello,

I'm working on a project where we want to use digital methods to analyse a corpus of German texts.

Unfortunately, I cannot upload the corpus to the DARIAH Topics Explorer. If I upload any of the prepared corpora available on the TextGrid Repository, there is no problem; the program runs normally and shows the results. So I guess the problem has to be with my corpus. But the corpus is a plain text file, and according to DARIAH it should be possible to work with any plain text file. A professor at my university suggested changing the encoding of the text file from Windows to Unicode UTF-8. I tried that, but it didn't change anything; the Topics Explorer still doesn't accept the file.

I attach a screenshot of the error message and the two text files I tried it with (in Windows encoding).

Does anybody have an idea what the reason for the problem is? I'd be very grateful if you could take a look at my files, maybe even run them through the Topics Explorer, and suggest what change I have to make. Thanks a lot!

Attachments: Screenshot Dariah Error, Korpus Ev. Texte Kirche in 1Live 2019.txt, Korpus Kath. Texte Kirche in 1Live 2019.txt

severinsimmler commented 3 years ago

Hi @talvera98,

looks like the encoding of your text files is still broken. Is the original source of the text files publicly available?

tobimichigan commented 3 years ago

The issue reported by @talvera98 is real. I also tried several texts and got the same error as above. @severinsimmler @mromanello @reckart @thvitt

severinsimmler commented 3 years ago

Please make sure your text files are UTF-8 encoded, @tobimichigan.
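
In case it helps, here is a minimal sketch for checking and converting a file with Python. It is not part of Topics Explorer: the function names are hypothetical, and the source encoding (cp1252, the usual "Windows" encoding for German text) is only an assumption about your files.

```python
from pathlib import Path


def is_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> Path:
    """Write a UTF-8 copy next to the original and return its path.

    source_encoding is an assumption; cp1252 is common for German text
    saved on Windows, but your files may use something else.
    """
    # Decode from the assumed source encoding, then re-encode as UTF-8.
    text = path.read_bytes().decode(source_encoding)
    target = path.with_name(path.stem + "_utf8" + path.suffix)
    # Write bytes directly so line endings are left exactly as they were.
    target.write_bytes(text.encode("utf-8"))
    return target
```

For example, if is_utf8(Path("my_corpus.txt")) returns False (the file name is a placeholder), calling convert_to_utf8(Path("my_corpus.txt")) writes my_corpus_utf8.txt alongside it, and you can upload that copy instead. If the converted text shows wrong umlauts, the assumed source encoding was probably not the right one.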

tobimichigan commented 3 years ago

@severinsimmler, I later ran it with pipenv (python topicsexplorer.py --browser) and it ran successfully. In my case it wasn't a UTF-8 encoding problem.