bnosac / BTM

Biterm Topic Modelling for Short Text with R
Apache License 2.0

compare biterm topic modelling to rainette, LDA, coclustering, structural topic model, embedding clustering, autoencoders #9

Open jwijffels opened 5 years ago

jwijffels commented 5 years ago

Looking for some typical open data with short texts which are interesting, in order to compare clustering methods (BTM / LDA / stm / coclustering / Reinert text clustering / embedding clustering / autoencoder). @datasculptor / @manuelbickel do you know of any interesting open data?

rdatasculptor commented 5 years ago

I have never used (or even taken a look at) this dataset before, but it may be interesting: https://registry.opendata.aws/amazon-reviews/

jwijffels commented 5 years ago

Interesting and huge dataset, but unfortunately the license of that data is too restrictive.

rdatasculptor commented 5 years ago

You are right. How about this list of tweet collections: https://www.docnow.io/catalog/

jwijffels commented 5 years ago

I would prefer to use data which can be shared.

rdatasculptor commented 5 years ago

Sorry for not checking before giving the link.

manuelbickel commented 5 years ago

I have not worked with short texts. Therefore, I have no good sources at hand, unfortunately. Maybe Japanese Haiku to make Text Mining more philosophical ;-)?

Side note: sorry for not having worked on the quality metrics yet; too many other non-R-related projects. I will keep it on my list. For the time being, text2vec::coherence might be used.

On 26 June 2019 at 09:46:52 CEST, jwijffels notifications@github.com wrote:

Looking for some typical open data with short texts which are interesting, in order to compare clustering methods (BTM / LDA / stm / coclustering / reinert text clustering / embedding clustering / autoencoder) @datasculptor / @manuelbickel you know any interesting open data?


jwijffels commented 5 years ago

No problem. Japanese Haiku, yes, why not :)

rdatasculptor commented 5 years ago

Could this be interesting? https://www.linkedin.com/feed/update/urn:li:activity:6553904839447973888 Not that I am a fan or something :-)

jwijffels commented 5 years ago

I'm sure you are a fan :)

rdatasculptor commented 5 years ago

Also this one could be interesting: https://github.com/EmilHvitfeldt/textdata

msaeltzer commented 4 years ago

You could look at manifestos. manifestoR provides API access to coded political text in several languages: https://github.com/ManifestoProject/manifestoR
While manifestos are (very) long texts, they are coded here as quasi-sentences: statements at the sentence or sub-sentence level. These make up short micro texts on specific topics. While the coding is useful, it is far from perfect. The codes give an idea of the number of topics in the text but are not conclusive, as they can be aggregated to higher categories like issues and domains. I am working with them right now, using BTM.
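As a rough illustration of how short micro texts like these could be passed to BTM, here is a minimal sketch. The example sentences, the whitespace tokenisation, and the choice of `k = 3` are assumptions made up for illustration; they are not taken from the manifesto corpus, and no manifestoR calls are shown because those need an API key.

```r
library(BTM)

# Hypothetical quasi-sentences (made up for illustration only);
# real ones would come from the coded manifesto corpus.
texts <- c(
  "lower taxes for small businesses",
  "raise taxes on large corporations",
  "invest in public schools and teachers",
  "more funding for schools and universities",
  "protect forests and reduce emissions",
  "reduce emissions with renewable energy"
)

# BTM expects one row per token: the first column identifies the
# short text, the second column holds the token itself.
tokens <- strsplit(tolower(texts), "\\s+")
x <- data.frame(
  doc_id = rep(seq_along(tokens), lengths(tokens)),
  token  = unlist(tokens),
  stringsAsFactors = FALSE
)

# Fit a small model; k = 3 is an arbitrary choice for this toy data.
model <- BTM(x, k = 3, iter = 200, trace = FALSE)

# Most probable tokens per topic, and topic probabilities per short text.
top_terms <- terms(model, top_n = 5)
scores    <- predict(model, newdata = x)
```

With real data, the number of topics could be compared against the number of coded categories, keeping in mind that the codes are only a rough guide.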

jwijffels commented 4 years ago

Interesting. Didn't know these political party manifestos existed.