UUDigitalHumanitieslab / Reader-responses-to-translated-literature

Scripts for the DIOPTRA-L project (Digital Opinions on Translated Literature)
MIT License

Full text / txt download of the reviews #20

Open alexhebing opened 4 years ago

alexhebing commented 4 years ago

When creating the test corpus (i.e. The Dinner and Harry Potter), Haidee explicitly asked for a txt version of the corpus, i.e. one file per review that contains only the review text, with some metadata in the filename. I assume this makes it easier to work with (subsets of) the data in applications like Voyant.
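To make the "one txt per review, metadata in the filename" idea concrete, here is a minimal sketch. The field names (`title`, `language`, `id`, `text`) and the filename pattern are assumptions for illustration only, not the layout actually used for the test corpus.

```python
import re
from pathlib import Path


def review_filename(title, language, review_id):
    """Build a filesystem-safe filename that carries some metadata.

    The pattern <title-slug>_<language>_<id>.txt is hypothetical.
    """
    # Lowercase the title and collapse every run of non-alphanumerics to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_{language}_{review_id}.txt"


def write_review(review, out_dir):
    """Write only the review text to its own UTF-8 txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / review_filename(
        review["title"], review["language"], review["id"]
    )
    path.write_text(review["text"], encoding="utf-8")
    return path
```

With this scheme, a subset can be selected by matching on filenames alone, without opening the files.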

I can, and probably will, share the full corpus with Haidee and Gys-Walt, including txts, once the scraping is done. However, given the number of titles, I expect to scrape over 100,000 reviews. This makes manually selecting the txts for a subset virtually impossible.

Question: is it conceivable / doable to add a full-text download to I-analyzer that would allow downloading a subset of reviews / documents? There is also a script I developed for @JosedeKruif that can do this type of thing (here), but the disadvantage is that customers would have to run Python locally (and set up a virtualenv etc.). @BeritJanssen: what do you think, is a txt download from I-analyzer feasible?

BeritJanssen commented 4 years ago

This is certainly possible and shouldn't be hard to achieve. However, how to package this kind of behaviour on the frontend is a harder question. Potentially, we could add an extra corpus setting that replaces the csv download functionality with a download of txts as a zip. Should we discuss this tomorrow?
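For reference, the packaging step itself could look roughly like the sketch below, which builds the zip in memory from a list of documents. This is not I-analyzer code: the function name `zip_reviews` and the `id` / `text` keys are hypothetical, and how the documents are selected is left out.

```python
import io
import zipfile


def zip_reviews(reviews):
    """Return the bytes of a zip archive with one txt file per review.

    Each review is assumed to be a dict with hypothetical keys
    "id" and "text".
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for review in reviews:
            # writestr encodes str data as UTF-8 before storing it.
            archive.writestr(f"{review['id']}.txt", review["text"])
    return buffer.getvalue()
```

The returned bytes could then be sent as a file download. For a very large selection, writing the archive to a temporary file instead of memory would likely be safer.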