AbeHandler / rookie

The Rookie Text Analysis System

Document displayer? (Free text to html? research problem?) #210

Closed AbeHandler closed 8 years ago

AbeHandler commented 8 years ago

One hindrance to new corpora (or even gawker) is that Rookie needs to link to original URLs (we don't have them for gawker). If Rookie hosted the documents, we would need to break paragraphs etc. for display, which is nontrivial. Just going from, say, goose output to pretty HTML for display is a worthy standalone problem. Basically: where to put the paragraph breaks, right?

brendano commented 8 years ago

actually gawker ones can be reconstructed with the numeric id in all.meta. e.g. gawker/5487177 => http://gawker.com/5487177
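
A minimal sketch of that mapping, assuming document ids always look like `gawker/<numeric id>` as in the example above. The function name `gawker_url` is hypothetical, and the layout of all.meta is not shown in this thread, so reading the id out of that file is left out.

```python
def gawker_url(doc_id: str) -> str:
    """Rebuild the original URL from an id such as 'gawker/5487177'."""
    # Split "gawker/5487177" into the source name and the numeric id.
    source, _, numeric_id = doc_id.partition("/")
    # Assumption: only ids of the form "gawker/<digits>" are valid.
    if source != "gawker" or not numeric_id.isdigit():
        raise ValueError(f"not a gawker doc id: {doc_id!r}")
    return f"http://gawker.com/{numeric_id}"
```

For example, `gawker_url("gawker/5487177")` gives `http://gawker.com/5487177`.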

brendano commented 8 years ago

if there's a corpus without paragraph breaks (the gawker extraction is currently like this, though we could try doing a new extraction to be better about it), a reasonable way to display it is to just add a <br> after every sentence, assuming sentence segmentation.

i guess the other way is all sentences in one big paragraph. i think that's harder to read usually.
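
Both display options can be sketched roughly as follows. This assumes the sentences are already segmented into a list of strings; the function name `sentences_to_html` and its `one_per_line` flag are hypothetical and not part of Rookie. Text is escaped with the standard library's `html.escape` so stray `<` or `&` characters in the corpus don't break the page.

```python
import html


def sentences_to_html(sentences, one_per_line=True):
    """Render pre-segmented sentences as a simple HTML fragment."""
    # Escape each sentence and drop empty ones.
    escaped = [html.escape(s.strip()) for s in sentences if s.strip()]
    if one_per_line:
        # Option 1: a <br> after every sentence.
        return "".join(s + "<br>\n" for s in escaped)
    # Option 2: all sentences in one big paragraph.
    return "<p>" + " ".join(escaped) + "</p>"
```

Calling `sentences_to_html(sents)` gives the one-sentence-per-line version, and `sentences_to_html(sents, one_per_line=False)` gives the single-paragraph version, so the two can be compared side by side for readability.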