New boundary to differentiate multiple documents in one file

EducationalTestingService / rstfinder

Fast Discourse Parser to find latent Rhetorical STructure (RST) in text.

MIT License


New boundary to differentiate multiple documents in one file #8

Closed: dmnapolitano closed this issue 10 years ago

dmnapolitano commented 10 years ago

Hello. Using \n\n as a paragraph boundary marker within a document is pretty common; however, in rst_parse we're looking for \n\n to differentiate one document from another in a file containing multiple documents. Is this a Penn Treebank thing? If so, we can leave this as the behaviour during training (-t) and then make rst_parse either (a) only take one document per file at a time, or (b) look for a different document marker. Thanks :smile:
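To make the ambiguity concrete, here is a minimal sketch of the two behaviours. It is not rst_parse's actual code: the marker string `DOC_MARKER` and both function names are hypothetical, chosen only to illustrate option (b).

```python
# Hypothetical sketch; none of these names come from rst_parse.
DOC_MARKER = "<<<DOC>>>"  # assumed document marker, for illustration only


def split_on_blank_lines(text):
    # Treat every blank line as a document boundary.
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def split_on_marker(text, marker=DOC_MARKER):
    # Treat only the explicit marker as a document boundary, so blank
    # lines inside a document remain paragraph breaks.
    return [chunk.strip() for chunk in text.split(marker) if chunk.strip()]


one_document = "First paragraph.\n\nSecond paragraph."
two_documents = one_document + "\n" + DOC_MARKER + "\nAnother document."

# A single two-paragraph document is split into two "documents".
blank_line_result = split_on_blank_lines(one_document)

# With an explicit marker, paragraph breaks survive inside each document.
marker_result = split_on_marker(two_documents)
```

With blank-line splitting, a raw two-paragraph document comes back as two separate documents, which is the problem described above. With a dedicated marker, the first document keeps its internal paragraph break.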

dan-blanchard commented 10 years ago

Well, currently rst_parse is expecting output from the discourse segmenter, which puts those in as delimiters. If we had one script we called that didn't require running the segmenter first, then we would definitely want to change that, but for now, we're the ones generating the input to rst_parse.

dmnapolitano commented 10 years ago

So we don't want rst_parse to take one file containing > 1 non-discourse-segmented documents? That's fine by me. :+1: :smile:

dan-blanchard commented 10 years ago

I don't believe so. I think it should take a list of files eventually.
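A rough sketch of what "a list of files" could look like, assuming one document per file. The function name `read_documents` is hypothetical and is not part of rst_parse.

```python
from pathlib import Path


def read_documents(paths):
    # One document per file: the whole file is a single document, so a
    # blank line inside it can only ever be a paragraph break.
    return [(str(path), Path(path).read_text(encoding="utf-8")) for path in paths]
```

Under this assumption no in-file document delimiter is needed at all, since the file boundary is the document boundary.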

dmnapolitano commented 10 years ago

Think we're good :smile: