Closed dmnapolitano closed 10 years ago
Well, currently `rst_parse` is expecting output from the discourse segmenter, which puts those as delimiters. If we had one script we called that didn't require running the segmenter first, then we would definitely want to change that, but for now, we're the ones generating the input to `rst_parse`.
So we don't want `rst_parse` to take one file containing > 1 non-discourse-segmented documents? That's fine by me. :+1: :smile:
I don't believe so. I think it should take a list of files eventually.
Think we're good :smile:
Hello. Using `\n\n` as a paragraph boundary marker within a document is pretty common; however, in `rst_parse` we're looking for `\n\n` to differentiate one document from another in a file containing multiple documents. Is this a Penn Treebank thing? If so, we can leave this as the behaviour during training (`-t`) and then make `rst_parse` either (a) only take one document per file at a time, or (b) look for a different document marker. Thanks :smile:
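
To make the ambiguity concrete, here is a minimal sketch of what goes wrong and what option (b) might look like. None of this is code from `rst_parse`: the function names and the `=====` marker are made up for illustration only.

```python
# Hypothetical illustration; none of these names come from rst_parse.
DOC_MARKER = "\n=====\n"  # assumed document marker, chosen only for this example


def split_on_blank_lines(text):
    """Treat every blank line (\\n\\n) as a document boundary."""
    return [chunk for chunk in text.split("\n\n") if chunk.strip()]


def split_on_marker(text, marker=DOC_MARKER):
    """Treat only the explicit marker as a document boundary."""
    return [doc.strip() for doc in text.split(marker) if doc.strip()]


# Two documents, each with two paragraphs separated by a blank line.
two_docs = (
    "Doc one, paragraph one.\n\nDoc one, paragraph two."
    + DOC_MARKER
    + "Doc two, paragraph one.\n\nDoc two, paragraph two."
)

# Splitting on blank lines cuts inside each document and glues the end of
# document one to the start of document two.
blank_line_docs = split_on_blank_lines(two_docs)

# Splitting on the explicit marker keeps the paragraph breaks intact.
marker_docs = split_on_marker(two_docs)

print(len(blank_line_docs), len(marker_docs))
```

With a dedicated marker, `\n\n` stays available as a paragraph boundary inside each document. Option (a), one document per file, avoids needing a marker at all.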