PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend

CQP-method to query for single quotation marks? #171

Studentenfutter closed this issue 2 years ago

Studentenfutter commented 4 years ago

I want to write a CQP query to extract names enclosed in quotation marks, for example Verein "Mehr Demokratie". When I try this with an escaped quotation mark, [word="\\""], polmineR returns an error:

Number of quotation marks is not divisable by 2: Opening quotation marks are not matched by closing quotation marks, or vice versa. Aborting to avoid a potential crash of CQP and the entire R session. Please check query.

The (OpenCPU) corpus I use has escaped quotation marks. This can be seen with the query [lemma="Verein"] [pos="PUNCT"] [pos="PUNCT"] []* [pos="PUNCT"] [pos="PUNCT"], which returns, for example, Verein \" \" Die wahre Religion \" \"

Using [pos="PUNCT"] is not a satisfactory workaround, however, because it matches all punctuation and produces many false positives.

How could I write a query that catches the (escaped) quotation marks without crashing CQP?

ablaette commented 2 years ago

This is a potential solution:

count("GERMAPARL", query = "'``' 'Mehr' 'Demokratie' 'wagen' '(\\'\\'|``)'", cqp = TRUE)

But this also relies on prior knowledge that opening and closing quotation marks are represented somewhat differently in the corpus. I do not yet have a good idea of where we could document that.

ablaette commented 2 years ago

My apologies that the most plausible solution did not occur to me earlier. Checking the CQP syntax is a good idea for the robustness of the code, but in this special case you can simply skip the check by setting the argument check to FALSE. The warning issued by check_cqp_query() now includes a hint to that effect.
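
For readers landing here, a minimal sketch of both points. The helper n_char() is hypothetical and only illustrates the kind of parity test the error message describes; it is not polmineR's actual implementation of check_cqp_query(). The corpus name "GERMAPARL" is taken from the reply above, and whether a plain " token occurs in a given corpus is an assumption that depends on how the corpus was tokenized.

```r
# Hypothetical helper: count occurrences of one character in a string (base R).
n_char <- function(x, char) {
  lengths(regmatches(x, gregexpr(char, x, fixed = TRUE)))
}

# The query from the question. As an R string literal it holds the
# characters [word="\""], which is what would be handed to CQP.
query <- '[word="\\""]'

# Two delimiting quotation marks plus the escaped one: an odd number,
# so a check that requires an even count rejects the query.
n_double <- n_char(query, '"')
n_double %% 2 == 0

# Skipping the check passes the query on unchanged. This part needs polmineR
# and an installed corpus, so it only runs if both are available
# (assumption: a corpus named "GERMAPARL" is registered locally).
if (requireNamespace("polmineR", quietly = TRUE) &&
    "GERMAPARL" %in% polmineR::corpus()$corpus) {
  print(polmineR::count("GERMAPARL", query = query, cqp = TRUE, check = FALSE))
}
```

With check = FALSE nothing validates the query before it reaches CQP, so it is worth using only for queries like this one, where the unbalanced count is intentional.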