PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend

corpus_registry_dir() duplicating on Windows #267

Open maw44989 opened 11 months ago

maw44989 commented 11 months ago

We are using polmineR for a Text and Corpus Analysis class for undergraduate and graduate students. For individuals using polmineR on Windows, there is a recurring issue that prevents them from using the package. Here is the output of the error:

" error in evaluating the argument '.Object' in selecting a method for function 'count': Cannot initialize corpus object - corpus defined by two different registry files."

Included below is the issue on Windows and a positive example of how polmineR correctly works on Mac.

Windows Issue

Version Numbers

packageVersion("RcppCWB")
[1] 0.6.2
packageVersion("polmineR")
[1] 0.8.8
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"

Check path before loading polmineR

c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
NA

Check path after loading

library(polmineR)
c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
R:/windows_registry

Run count command once

QD <- count("BNC", query = "'quite' [pos = '(DT0|DTQ)']", cqp = T, regex = T)
head(QD)
query count freq
1: 'quite' [pos = '(DT0|DTQ)'] 626 5.581494e-06

Registry is now duplicated (length = 2), prompting the above error on all future commands using polmineR

c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
R:/windows_registry R:/windows_registry

Mac Success

Versions

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"

packageVersion("polmineR")
[1] ‘0.8.8’
packageVersion("RcppCWB")
[1] ‘0.6.2’

Set Registry Environment and load polmineR

Sys.setenv("CORPUS_REGISTRY" = "/Volumes/cwb_registry/mac_registry")
library(polmineR)

Run count command --> registry is still length 1

QD <- count("BNC", query = "'quite' [pos = '(DT0|DTQ)']", cqp = T, regex = T)
fs::path(RcppCWB::corpus_registry_dir("BNC"))
/Volumes/cwb_registry/mac_registry
RA <- count("BNC", query = "'rather' [pos = '(A.)']", cqp = T, regex = T)
RA
query count freq
1: 'rather' [pos = '(A.)'] 12658 0.0001128603

Even after running count() twice, the corpus_registry_dir of the British National Corpus still has length 1. On Windows it doubles and becomes length 2.

fs::path(RcppCWB::corpus_registry_dir("BNC"))
/Volumes/cwb_registry/mac_registry

jthale76 commented 11 months ago

We can actually make it happen using only lower-level RcppCWB functions. The call to cqp_subcorpus_size shows that the search itself was successful. Have there been any changes in the last year to RcppCWB that might lead to such doubling? The demonstration below uses Windows version 10.0.19045

library(RcppCWB)
packageVersion("RcppCWB")
[1] ‘0.6.2’

Sys.setenv("CORPUS_REGISTRY"="R:/windows_registry")
cqp_reset_registry(registry = Sys.getenv("CORPUS_REGISTRY"))
[1] TRUE

cqp_query(corpus = "BNC", query = '"the";')
<pointer: 0x000001e1fc908c50>

cqp_subcorpus_size("BNC",subcorpus="QUERY")
[1] 5405646

corpus_registry_dir("BNC")
R:/windows_registry R:/windows_registry

cqp_query(corpus = "BNC", query = '"of";')
<pointer: 0x000001e1fc908c50>

corpus_registry_dir("BNC")
R:/windows_registry R:/windows_registry
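
For anyone trying to reproduce this, the duplication shown above can be flagged programmatically. The following is only a minimal sketch in base R: `registry_is_duplicated()` is a hypothetical helper, not part of RcppCWB or polmineR, and it assumes that the symptom is the same directory appearing more than once in the vector returned by `corpus_registry_dir()`.

```r
# Hypothetical helper (not part of RcppCWB or polmineR): returns TRUE when a
# vector of registry directories lists the same directory more than once.
registry_is_duplicated <- function(dirs) {
  # Drop missing values, e.g. the NA seen before a registry is set.
  dirs <- as.character(dirs[!is.na(dirs)])
  # Treat backslashes and forward slashes as equivalent separators.
  normalised <- gsub("\\", "/", dirs, fixed = TRUE)
  length(normalised) > 1L && anyDuplicated(normalised) > 0L
}

# Values copied from the outputs reported in this thread.
windows_after_query <- c("R:/windows_registry", "R:/windows_registry")
mac_after_query <- "/Volumes/cwb_registry/mac_registry"

print(registry_is_duplicated(windows_after_query))
print(registry_is_duplicated(mac_after_query))

# In a live session with the corpus installed, the check would be:
# registry_is_duplicated(RcppCWB::corpus_registry_dir("BNC"))
```

Calling the check after each `cqp_query()` would show at which step the second entry appears. It only detects the symptom and does not work around it.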

ablaette commented 10 months ago

I face a similar issue when doing this on macOS:

library(polmineR)
use("GermaParl2")

foo <- corpus("GERMAPARL2MINI") %>%
  subset(protocol_date == "1949-09-07", verbose = TRUE) %>% 
  subset(speaker_name == "Konrad Adenauer", verbose = TRUE)

foo <- corpus("GERMAPARL2MINI") %>%
  subset(protocol_date == "1949-09-07", verbose = TRUE) %>% 
  subset(speaker_name == "Konrad Adenauer", verbose = TRUE)

It's absolutely clear that this issue needs to be solved. Apologies for taking it up this late!

ablaette commented 10 months ago

There is a closely related issue on macOS: RcppCWB::cl_struc_values() will result in a corpus being loaded twice.

https://github.com/PolMine/RcppCWB/issues/77