PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Vignette | Minor Remarks #29

Open ChristophLeonhardt opened 3 years ago

ChristophLeonhardt commented 3 years ago

These are small things I noticed that do not warrant issues of their own.

library(bignlp)

library(bignlp) is called twice (lines 67 and 85), which isn't harmful but is unnecessary.

props in Workflow 3

You present three workflows for different use cases. All three use "props" when they are initialized. In the first two, the properties are explicitly loaded from the package and configured as the workflow requires. The third workflow uses props without saying where it comes from. It is possible to infer from the StanfordCoreNLP class that props should be set up as in the first workflow (including properties_set_threads(props, no_cores)), but it might be useful to be explicit here.

Explaining annotate

I think it would be interesting to learn about the threads argument of Pipe$annotate(alist) and why it isn't passed explicitly in line 232. The documentation of the annotate method says:

threads If NULL, all available threads are used, otherwise an integer value with number of threads to use.

This is true: by default it uses all available threads (depending on the annotators in use). But in my configuration, even with Pipe$annotate(annoli, 2L), the process still seems to use more than two cores at times. I assume that different annotators in the pipeline use different numbers of cores. So maybe the "threads" argument does not control this globally?

If this is a substantial issue, I can also submit it separately.
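
To illustrate what I mean by "not globally": as far as I know, CoreNLP has a pipeline-level "threads" property and, separately, per-annotator "nthreads" properties. The fragment below is only a sketch of how both levels could be capped. It assumes that these keys apply to the CoreNLP version in use and that bignlp passes them through unchanged; I have verified neither.

```properties
# Sketch only: assumes these CoreNLP keys are honoured by the installed version
# and that bignlp forwards them unchanged.

# Pipeline-level thread count
threads = 2

# Per-annotator thread counts (set only for the annotators actually used)
pos.nthreads = 2
ner.nthreads = 2
parse.nthreads = 2
```

If the extra core usage disappears with settings like these, that would suggest the per-annotator settings are what matters here.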

propslist (lines 274ff)

I assume this is just an example, but it shouldn't work as written because the German models don't include lemma.

German Properties file

At line 443 you talk about the German properties file. It might be useful either to add the "pos" and "ner" annotators to the file itself, or to mention in the vignette that they are turned off in the file provided by the package and can be added to it manually.

https://github.com/PolMine/bignlp/blob/8e0fcfe11aa1c6b1d98e42aab906bb16bd0424bc/inst/extdata/properties_files/corenlp-german-fast.properties#L2
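
For illustration, the manual change could look roughly like the line below. This is a sketch, not the content of the shipped file: I am assuming the file currently lists only the tokenizer and sentence splitter, and the matching pos and ner model paths from the German models jar would still need to be set for the CoreNLP version in use.

```properties
# Sketch only: assumed edit to the annotators line of corenlp-german-fast.properties.
# "pos" and "ner" are appended; their model paths are not shown and must be
# configured to match the installed German models.
annotators = tokenize, ssplit, pos, ner
```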