PolMine / LinkTools

R package with tools for data linkage
GNU General Public License v3.0

Use decode() instead of RcppCWB to decode subcorpora, or use s_attributes() #17

Open ChristophLeonhardt opened 1 year ago

ChristophLeonhardt commented 1 year ago

If the text data object is split to save memory, subcorpora are created. This process starts here:

https://github.com/PolMine/LinkTools/blob/60543c9a433a5ff33f1dffaf3d719f4f501d89b5/R/LTDataset.R#L323

Because of an issue in polmineR, nested corpora can be tricky to decode if the structural attributes used are not on the same level. Hence, for subcorpora, a combination of RcppCWB::cl_cpos2struc() and RcppCWB::cl_struc2str() is currently used. This is a workaround and should be replaced as soon as polmineR::decode() works for this use case.

Then the corresponding line would probably read:

decode(i_split, s_attributes = names(self$match_by), p_attributes = character())

Note: The entire purpose of this endeavour is to get a region matrix in which the regions are defined by an arbitrary combination of structural attributes, i.e. a new region starts wherever the value of at least one of these attributes changes. There should be an easier way to achieve this.
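To make the target data structure concrete, here is a minimal base R sketch of how such a region matrix could be derived once the attribute values are known for each corpus position. It is an illustration only, not the LinkTools implementation: `region_matrix()` and its arguments `cpos` and `attrs` are hypothetical names, and the per-token values are assumed to have been looked up beforehand (for example with the RcppCWB functions mentioned above).

```r
# Hypothetical helper, not part of LinkTools, polmineR or RcppCWB.
# cpos:  integer vector of corpus positions, in ascending order
# attrs: named list of character vectors, one value per corpus position
region_matrix <- function(cpos, attrs) {
  stopifnot(length(cpos) > 0L, all(lengths(attrs) == length(cpos)))
  n <- length(cpos)
  # One key per token that combines the values of all attributes.
  key <- do.call(paste, c(unname(attrs), sep = "\r"))
  # A new region starts at the first token, wherever the key changes,
  # or wherever the corpus positions are not consecutive.
  new_region <- c(TRUE, key[-1L] != key[-n] | diff(cpos) != 1L)
  starts <- which(new_region)
  ends <- c(starts[-1L] - 1L, n)
  list(
    regions = cbind(cpos_left = cpos[starts], cpos_right = cpos[ends]),
    values = as.data.frame(lapply(attrs, `[`, starts), stringsAsFactors = FALSE)
  )
}

# Toy data: two attributes that change at different positions.
cpos <- 0:7
attrs <- list(
  speaker = c("A", "A", "A", "B", "B", "B", "B", "A"),
  date    = c("d1", "d1", "d2", "d2", "d2", "d2", "d2", "d2")
)
res <- region_matrix(cpos, attrs)
res$regions
res$values
```

For the toy data this yields four regions (0-1, 2-2, 3-6 and 7-7), because a boundary is set whenever either attribute changes. Working on per-token vectors is simple but memory-hungry for large corpora, which is one reason why an approach that operates on regions directly, as discussed in this issue, would be preferable.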

The current (albeit probably unintended) implementation of s_attributes() for corpora seems to do this quite efficiently. If it can be used robustly, it might be the better solution anyway.