Open wvdvegte opened 1 year ago
I had a similar idea for adding n-grams from Collocations. It would be nice to have a separate input for "appending" to tokens. The output would be added to both Preprocess Text and Collocations, and the input to Preprocess Text. @PrimozGodec Is this a viable option?
Is your feature request related to a problem? Please describe.
Certain types of documents, such as scientific publications, are often accompanied by a list of keywords that typically include N-grams, such as "generative neural networks", "fertility rates" or "consumer preferences". If the publications' metadata can be downloaded, the keywords often appear in a separate column, separated by commas, semicolons or some other separator. Using Preprocess Text, these N-grams can easily be extracted by tokenizing with the regexp [^;]+ (for ";" as separator). When analyzing the full texts or abstracts, it would be very useful if these same N-grams were also recognized as belonging together rather than as separate words, including keyword N-grams from other documents that appear in a document's main text (but not in its keywords). Of course, N-grams can be extracted by defining an N-grams range in Preprocess Text, but this also produces many meaningless or less meaningful N-grams, especially if in-between stopwords have already been removed.
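To illustrate the keyword tokenization described above outside of Orange, here is a minimal Python sketch of what the regexp [^;]+ does to a keyword column. The function name `split_keywords` is made up for this example, and the trimming and lowercasing are assumptions about how one would normalize the keywords. It is not how Preprocess Text is implemented.

```python
import re

def split_keywords(field, sep=";"):
    """Split one keyword field into keyword strings (hypothetical helper)."""
    # Same idea as the regexp [^;]+ : each maximal run of
    # non-separator characters becomes one token.
    pattern = "[^" + re.escape(sep) + "]+"
    # Assumption: surrounding whitespace is trimmed and keywords are lowercased.
    return [m.strip().lower() for m in re.findall(pattern, field) if m.strip()]

keywords = split_keywords(
    "Generative neural networks; fertility rates ;consumer preferences"
)
print(keywords)
```

With a comma-separated column, `split_keywords(field, sep=",")` would apply the same pattern with a different separator.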
Describe the solution you'd like
Ideally, I would like to be able to connect two corpora as input to Preprocess Text: one with the main texts or abstracts from all documents, and one with all the keywords (1-grams and N-grams) from all documents, presumably already tokenized with Preprocess Text. The second input would only be used for a "keyword N-gram construction" step after tokenization of the first input (not necessarily at the end, like regular N-gram construction). Another option would be to allow specifying an "N-gram keyword lexicon" file in the Filtering step, but that would require a two-step approach in which the list of keywords has to be re-created and reloaded each time documents are added.
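To make the proposed "keyword N-gram construction" step more concrete, here is a rough Python sketch of one way it could behave: the tokens of the main text are scanned and any run of tokens that matches a keyword N-gram is merged into a single token, preferring the longest match. The function name `merge_keyword_ngrams`, the `joiner` parameter and the greedy longest-match strategy are all assumptions for illustration, not an existing Orange API.

```python
def merge_keyword_ngrams(tokens, keyword_ngrams, joiner=" "):
    """Merge runs of tokens that match a keyword N-gram (hypothetical helper)."""
    # Keep only multi-word keywords, stored as tuples of words.
    lexicon = {tuple(k.split()) for k in keyword_ngrams if len(k.split()) > 1}
    max_len = max((len(k) for k in lexicon), default=1)

    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, so a longer keyword wins
        # over a shorter keyword that overlaps with it.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No keyword N-gram starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "we train generative neural networks to model fertility rates".split()
keywords = ["generative neural networks", "neural networks", "fertility rates"]
print(merge_keyword_ngrams(tokens, keywords))
```

Unlike a plain N-grams range, this only produces N-grams that appear in the keyword corpus, which is the noise reduction the request is after.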
Describe alternatives you've considered
As mentioned above, using the regular N-gram construction option, which produces a lot of noise.