bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Preserving document order #44

Closed allengoebl closed 5 years ago

allengoebl commented 5 years ago

When document_term_matrix() is called after document_term_frequencies(), the resulting dtm is ordered differently from the original character vector. Although it is possible to reorder the dtm correctly using the document ids, this behavior seems undesirable.

jwijffels commented 5 years ago

That's correct. It's ordered by doc_id as per https://github.com/bnosac/udpipe/blob/master/R/nlp_flow.R#L201

allengoebl commented 5 years ago

Ok, from your link it looks like this part of the code is ordering the variables:

x$document <- factor(as.character(x$doc_id))

Here is an example showing the potentially problematic behavior:

levels(factor(as.character(paste0('c', 1:10))))

which produces the following output:

[1] "c1" "c10" "c2" "c3" "c4" "c5" "c6" "c7" "c8" "c9"

Note that the 10th document is now ordered second. One possible solution would be to use numeric factors that contain no characters as the default doc_ids.
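To make the suggestion concrete, here is a small base R sketch. It assumes nothing about udpipe itself; the id formats are only illustrative. Integer ids sort numerically only while they stay integers: once they pass through as.character(), as in the line quoted above, they sort as strings again. Zero-padded character ids are one alternative that survives string sorting.

```r
# Integer ids: factor() sorts the unique values numerically.
int_levels <- levels(factor(1:10))
int_levels      # "1" "2" ... "10"

# The same ids after as.character() are sorted as strings,
# so "10" no longer comes last.
chr_levels <- levels(factor(as.character(1:10)))
chr_levels

# Zero-padded ids (a hypothetical naming scheme) keep document
# order even under string sorting.
padded <- sprintf("c%02d", 1:10)
padded_levels <- levels(factor(padded))
padded_levels   # "c01" "c02" ... "c10"
```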

jwijffels commented 5 years ago

Yes, I understand what factor does.

allengoebl commented 5 years ago

What I'm trying to point out is that this line of code changes the order of the documents. In many cases, it is helpful to retain the original order of the documents.

The document order changes because the automatically generated document names start with a character and are thus ordered alphabetically, with "c1" being followed by "c10".

jwijffels commented 5 years ago

Can you show an example where it is helpful to retain the original order of the documents?

allengoebl commented 5 years ago

Perhaps you are trying to make predictions based on written survey responses. You have a large dataset of survey responses and you want to represent these responses as a tf-idf matrix. Ideally, this tf-idf matrix should retain the same order as the original dataset used to create it, so that your predictions match up with the data.

jwijffels commented 5 years ago

Ok. You can always use match to put them in the same order again. That's what I do in such cases, in order to be 100% sure the tfidf matrix and the response are of the same size and in the same order. Any other use cases you had in mind?
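A minimal sketch of that match-based realignment, using a plain base R matrix as a stand-in for the dtm (the names responses, dtm and dtm_aligned are made up for illustration; the same row indexing should apply to a sparse matrix with doc_id rownames):

```r
# Original data, in the order the documents were supplied.
ids <- paste0("c", 1:10)
responses <- data.frame(doc_id = ids, y = seq_along(ids),
                        stringsAsFactors = FALSE)

# Stand-in for a dtm whose rows follow the default factor level order
# rather than the original document order.
dtm <- matrix(seq_len(20), nrow = 10,
              dimnames = list(levels(factor(ids)), c("term_a", "term_b")))

# match() gives, for each original doc_id, its row position in the dtm.
idx <- match(responses$doc_id, rownames(dtm))
dtm_aligned <- dtm[idx, , drop = FALSE]

# Rows are now in the same order as the original data.
stopifnot(identical(rownames(dtm_aligned), responses$doc_id))
```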

allengoebl commented 5 years ago

I think that is the main use case. There are definitely workarounds, but I think behavior like this has a tendency to cause errors in people's code (even though these errors could have been prevented by better coding practice).

If you want to keep the current default doc_id behavior I think it might be possible to directly set the factor order using something like:

x$document <- factor(as.character(x$doc_id), levels = unique(as.character(x$doc_id)))
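As a rough illustration of this idea on toy data (not the package's actual implementation): a frequency table has one row per document/term pair, so doc_id repeats, and factor() expects its levels to be unique. Passing the ids through unique() keeps them in order of first appearance.

```r
# Toy frequency table: one row per (doc_id, term) pair, so doc_id repeats.
x <- data.frame(
  doc_id = c("c1", "c1", "c2", "c10", "c10"),
  term   = c("a", "b", "a", "b", "c"),
  freq   = c(1L, 2L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Default: levels are sorted as strings, so "c10" comes before "c2".
doc_default <- factor(as.character(x$doc_id))

# Explicit levels in order of first appearance preserve document order.
doc_ordered <- factor(as.character(x$doc_id),
                      levels = unique(as.character(x$doc_id)))
levels(doc_ordered)  # "c1" "c2" "c10"

# A dense cross-tabulation inherits its row order from the factor levels.
m <- xtabs(freq ~ document + term,
           data = transform(x, document = doc_ordered))
rownames(m)          # "c1" "c2" "c10"
```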

jwijffels commented 5 years ago

I personally think that if people have the tendency not to use match to align the dtm with other data, they might be terribly mistaken. Most of the time, the response in your predictive model comes from other data, which might be filtered after you did the udpipe annotation. For this reason, I would always advise using match to align the rownames of the dtm with another dataset. Nevertheless, the fix is fine for me. Changed the behaviour of document_term_matrix in commit https://github.com/bnosac/udpipe/commit/eac2d12405f58e89d68c6f53650b6387e966c9ea
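The filtering concern can be sketched as follows, again with a plain matrix standing in for the dtm and made-up data. When a document is missing from the dtm (for example because none of its terms survived filtering), match() returns NA for it, so both sides should be subset consistently before modelling.

```r
# Response data for four documents.
responses <- data.frame(doc_id = c("c1", "c2", "c3", "c4"),
                        y = c(0, 1, 1, 0),
                        stringsAsFactors = FALSE)

# Stand-in dtm: "c3" is missing and the rows are in a different order.
dtm <- matrix(1:6, nrow = 3,
              dimnames = list(c("c4", "c1", "c2"), c("term_a", "term_b")))

# NA marks documents that have no row in the dtm.
idx  <- match(responses$doc_id, rownames(dtm))
keep <- !is.na(idx)

# Keep only documents present on both sides, in the response order.
X <- dtm[idx[keep], , drop = FALSE]
y <- responses$y[keep]

stopifnot(identical(rownames(X), responses$doc_id[keep]),
          nrow(X) == length(y))
```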