Jean-Baptiste-Camps / stemmatology

Stemmatological Analysis of Textual Traditions
GNU General Public License v3.0

Error in PCC.disagreement: Input is not a numeric matrix. #51

Open GGoetzelmann opened 6 years ago

GGoetzelmann commented 6 years ago

Thank you for this R package; it looks like an interesting project. I have started playing around with it (without much insight into the stemma creation method yet, and with no experience in R at all).

I have tried to use data with multiple readings and the parameter alternateReadings=TRUE.

In the interactive mode the first few steps work fine but then the error

Error in PCC.disagreement(tableVariantes, omissionsAsReadings = omissionsAsReadings) : Input is not a numeric matrix.

is thrown.

I have tried with a real data set first but then used the example matrix from the documentation as test data.

I had to duplicate the matrix a few times, otherwise I got the error

Error in cluster::pam(ordConflTot[, 1], numberOfClasses) : Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2

So my minimal data example for the error would be:

   A     D     F     T   P
1  "1"   "2"   "2"   "2" "1,2"
2  "1"   "2"   "1,2" "2" "1"
3  "1"   "1"   "1"   "1" "2"
4  "1,3" "1,2" "1"   "2" "3"
5  "1"   "2"   "2"   "2" "1,2"
6  "1"   "2"   "1,2" "2" "1"
7  "1"   "1"   "1"   "1" "2"
8  "1,3" "1,2" "1"   "2" "3"
9  "1"   "2"   "2"   "2" "1,2"
10 "1"   "2"   "1,2" "2" "1"
11 "1"   "1"   "1"   "1" "2"
12 "1,3" "1,2" "1"   "2" "3"
13 "1"   "2"   "2"   "2" "1,2"
14 "1"   "2"   "1,2" "2" "1"
15 "1"   "1"   "1"   "1" "2"
16 "1,3" "1,2" "1"   "2" "3"

I have loaded it from a txt file with mydata = read.table("filename.txt") and mydata = as.matrix(mydata) and then used PCC(mydata,alternateReadings=TRUE).
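For anyone trying to reproduce this, here is a minimal sketch of the loading step using only base R. It reads the first four rows of the example inline instead of from "filename.txt", and it forces every column to character with colClasses (my assumption, so that the result does not depend on R's automatic type conversion). The point is only to show that the resulting matrix is a character matrix, not a numeric one; I have not checked what PCC.disagreement tests internally.

```r
# First four rows of the example above, inlined instead of read from a file.
# The header has one field fewer than the data rows, so read.table() uses
# the first column as row names.
lines <- c(
  'A D F T P',
  '1 "1" "2" "2" "2" "1,2"',
  '2 "1" "2" "1,2" "2" "1"',
  '3 "1" "1" "1" "1" "2"',
  '4 "1,3" "1,2" "1" "2" "3"'
)

# colClasses = "character" is an assumption made here to keep every column
# as text; the original call did not set it.
mydata <- read.table(text = lines, header = TRUE, colClasses = "character")
mydata <- as.matrix(mydata)

is.numeric(mydata)    # FALSE: cells such as "1,2" cannot be stored as numbers
is.character(mydata)  # TRUE
dim(mydata)           # 4 rows (variant locations), 5 columns (witnesses)
```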

Jean-Baptiste-Camps commented 6 years ago

Hi, thanks for your interest in the package! I have not managed to reproduce this bug so far. Did you install from CRAN? Perhaps you are still on a version < 3?

For k-medoids, I will correct that. The error occurs because you can't have more clusters than individuals.
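To illustrate the constraint quoted in the error message (k must lie in 1..n-1), here is a rough sketch of a guard. The helper safe_k and the toy vector are hypothetical and not part of stemmatology; how the package will actually be corrected is not specified here.

```r
# Hypothetical helper (not part of stemmatology): clamp a requested cluster
# count to the range 1..(n - 1) that the pam() error message asks for.
safe_k <- function(n_obs, k_requested) {
  stopifnot(n_obs >= 2)
  max(1L, min(as.integer(k_requested), as.integer(n_obs) - 1L))
}

x <- c(1, 2, 10, 11)        # four made-up observations
k <- safe_k(length(x), 5)   # a request for 5 clusters is clamped to 3

# Only runs if the 'cluster' package is installed.
if (requireNamespace("cluster", quietly = TRUE)) {
  fit <- cluster::pam(x, k)
  print(fit$clustering)
}
```

This also explains why duplicating the example matrix made the error disappear: more rows means a larger n, so the requested k fits.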

Jean-Baptiste-Camps commented 6 years ago

PS: as a side note, the handling of alternateReadings is not fully implemented in the stemma building functions, because I do not have a good algorithm for that yet (and also because cases with alternateReadings on the same witness are exceedingly rare in the Romance texts I mostly work with).

GGoetzelmann commented 6 years ago

@Jean-Baptiste-Camps thank you for your reply. Yes, I installed from CRAN; the installed version was 0.3.1. I tried today to install the GitHub version, but I am not sure I succeeded. Both show the same version (via sessionInfo()), I think.

Since you cannot reproduce the problem, I tried to use the steps in the example from the readme instead of the interactive PCC function. On my minimal example above, I used a threshold of 0.1 and skipped the step with "myNewData = PCC.equipollent"

So basically:

> myConflicts = PCC.conflicts(mydata,alternateReadings=TRUE)
> myConflicts = PCC.overconflicting(myConflicts, ask = FALSE, threshold = 0.1)
> myNewData = PCC.elimination(myConflicts)
> myConflicts = PCC.conflicts(myNewData,alternateReadings=TRUE)

myNewData then is

   A   D   F     T   P
2  "1" "2" "1,2" "2" "1"
3  "1" "1" "1"   "1" "2"
6  "1" "2" "1,2" "2" "1"
7  "1" "1" "1"   "1" "2"
10 "1" "2" "1,2" "2" "1"
11 "1" "1" "1"   "1" "2"
14 "1" "2" "1,2" "2" "1"
15 "1" "1" "1"   "1" "2"

and PCC.Stemma(myNewData) fails with Error in PCC.disagreement(tableVariantes, omissionsAsReadings = omissionsAsReadings) : Input is not a numeric matrix. That is true, but there doesn't seem to be a parameter to tell the function otherwise?

I find alternateReadings interesting, because I deal with ancient data sets where it is very likely that a word or part is only partially readable. So imho this looks like a way to say something like "this is either (one of) the word(s) in other witnesses at this position or something different". At least it would be something worth comparing to always using '?' for damaged words. And very fragmentary and therefore uncertain/fuzzy data is a problem for a lot of (phylogenetic) approaches anyway.
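A quick way to see which cells keep the matrix from being numeric is to look for the comma that separates alternate readings. This is a base-R sketch on two rows copied from myNewData above; it assumes that alternate readings are always written with a comma, as in this example.

```r
# Two rows copied from myNewData above.
myNewData <- matrix(
  c("1", "2", "1,2", "2", "1",
    "1", "1", "1",   "1", "2"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2", "3"), c("A", "D", "F", "T", "P"))
)

# grepl() returns a plain vector, so restore the matrix shape afterwards.
multi <- grepl(",", myNewData, fixed = TRUE)
dim(multi) <- dim(myNewData)
dimnames(multi) <- dimnames(myNewData)

which(multi, arr.ind = TRUE)  # row/column positions of multi-reading cells
rowSums(multi)                # number of such cells per variant location
```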

Jean-Baptiste-Camps commented 6 years ago

Ok, many thanks for your very clear message. I understand now. Indeed, there is no algorithm for the moment that allows alternative readings for stemma building (PCC.Stemma is very strict and allows only a single reading per witness; I would have to implement other algorithms to allow that). BUT, actually, there is a way to do exactly what you want to do, if I understand well, which is to use NA for 'not available'/'no answer' (it is a basic R type for missing values, cf. https://www.rdocumentation.org/packages/base/versions/3.5.0/topics/NA). NAs are handled by PCC.Stemma. As you say, they can be problematic, as the algorithm will take into account only the information it has, but it is manageable up to a certain point, as long as there are enough points where the witnesses can be compared.
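As a rough sketch of that NA approach, the snippet below recodes every cell that holds more than one reading as NA and then converts the matrix to numeric. The helper to_single_reading is hypothetical (it is not a function of the package), and it assumes that alternate readings are comma-separated and that all remaining cells contain plain numbers.

```r
# Hypothetical helper (not part of stemmatology): treat any cell with
# several comma-separated readings as missing, then store the matrix
# as numbers.
to_single_reading <- function(m) {
  m[grepl(",", m, fixed = TRUE)] <- NA  # multi-reading cells become NA
  storage.mode(m) <- "double"           # "1" -> 1; dim and dimnames are kept
  m
}

m <- matrix(
  c("1",   "2",   "1,2", "2", "1",
    "1,3", "1,2", "1",   "2", "3"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2", "4"), c("A", "D", "F", "T", "P"))
)

numericData <- to_single_reading(m)
is.numeric(numericData)  # TRUE
numericData
```

The recoding discards the partial information in cells like "1,2", which is the trade-off discussed in this thread.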

GGoetzelmann commented 6 years ago

I see, thanks for the clarification.

I am aware of the NA feature and I find it very useful. In fact I wanted to compare a matrix with NA readings with an alternative encoding which tries to assign multiple readings to fragmented words. That is where I encountered the issue. My data set right now is already quite sparse, so every reading would help, I guess. But atm it is just on-the-side playing around out of curiosity. I'll watch the project with interest for sure.
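One way to get a feel for how sparse an NA-encoded matrix is: count, for each pair of witnesses, the variant locations where both have a reading. This is a base-R sketch on made-up data, not something the package provides, and it says nothing about how PCC.Stemma itself weighs missing values.

```r
# Made-up numeric matrix: rows are variant locations, columns are witnesses.
m <- matrix(
  c(1,  2, NA, 1,    # witness A
    2, NA,  1, 1,    # witness D
    1,  1,  2, NA),  # witness F
  nrow = 4,
  dimnames = list(NULL, c("A", "D", "F"))
)

observed <- 1 * !is.na(m)      # 1 where a reading exists, 0 where it is NA
shared <- crossprod(observed)  # witnesses x witnesses counts
shared
# The diagonal gives the number of readings per witness; off-diagonal
# entries give the number of locations where both witnesses can be compared.
```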