drizopoulos / ltm

Latent Trait Models under IRT

Error with ltm when data includes a specific case #1

Closed. mattantaliss closed this issue 7 years ago.

mattantaliss commented 8 years ago

I've been attempting to calculate item difficulty and discrimination for a large dataset I have, and I keep running into the following error:

    Error in if (any(ind <- pr == 0)) pr[ind] <- sqrt(.Machine$double.eps) :
      missing value where TRUE/FALSE needed

A quick search led me to "Error using ltm R package" and, as suggested there, I tried removing any items (columns) with fewer than two responses. After the error persisted, I also tried removing any people (rows) with fewer than two responses; the error still remained. My next thought was that the error might be related to a lack of memory, so I started processing my data in smaller chunks. After that did not work, I started digging around in the debugger to see how else I might need to clean my data.
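For reference, that cleaning step can be done in one line; this is just a sketch, assuming dat is the 0/1 response matrix with NAs for omitted answers:

    # drop respondents (rows) and items (columns) with fewer than two responses
    dat <- dat[rowSums(!is.na(dat)) >= 2, colSums(!is.na(dat)) >= 2, drop = FALSE]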

I've attached the particular sample of my data that most recently generated the error and was used for debugging. To walk through a sort of back trace, I started in probs.R, where the error is thrown. With these data, the problem is in line 6: no element of pr equals 0, but NAs are present, so any returns NA. It may be that this issue is resolved merely by including na.rm = TRUE in the calls to any, but I'm not proficient enough with IRT to know if that's too late to catch NAs.
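A minimal illustration of that any behaviour (the values here are made up, not taken from the attached data):

    pr <- c(0.2, NA, 0.7)
    any(pr == 0)                # NA, because the comparison contains NA
    if (any(pr == 0)) { }       # Error: missing value where TRUE/FALSE needed
    any(pr == 0, na.rm = TRUE)  # FALSE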

After figuring out the problem in probs.R, I wondered why NAs were in there in the first place. I worked my way back from probs.R through EM.R, ltm.fit.R, and ltm.R, which led me to look at how betas is calculated. Following the execution in start.val.ltm.R, I found the origin of the NAs: the glm call.

This brings us to the very specific case in which ltm produces the error. For Q128 in my data, only two people submitted a response: one was correct and the other was incorrect. The problem is that those two people happened to answer the same number of questions correctly, so Z$z1 is identical for the two observations. The two points sent to glm are (-1.3, 1) and (-1.3, 0); with a constant predictor the slope is inestimable, and hence the NA is produced.
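A minimal reproduction of that glm behaviour outside of ltm (the numbers come from the description above; the rest is assumed):

    # two respondents with identical rest score z1; one correct, one incorrect
    d   <- data.frame(z1 = c(-1.3, -1.3), y = c(1, 0))
    fit <- glm(y ~ z1, family = binomial, data = d)
    coef(fit)["z1"]  # NA: z1 is constant, so its coefficient cannot be estimated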

For now, I'll work on cleaning my data some more to exclude this specific case, but it would be nice if I could just throw my data into ltm without having to first clean it as mentioned above (namely, removing columns and rows with fewer than two responses and removing this specific case I found).

irt.zip

drizopoulos commented 8 years ago

Thanks for reporting this. I'm a bit reluctant to automatically exclude columns or rows of the data without the intervention of the user. As you noticed, the reason these columns and rows need to be excluded is that they contain very little information. I'd be more inclined to issue a warning alerting the user that some items are potentially problematic, and leave it up to him/her to decide how to proceed and which items specifically to exclude.
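A rough sketch of the kind of pre-check being discussed; the helper name, threshold, and wording are illustrative and not part of the ltm package:

    # warn about items with too few responses or no variation in the responses
    check_items <- function(dat) {
      n_resp   <- colSums(!is.na(dat))
      n_unique <- apply(dat, 2, function(x) length(unique(na.omit(x))))
      bad <- n_resp < 2 | n_unique < 2
      if (any(bad))
        warning("Potentially problematic items (fewer than two responses ",
                "or no response variation): ",
                paste(colnames(dat)[bad], collapse = ", "))
      invisible(colnames(dat)[bad])
    }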

mattantaliss commented 8 years ago

That is quite understandable. If you do go the warning message route, it'd be good to include information on why an item is problematic (e.g., fewer than two unique responses) so the user knows what needs to be corrected.

As a quick update, I did go ahead and clean my data more as I mentioned, but I still kept getting the same error. This time around, though, it was with a different subset of the data, and it seems to be rooted in NaNs coming back from nb[i, ] <- betas[i, ] + solve(hes, sc), which I don't know how to correct via data cleaning.
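For what it's worth, once any element of the score vector or Hessian is non-finite, the Newton step itself carries the NaN forward; a tiny made-up illustration (not the actual hes and sc from the fit):

    hes <- diag(2)            # stand-in Hessian
    sc  <- c(0.5, NaN)        # one non-finite gradient component
    c(1, 1) + solve(hes, sc)  # 1.5 NaN -- the NaN propagates into the update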