Closed rrojas350 closed 1 year ago
Are there other known cases in which this happens? In any case, I guess the action item for this issue is to perform an audit of the CVR ingestion code to try to fix any other places where it might occur.
Nothing comes to mind but this issue caused enough of a stir that if we can be proactive about a fix it's worth the effort.
Hey guys, exciting to see the recent activity here. I wanted to give some thoughts on CDF and how RCTab might approach dealing with it.
TL;DR: update RCTab CDF parsing code in response to actual vendor CDF implementations and the schema they expect / provide.
CDF is an abstract data schema designed with flexibility in mind. It allows for many different implementations. See the uml diagram in our repo to appreciate that, many of the components (including GPUnit) are totally optional (denoted by 1..0 which indicates the element may appear 1 or 0 times) and this is still "valid" CDF. Many of the elements can be in an array (or not) denoted by 1..* or (the asterisk means "any number of instances").
So, it's certainly possible to build code in RCTab which validates input data against the entire schema (and https://github.com/JDziurlaj has done this) but for example, a missing GPUnit is still technically valid CDF. So what should RCTab do in this case?
https://github.com/JDziurlaj has put forth the idea of CDF "profiles" which are basically just a concrete implementation of CDF, which implement some subset of the elements. (btw he is the foremost expert on CDF so send your questions his way) In this new example, it appears the input data has no GPUnits. Is that valid? What of the other problems @tarheel found? Are they actual problems, and which are ok? Only the CDF data provider can answer these questions, and so I would suggest that RCTab extend its code to ingest new data providers as they emerge.
There are I'm sure many other general improvements which could be done, and I would agree that any input which throws a NPE should at least be caught and translated into a generic parse error.
I believe the easy version of this is done by #309.
We should get JDiurlaj's implementation by April 1 and we can determine then if it's integratable here: https://github.com/usnistgov/cdf-test-method
Is there anything else not encompassed by these two? If not, perhaps we can rename this task to integrate (or explore integrating) that repo.
The repo is now live. I have a few concerns with using it:
I don't have as much familiarity as y'all do with what causes NPEs, but I'm not sure schema validation will be worth the implementation effort currently.
My gut feeling here is that we already have a basic validation system, and we should clean and improve that over time as we get reports of NPEs. If users are frequently complaining about NPEs with unknown causes, having a vague error message won't help any more than the NPE error message. In sum: we should punt on this task, and instead focus on gathering as many files that create NPEs and figuring out user-friendly messaging one at a time while hardening our existing validation system.
I'd love to hear thoughts from those more familiar with this code. @tarheel @HEdingfield?
I think #644 made some progress on this.
I'm not sure what additional immediate action can be taken here -- can we consider this closed until we have a more concrete plan, or move this out of 1.4.0
requirements so we can sit on it longer and come up with a plan?
Agreed, I'd suggest we open a new reminder issue outside of v1.4.0 and link back to your nice summary in https://github.com/BrightSpots/rcv/issues/632#issuecomment-1495046567.
Then we can close this so it's still in the record as being some work we did for 1.4.0.
SGTM. Issue linked above.
See Issue #624 for details.
When RCTab encounters an error with a CVR the messaging needs to be clearer than an NPE and let the user know if something is missing.