Open kevinschaper opened 3 years ago
@kevinschaper This is definitely a bug in KGX. It should be aware that xref
, authors
(and other fields) are multi-valued and parse them as such.
The fix should be straightforward. I'll have a PR for this soon. I think the way the multivalued properties are parsed is a little brittle (my bad!). This can be a good opportunity to tidy a few things up :)
@kevinschaper - is this impacting the pydantic validation you're doing downstream?
Whoops, I should have replied a year ago. I was just looking at whether it would be practical to put kgx validate back into my pipeline and I noticed this problem again.
I'm not sure if I made a different issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would definitely be more useful if it just reported the first N occurrences (5, 10, 30?) of each error, rather than an exhaustive list of each record that failed.
I'm not sure if I made a different issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would definitely be more useful if it just reported the first N occurrences (5, 10, 30?) of each error, rather than an exhaustive list of each record that failed.
Hmm... didn't we fix the KGX reporting some time ago, to be less verbose?
Hi folks. I've been playing with kgx a bit and faced the same problem. I introduced this in my validate test to make the output a little more tractable. Seems to work for me in a validate/fix/validate dev cycle.
validator.validate(the_graph)
if len(validator.errors) > 0:
for e in validator.errors:
error_dict = validator.errors[e]
for k in error_dict.keys():
print("root error type: ", k, " specific errors:", error_dict[k].keys())
for ek in error_dict[k].keys():
print(ek, " element count ", len(error_dict[k][ek]))
n = 0
for element in error_dict[k][ek]:
n += 1
print(element)
if n > n_error_examples:
break
print("error")```
Resolved by https://github.com/biolink/kgx/pull/415 (will be in master once the PR is processed.. not sure about when the patch will show up in an official release...)
@RichardBruskiewich the problem with reporting very large numbers of errors isn't limited to the case here. Suppose that is a separate issue more related to the interface with the validation system.
Hi @goodb, As I noted earlier (above) in this issue, I seem to recall that we rehabilitated the error reporting to be somewhat more concise and possibly hierarchical. That said, I've not looked at that aspect of the code base for well over a year or so.
That said, I'll take another look at it over the next few days to confirm the status of things. If the validation reporting is still problematic, we might wish to open up a separate ticket.
I'm also going to compare our latest Translator validation code (in the pypi.org reasoner-validator package) which generally validates Translator web service data (including knowledge graphs) to check if there is suitable overlap in validation processes to apply a DRY upgrade to the KGX validation.
Cheers Richard
It looks like #415 never made it in. I went looking for this issue because I just got the error with kgx 1.7.2, I wasn't able to run 2.0 in the cli, but I'll try running it as a module.
Hi @kevinschaper, I've just pinged Sierra to inquire about this since I'm not that active on the KGX front right now.
I went ahead and fixed up the changes in #415 to satisfy the good tests (see #439); I don't think they solve the entirety of this issue, so keeping this open pending more changes.
I noticed that I was getting a lot of these validation errors:
When I looked at the file, I found that I had single values in multi-value fields (these were publication nodes, so it was xref and author). For example:
It looks like
FB:FBrf0187243
andP.G. Wilson
aren't recognized as lists simply because they only have a single value - am I right that the format for a multivalue field doesn't include any sort of brackets to denote a list, just the pipe separator?I'm not sure what the right approach to fix it is, but my guess is that the tsv reader should be biolink-aware enough to put the single values into a list if it's multivalued field?