biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs
https://kgx.readthedocs.io
BSD 3-Clause "New" or "Revised" License
116 stars 27 forks source link

Nodes with single value in multivalued field triggering validation errors #354

Open kevinschaper opened 3 years ago

kevinschaper commented 3 years ago

I noticed that I was getting a lot of these validation errors:

[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0013765 - Multi-valued node property 'xref' expected to be of type '<class 'list'>'
[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0013765 - Multi-valued node property 'authors' expected to be of type '<class 'list'>'
[ERROR][INVALID_NODE_PROPERTY_VALUE_TYPE] FB:FBrf0000035 - Multi-valued node property 'xref' expected to be of type '<class 'list'>'

When I looked at the file, I found that I had single values in multi-value fields (these were publication nodes, so it was xref and author). For example:

id  category    name    xref    provided_by authors creation_date   keywords    mesh_terms  summary type
biolink:Publication|biolink:NamedThing  Centrosome inheritance in the male germ line of Drosophila requires hu-li tai-shao function.    FB:FBrf0187243          P.G. Wilson     2005                    ....absract text...      IAO:0000013

It looks like FB:FBrf0187243 and P.G. Wilson aren't recognized as lists simply because they only have a single value - am I right that the format for a multivalue field doesn't include any sort of brackets to denote a list, just the pipe separator?

I'm not sure what the right approach to fix it is, but my guess is that the tsv reader should be biolink-aware enough to put the single values into a list if it's multivalued field?

deepakunni3 commented 3 years ago

@kevinschaper This is definitely a bug in KGX. It should be aware that xref, authors (and other fields) are multi-valued and parse them as such.

The fix should be straightforward. I'll have a PR for this soon. I think the way the multivalued properties are parsed is a little brittle (my bad!). This can be a good opportunity to tidy a few things up :)

sierra-moxon commented 2 years ago

@kevinschaper - is this impacting the pydantic validation you're doing downstream?

kevinschaper commented 2 years ago

Whoops, I should have replied a year ago. I was just looking at whether it would be practical to put kgx validate back into my pipeline and I noticed this problem again.

I'm not sure if I made a different issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would definitely be more useful if it just reported the first N occurrences (5, 10, 30?) of each error, rather than an exhaustive list of each record that failed.

RichardBruskiewich commented 2 years ago

I'm not sure if I made a different issue for it at the time, but the biggest challenge is definitely the number of repeats for each type of error. It would definitely be more useful if it just reported the first N occurrences (5, 10, 30?) of each error, rather than an exhaustive list of each record that failed.

Hmm... didn't we fix the KGX reporting some time ago, to be less verbose?

goodb commented 2 years ago

Hi folks. I've been playing with kgx a bit and faced the same problem. I introduced this in my validate test to make the output a little more tractable. Seems to work for me in a validate/fix/validate dev cycle.


   validator.validate(the_graph)
    if len(validator.errors) > 0:
        for e in validator.errors:
            error_dict = validator.errors[e]
            for k in error_dict.keys():
                print("root error type: ", k, " specific errors:", error_dict[k].keys())
                for ek in error_dict[k].keys():
                    print(ek, " element count ", len(error_dict[k][ek]))
                    n = 0
                    for element in error_dict[k][ek]:
                        n += 1
                        print(element)
                        if n > n_error_examples:
                            break
            print("error")```
RichardBruskiewich commented 2 years ago

Resolved by https://github.com/biolink/kgx/pull/415 (will be in master once the PR is processed.. not sure about when the patch will show up in an official release...)

goodb commented 2 years ago

@RichardBruskiewich the problem with reporting very large numbers of errors isn't limited to the case here. Suppose that is a separate issue more related to the interface with the validation system.

RichardBruskiewich commented 2 years ago

Hi @goodb, As I noted earlier (above) in this issue, I seem to recall that we rehabilitated the error reporting to be somewhat more concise and possibly hierarchical. That said, I've not looked at that aspect of the code base for well over a year or so.

That said, I'll take another look at it over the next few days to confirm the status of things. If the validation reporting is still problematic, we might wish to open up a separate ticket.

I'm also going to compare our latest Translator validation code (in the pypi.org reasoner-validator package) which generally validates Translator web service data (including knowledge graphs) to check if there is suitable overlap in validation processes to apply a DRY upgrade to the KGX validation.

Cheers Richard

kevinschaper commented 1 year ago

It looks like #415 never made it in. I went looking for this issue because I just got the error with kgx 1.7.2, I wasn't able to run 2.0 in the cli, but I'll try running it as a module.

RichardBruskiewich commented 1 year ago

Hi @kevinschaper, I've just pinged Sierra to inquire about this since I'm not that active on the KGX front right now.

sierra-moxon commented 1 year ago

I went ahead and fixed up the changes in #415 to satisfy the good tests (see #439); I don't think they solve the entirety of this issue, so keeping this open pending more changes.