airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Python validate method... #142

Closed bcorrie closed 5 years ago

bcorrie commented 6 years ago

Hi All (@javh I think),

I am using the AIRR TSV python library to process AIRR TSV files - essentially to take an AIRR TSV file generated from igblast and load it into the iReceptor repository. Very useful thus far, the fact that I can do the below with 3 lines of code is brilliant!

I did want to point out one thing that I found to be counter intuitive...

I want to do the following, which seems logical:

    file_handle = open(path, 'r')
    airr_data = airr.read_rearrangement(file_handle)
    if (!airr_data.validate()) abort
    for r in airr_data:
            process an AIRR record

The RearrangementReader.validate() method uses the airr_data iterator to iterate over the file, and once it gets to the end of the file, it closes the file handle. Thus the above workflow doesn't work as the for loop on the airr_data iterator is no longer valid (it is at the end of the file and the handle is closed).

I was wondering if that was the intent as it seems a bit counter-intuitive to me. Would it make more sense to have an airr.validate() method that creates an internal iterator so the validate method doesn't have a side effect on the iterator itself.

If nothing else, maybe add a comment to the validate method that it has the side effect of closing the handle and moving the iterator to the end of the file???

Brian

bcorrie commented 6 years ago

Hmmm in looking at it, there is already the method airr.validate_rearrangement([handle]) that does this, so maybe just some documentation in the RearrangementReader.validate() method that mentions the side effect would do.

javh commented 6 years ago

We discussed this before, but I don't recall the issue number... Yes, I agree it's counter intuitive.

I'll have to look at the implementation, but we might be able to fix this by just removing the automatic closure of the handle and doing a self.handle.seek(0) call at the end of the validate method (to reset the iterator).

Or we could move validate out of the reader. Maybe into the Schema definition, so it works like:

file_handle = open(path, 'r')
if (!RearrangementSchema.validate(file_handle)):
    abort
airr_data = airr.read_rearrangement(file_handle)
for r in airr_data:
    process an AIRR record
schristley commented 6 years ago

I'd thrown this into the API to support the command line airr-tools validate and didn't think too hard about the API semantics. My thought is we should stick with the defined interface which essentially requires two handles, and avoid trying to re-use them.

if (!airr.validate_rearrangement(open(path, 'r')):
   abort
airr_data = airr.read_rearrangement(open(path, 'r'))
for r in airr_data:
   process an AIRR record
laserson commented 6 years ago

We should not be reading all the records just to validate the file. Really, you probably want to validate that the file looks sane (i.e., has an appropriate header), and then validate each row as you read it. So when you instantiate a reader, the constructor should probably take a boolean flag validate=True as a default which checks the header immediately, and also causes a validation check on each record.

laserson commented 6 years ago

We could also have two separate flags so that you can have a valid header but tolerate an invalid record.

schristley commented 6 years ago

Really, you probably want to validate that the file looks sane (i.e., has an appropriate header), and then validate each row as you read it.

I'm fine with adding a mode for on-the-fly validation, and I agree that would be a good thing to add, but please keep the full file validation as currently provided by airr-tools validate

schristley commented 6 years ago

Also, I just added a few basic checks so that we could say we perform validation, but it should be filled out to be more complete. The validation for uniqueness of sequence ids likely needs to be given more thought.

bcorrie commented 6 years ago

If there is a method that validates an entire file and that is all it does, I would suggest that the input is simply a path and not a file handle. The file handling is then opaque to the user of the method.

For example:

if (!airr.validate_rearrangement(path): abort

I also like the idea of validating on the fly, but I agree these are two different things, and maybe belong in different places (should validate_rearrangement be part of the reader?)

schristley commented 6 years ago

@javh and I had a mini-debate about filenames vs. handles for the interface a long time ago. I don't think either of us had a strong opinion, so I think we stuck with handles, but I could be mis-remembering. I suppose there are some rare exception use cases where the programmer wants to do something to the file before passing it to the library, and the handle supports that. I'm not sure if it's a good idea to have a mixed interface with some taking handles and some taking filenames though.

bcorrie commented 6 years ago

Having a validate_rearrangement utility function that takes a handle is probably OK, as long as it is clear what are the pre and post conditions for the handle. With a path, I think there is much less confusion about that as the handle is never exposed, but that isn't the end of the world.

The thing that is confusing is that the RearrangementReader.validate() method is that it has a significant side effect on the handle. Since it is a method of the RearrrangementReader that has a side effect on the handle, it kind of breaks the "iterator"-ness of the iterator 8-)

What about having the validate method validate a single row. This is what Uri was talking about:

airr_data = airr.read_rearrangement(file_handle)
for r in airr_data:
    if airr_data.validate(r)
        process an AIRR record

Here the validate method doesn't change the handle, but it uses the AIRR schema of the RearrangementReader to check the validity of the record.

schristley commented 6 years ago

All of the functions in interface.py are file operations, so there isn't support to mix and match handles with these calls. For example, it's not valid to call airr.read_rearrangement(), read in a few records, then be able to call airr.derive_rearrangement() or airr.merge_rearrangement() and expect them work properly. Likewise airr.validate_rearrangement() cannot mix its handle with the other functions.

On a side note, this is were I get annoyed with python as it isn't true OO. I wanted RearrangementReader.validate() to be a hidden, internal function so the only exposed interface was airr.validate_rearrangement() which operates on the whole file. But everything is exposed everywhere in python. To further complicate the semantics, RearrangementReader acts both as a file-level class (it maintains list of fields that apply to the whole file) and as a row-level class (by providing iteration over the), so if you add a method to RearrangementReader, is it allowed to do file-level stuff, row-level stuff, or both?

Anyways pardon my venting, let's step back and re-design the validation. Here are some things that might be useful to validate (please add if you think of others).

Some of these can be considered file-level as they either only need to be done once when the file is opened, or they need to read through all the rows. While others can be considered row-level as they validate fields for a single row. So how should we design the interface?

javh commented 6 years ago

Handles vs file paths

We went with handles because that lets you use things like string buffers instead of actual files. It's also consistent with the behavior of the modules in the python standard library.

I'm totally fine with having the wrappers (eg, airr.read_rearrangement) take a file path, while the reader/writer class (eg, RearrangementReader) initializes with a handle instead. But I think mixing them so the input is ambiguous (like pandas allows) is non-pythonic. As long as there's a clear division between what takes a file path (string) and what takes a file object (handle), then I think it's fine.

Side note: I think we should remove the closure of the handle upon reaching the stop iteration anyway. That's a hidden, and probably unexpected, convenience operation. Garbage collection will deal with that anyway if a user doesn't explicitly close it or use a with statement (which is recommended practice).

Validation

What's the use case for the validation? It seems to me there are two at play here:

1) Validating an entire file separately from working with it. Eg, uploading it to a server for some task, but it'll sit in a queue before being analyzed. You might want to immediately kick back a message about the file being malformed. 2) Checking a row before performing a task on it.

It seems to me that we could satisfy both with a validate=True argument to RearrangementReader that throws an exception (rather than returning False) for an invalid header (upon init) or row (when yielding the row).

Then just have the wrapper that does the full file validation (airr.validate_rearrangement) return False the first time it hits that exception.

schristley commented 6 years ago

throws an exception

My lack of python understanding here, but can the file continue to be validated after the exception is thrown? Specifically, in the first use case, would want all of the validation errors to be reported instead of stopping after it hits the first one.

javh commented 6 years ago

My lack of python understanding here, but can the file continue to be validated after the exception is thrown? Specifically, in the first use case, would want all of the validation errors to be reported instead of stopping after it hits the first one.

I'll have to check the behavior of the csv module, but I think it's continuable (so put a try/except in the for loop that iterates over the reader and continue to the next record when catching an exception). Though, I feel like it would be better for the first use case to stop on the first exception. Then you only parse as much of the file as needed to validate it.

schristley commented 6 years ago

Though, I feel like it would be better for the first use case to stop on the first exception. Then you only parse as much of the file as needed to validate it.

Okay, then we need a third use case. I coded it right now to give all the errors because that's the most efficient for our server analysis workflow. If we stop on the first error, then go through some elaborate process to re-generate the data, only to re-validate and find there was another error on row 2, you only need to rinse and repeat that a few times before your head explodes. ;-D

javh commented 6 years ago

I think that's the same as the second use case. You'd just save the exception message and ignore the data itself.

schristley commented 6 years ago

I think that's the same as the second use case. You'd just save the exception message and ignore the data itself.

Okay, just so I can run airr-tools validate on a file and get all the errors like I can now then I'm happy. If that needs a parameter to say stop at first or give all errors, then that's fine.

schristley commented 6 years ago

@javh, I wrote some initial code passing validate=True to RearrangementReader where an exception is thrown when validation error is found. The code is on the validate branch. I've been testing with the example bad data file:

airr-tools validate -a bad_data.tsv

The problem I'm having is after __init__ throws a ValidationException, I'm not able to use the reader to continue reading the rows

$ airr-tools validate -a bad_data.tsv 
Validating: bad_data.tsv
File is missing AIRR mandatory fields (sequence,junction).

Error occurred while validating AIRR rearrangement files.
local variable 'reader' referenced before assignment

You can see how I'm trapping the exception and trying to continue here

I think the issue is that the exception essentially aborts the reader = RearrangementReader(handle, validate=True) command and reader isn't assigned the value.

I think the same problem will occur with next(), how do I get it to both throw an exception and return the row at the same time...

javh commented 6 years ago

Hey @schristley, I'll take a look at it either tomorrow or over the weekend if I don't manage to make it into the office tomorrow. Not sure on the __init__() issue, but that might be something you can move into __iter__() instead, which is where initializing of the iterator occurs.

I don't think you can get it to return the row and throw and exception from __next__(). Does it need to? I was thinking the exception would be raised inside __next__(), so that if you validate you either get a row or an exception.

schristley commented 6 years ago

Actually @javh, never mind. I was thinking about having the validation do all types of checks that I thought would be useful but after thinking about it, I realized that many of them aren't true validation errors, so we cannot throw errors for them. The only thing we can really enforce for validation is stuff in the spec like the required fields and types. So I'll clean up the code to just raise the exception and be done with it.

javh commented 6 years ago

What's our decision on this?

having the wrappers (eg, airr.read_rearrangement) take a file path, while the reader/writer class (eg, RearrangementReader) initializes with a handle instead.

schristley commented 6 years ago

Brian might need to try out the new code with his workflow to see if solves his issues. With validate=True passed to RearrangementReader, there shouldn't be a need to call the validate functions directly, so maybe the handle/file issue is a moot point, or at least a separate question from validation.

javh commented 6 years ago

Brian might need to try out the new code with his workflow to see if solves his issues. With validate=True passed to RearrangementReader, there shouldn't be a need to call the validate functions directly, so maybe the handle/file issue is a moot point, or at least a separate question from validation.

Okay. I think I'll throw in my vote for file paths in the Interface functions and handles in the IO classes. Just to justify the existence of this redundancy in my mind. :)

javh commented 6 years ago

@bcorrie, we're going to make a couple extra tweaks to the validate operation today. I'll drop a note in here when it's done, so you can test once it's semi-final.

laserson commented 6 years ago

Is there a patch I can look at somewhere?

javh commented 6 years ago

All the v1.2 prep changes are going into the rearrangement_updates branch. Making some adjustments now.

javh commented 6 years ago

Done tweaking the validation code. Still needs testing.

bcorrie commented 6 years ago

@javh do you want me to test this now?

javh commented 6 years ago

@bcorrie so, i just realized this issue took a bit of a sideways turn... It's ready to be tested, but looking back at your original post, you'll still hit the end of the handle after validation. So this is still needed:

file_handle = open(path, 'r')
if (!airr.validate_rearrangement(file_handle)):
    complain

file_handle.seek(0)
airr_data = airr.read_rearrangement(file_handle)
for r in airr_data:
    process an AIRR record

But it won't close the handle anymore.

But this should also be an option:

file_handle = open(path, 'r')
airr_data = airr.read_rearrangement(file_handle, validate=True)
for r in airr_data:
    try:
        process an AIRR record
    except:
        panic and abort

Also, it seems like opinions are few on the file path vs handle issue, so barring objections, I'll go ahead and change that tomorrow, which would really address your originally issue... (Ie, have airr.read_rearrangement take a path and airr.io.RearrangementReader take a handle.)

bcorrie commented 6 years ago

@javh sounds good...

My use case is that I want to validate the entire file before I process anything. My processing is putting data into a repository, and I don't want to enter any data into the repository if some of the data in the file is not valid. So your first option is the correct one for me.

Either way, it should be made clear in the documentation that the side effect of:

airr.validate_rearrangement(file_handle)

is to modify the file handle. I suspect that after the call one probably can't be sure where the file handle is, correct? It might be at the beginning (if the header fails) or it might be part way through the file (at record N if record N is invalid), or it might be at the end if the file is valid.

Is that correct? If so, that should be made clear in the docs - as that is what through me originally 8-)

Brian

javh commented 6 years ago

It should now run all the way through to the end of the handle regardless of how many errors it hits. Which is, I think, normal expected behavior for a file handle. But it'll probably be more intuitive if we just change it to airr.validate_rearrangement("path"). The handle API stuff can then be the way to handle skipping individual rows that fail validation.

bcorrie commented 6 years ago

I like that interface... 8-)

schristley commented 6 years ago

It's probably best to use path so that it's obvious those functions are atomic on the whole file. No matter how much you document, somebody is gonna try to mix read, merge, validate, create, etc., with the same handle and then complain that it doesn't work as they expect ;-)

@bcorrie, your use case is mine as well, which why I had the command line airr-tools validate as a simple check, but I do see the usefulness to walking through the records yourself. Specifically now, as the validation can only do minimal checks based upon the schema, I'll need to write my own code that does consistency checks on the field values.

javh commented 6 years ago

Okay guys, I swapped the behavior to accept a single file name instead of a list of handles. Test it out?

schristley commented 6 years ago

Couple minor changes but otherwise looks good!

I don't like checking the field name to determine coordinate shift. I think we should have an attribute in the schema for that, not sure if we want to do that now or later? Maybe also move that to a schema function.

RearrangementWriter write() is doing a validation check, should we get rid of that, keep it but have it call the proper validate function, or leave it as is?

javh commented 6 years ago

I don't like checking the field name to determine coordinate shift. I think we should have an attribute in the schema for that, not sure if we want to do that now or later? Maybe also move that to a schema function.

Not a huge fan either, but the alternative is a lot more code.

RearrangementWriter write() is doing a validation check, should we get rid of that, keep it but have it call the proper validate function, or leave it as is?

I didn't even notice that. Let's just remove it?

schristley commented 6 years ago

Not a huge fan either, but the alternative is a lot more code.

I don't think it's that much, it would look like the existing type function. Not sure what is a good name though... Something like this:

    def is_coordinate(self, field):
        """
        Determine if field is a coordinate that needs to be shifted.
        Arguments:
          name (str): field name.
        Returns:
          bool: True if field is a coordinate
        """
        field_spec = self.properties.get(field, None)
        field_type = field_spec.get('x-coordinate', True) if field_spec else False
        return field_type

we would need to add the attribute to the schema too

schristley commented 6 years ago

I didn't even notice that. Let's just remove it?

Yeah, probably. If I remember correctly, we are enforcing that the required fields are in the output file, regardless of whether individual rows have the data or not.

That makes me wonder, are we enforcing that a row returned during reading has the required fields in it, even if the file does not?

javh commented 6 years ago

I don't think it's that much, it would look like the existing type function.

Yeah, we'll just have to add and x-coordinate-start and x-coordinate-end entries to everything relevant in the schema. (Because only the start needs to be shifted.) Not a huge deal, I suppose. Should we do that?

That makes me wonder, are we enforcing that a row returned during reading has the required fields in it, even if the file does not?

No. It returns only whatever columns are in the file (including null values).

javh commented 6 years ago

x-coordinate-start and x-coordinate-end

Err... Or:

x-coordinate: start
x-coordinate: end
x-coordinate: length

Instead of:

x-coordinate-start: true
x-coordinate-end: true

Which one?

javh commented 6 years ago

Wait... that validation check in the writer is actually just a debug statement. I don't think we need to change that. It just issues a warning if debug=True.

schristley commented 6 years ago

Wait... that validation check in the writer is actually just a debug statement. I don't think we need to change that. It just issues a warning if debug=True.

That's true, doesn't hurt to leave it in.

javh commented 6 years ago

I checked that this is continuable:

data = airr.read_rearrangement('data.tsv', validate=True)
for r in data:
    try:
        do_something(r)
    except ValidationError:
        continue

It works. It won't if you have a header validation error, but I think that's fine. It's predictable behavior. (And I don't know away to fix that anyway with a for loop structure.)

So I think we're done here? I'll wait for @bcorrie to test before closing, but I think we can tag v1.2.0 now.

bcorrie commented 5 years ago

@javh I think we can close this... Certainly my use case is now working... Will leave to you to agree and close the issue 8-)

javh commented 5 years ago

@bcorrie, sounds good.