PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
70 stars 6 forks source link

Add checks for MM and ML tag consistency #23

Closed ctsa closed 2 years ago

ctsa commented 2 years ago

Check for more inconsistent BAM aux field conditions and better describe error states when these occur.

Cancel remaining futures when an exception occurs so that the script doesn't stall on failure.


These changes improve the pileup scripts error messages when invalid (or unhandled variants of) MM/ML tags are encountered, using bam output from an older version of LIMA as an example, the output is now:

Exception thrown in worker process 1913520: Exception thrown while processing read m64107_201211_205017/160762333/ccs: Base modification offsets in MM tag for modification type 'C+m' are inconsistent with read length

...with an immediate exit, versus the previous:

Exception thrown in worker process 1913981: list index out of range

...followed by a what looks like a wait for every other process future to fail with the same issue.

By my tests, output is otherwise identical on valid MM/ML records.

Please let me know if you see any problems with these edits.

ctsa commented 2 years ago

Yes. First of all sorry there was a file-wide autoformat that ran in the IDE and didn't pull that back out before the PR, so that's obscuring things a bit.

get_modline() will return None (new behavior)

The old logic was compressed onto one line next(x[len(modcode)+1:] for x in mmtag.split(';') if x.startswith(modcode)), and was designed to throw an exception to signal that the case of an empty MM tag (meaning just "MM:Z:"). This prevented us from using exceptions to convey any other type of error, so that case is now made explicit with the None value/empty list return.

parse_mmtag() will then return an empty list (vs. list of base indices) (new behavior)

If the MM tag is empty, it was previously returning an empty generator via the try/except, now it will return an empty list. Most of the generators used for MM/ML parsing were changed to lists here to improve error reporting. This makes everything run a little slower but allows us to more easily check for and correctly describe more errors (like a mismatch in the number of MM and ML values).

parse_mltag() will return an empty list (old behavior)

Yes, previously an empty generator but same idea.

get_mod_dict() will return an empty dictionary (this could happen previously too?)

Yes.