NAL-i5K / GFF3toolkit

Python programs for processing GFF3 files

Update error handling in gff3_QC #114

Closed mpoelchau closed 2 years ago

mpoelchau commented 3 years ago
  1. Create new classes for error (definitely violates the specification), warning (probably violates the specification), and info (might be worth checking).

    • Research the best, or most standard, way to handle these.
  2. Modify GFF3toolkit so that the gff3_QC error messages below use the flag shown in the Flag type column:

Intra-model: Multiple features within a model (Ema)


The error category 'Intra-model' collects formatting errors that can be
found by jointly considering multiple features within a gene model, such
as gene, mRNA, exon, and CDS features. Errors in this category are given
an 'Error\_Code' starting with 'Ema'.

+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Error\_Code   | Error\_Tag                                                                              | Flag type   |
+===============+=========================================================================================+============================+
| Ema0001       | Parent feature start and end coordinates exceed those of child features                 | Warning                        |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0002       | Protein sequence contains internal stop codons                                          | Warning                         |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0003       | This feature is not contained within the parent feature coordinates                     | Warning                        |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0004       | Incomplete gene feature that should contain at least one mRNA, exon, and CDS            | Info                         |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0005       | Pseudogene has invalid child feature type                                               | Info (we need to replace this function in the future)                        |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0006       | Wrong phase                                                                             | Info (we need to replace this function in the future)                         |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0007       | CDS and parent feature on different strands                                             | Warning                        |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0008       | Warning for distinct isoforms that do not share any regions                             | Warning                         |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0009       | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found   | Warning                         |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
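
To illustrate what an intra-model check compares, here is a minimal sketch of Ema0003 and Ema0007. It assumes features are plain dictionaries with `type`, `start`, `end`, and `strand` keys, and the function name `check_child` is hypothetical. It is not the toolkit's implementation.

```python
def check_child(parent, child):
    """Return the Ema codes (from the table above) that a child feature triggers."""
    codes = []
    # Ema0003: the child must lie within the parent's start..end interval.
    if child["start"] < parent["start"] or child["end"] > parent["end"]:
        codes.append("Ema0003")
    # Ema0007: a CDS must be on the same strand as its parent.
    if child["type"] == "CDS" and child["strand"] != parent["strand"]:
        codes.append("Ema0007")
    return codes


# Toy records (assumed layout, not parsed from a real GFF3 file).
mrna = {"type": "mRNA", "start": 100, "end": 500, "strand": "+"}
cds = {"type": "CDS", "start": 90, "end": 300, "strand": "-"}
print(check_child(mrna, cds))  # ['Ema0003', 'Ema0007']
```

Under the proposal above, both findings would be reported as warnings.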

Inter-model: Multiple features across models (Emr)

The error category 'Inter-model' collects formatting errors that can be found by comparing multiple gene models. Errors in this category are given an 'Error_Code' starting with 'Emr'.

+---------------+----------------------------------+----------------------------+
| Error_Code    | Error_Tag                        | Flag type                  |
+===============+==================================+============================+
| Emr0001       | Duplicate transcript found       | Warning                    |
+---------------+----------------------------------+----------------------------+
| Emr0002       | Incorrectly split gene parent?   | Warning                    |
+---------------+----------------------------------+----------------------------+
| Emr0003       | Duplicate ID                     | Error                      |
+---------------+----------------------------------+----------------------------+

Single feature (Esf)



The error category 'Single Feature' collects formatting errors that can
be found by searching the GFF3 file line by line. Errors in this
category are given an 'Error\_Code' starting with 'Esf'.

+---------------+--------------------------------------------------------------------------+----------------------------+
| Error\_Code   | Error\_Tag                                                               | Flag type                  |
+===============+==========================================================================+============================+
| Esf0001       | Feature type may need to be changed to pseudogene                        | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0002       | Start/Stop is not a valid 1-based integer coordinate                     | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0003       | strand information missing                                               | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0004       | Seqid not found in any ##sequence-region                                 | Error                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0005       | Start is less than the ##sequence-region start                           | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0006       | End is greater than the ##sequence-region end                            | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0007       | Seqid not found in the embedded ##FASTA                                  | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0008       | End is greater than the embedded ##FASTA sequence length                 | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0009       | Found Ns in a feature using the embedded ##FASTA                         | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0010       | Seqid not found in the external FASTA file                               | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0011       | End is greater than the external FASTA sequence length                   | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0012       | Found Ns in a feature using the external FASTA                           | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0013       | White chars not allowed at the start of a line                           | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0014       | "##gff-version" missing from the first line                              | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0015       | Expecting certain fields in the feature                                  | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0016       | ##sequence-region seqid may only appear once                             | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0017       | Start/End is not a valid integer                                         | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0018       | Start is not less than or equal to end                                   | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0019       | Version is not "3"                                                       | Info (we'll need to look into this later)                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0020       | Version is not a valid integer                                           | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0021       | Unknown directive                                                        | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0022       | Features should contain 9 fields                                         | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0023       | escape certain characters                                                | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0024       | Score is not a valid floating point number                               | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0025       | Strand has illegal characters                                            | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0026       | Phase is not 0, 1, or 2, or not a valid integer                          | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0027       | Phase is required for all CDS features                                   | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0028       | Attributes must escape the percent (%) sign and any control characters   | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0029       | Attributes must contain one and only one equal (=) sign                  | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0030       | Empty attribute tag                                                      | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0031       | Empty attribute value                                                    | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0032       | Found multiple attribute tags                                            | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0033       | Found ", " in a attribute, possible unescaped                            | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0034       | attribute has identical values (count, value)                            | Info                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0035       | attribute has unresolved forward reference                               | Info (for now, need to look into this more)                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0036       | Value of a attribute contains unescaped ","                              | Info (for now, need to check whether multiple Target values are possible)                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0037       | Target attribute should have 3 or 4 values                               | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0038       | Start/End value of Target attribute is not a valid integer coordinate    | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0039       | Strand value of Target attribute has illegal characters                  | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0040       | Value of Is\_circular attribute is not "true"                            | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0041       | Unknown reserved (uppercase) attribute                                   | Error                        |
+---------------+--------------------------------------------------------------------------+----------------------------+
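
One possible way to represent the three flag types is an enum plus a lookup table keyed by eCode. This is only a sketch: the names `Flag`, `FLAG_BY_CODE`, and `flag_for` are hypothetical and do not come from ERROR.py, and only a handful of entries from the tables above are included.

```python
from enum import Enum


class Flag(Enum):
    ERROR = "Error"      # definitely violates the specification
    WARNING = "Warning"  # probably violates the specification
    INFO = "Info"        # might be worth checking


# A few entries copied from the tables above; the dictionary name is hypothetical.
FLAG_BY_CODE = {
    "Ema0001": Flag.WARNING,
    "Ema0004": Flag.INFO,
    "Emr0003": Flag.ERROR,
    "Esf0002": Flag.ERROR,
    "Esf0021": Flag.INFO,
}


def flag_for(code):
    """Look up the flag type for an eCode; unknown codes raise KeyError."""
    return FLAG_BY_CODE[code]


print(flag_for("Emr0003").value)  # Error
```

Keeping the flag in one table means a later reclassification (for example, the entries marked "for now") would only need a one-line change.
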
ZhiXuanLai commented 3 years ago

gff3.py (line 866) uses an eCode, Esf0042, with the message 'Unresolved forward reference'. However, Esf0042 is not documented in the full gff3_QC documentation, nor is it listed in the dictionary in ERROR.py. It may need to be documented.
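
A gap like this could be caught automatically by scanning the source for eCodes and comparing them with the documented ones. The sketch below assumes codes always look like `Ema`, `Emr`, or `Esf` followed by four digits. The helper name `undocumented_codes` and the toy inputs are hypothetical stand-ins for gff3.py and the ERROR.py dictionary.

```python
import re

# Matches codes such as Ema0001, Emr0003, Esf0042.
ECODE = re.compile(r"\bE(?:ma|mr|sf)\d{4}\b")


def undocumented_codes(source_text, documented):
    """Return eCodes that occur in source_text but are missing from documented."""
    return sorted(set(ECODE.findall(source_text)) - set(documented))


# Toy inputs; a real check would read the module source and the real dictionary.
source = (
    "error = {'eCode': 'Esf0042', 'message': 'Unresolved forward reference'}\n"
    "other = {'eCode': 'Esf0041'}\n"
)
documented = {"Esf0041": "Unknown reserved (uppercase) attribute"}
print(undocumented_codes(source, documented))  # ['Esf0042']
```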

ZhiXuanLai commented 3 years ago

Attached is the documentation for the three categories of general error handling in the gff3_QC code (function4gff.py, intra_model.py, inter_model.py, gff3_QC.py, and gff3.py): error handling in gff3_QC.docx

ZhiXuanLai commented 3 years ago

Both gff3 QC errors and program errors are currently handled with the predefined log levels of the logging module, so we considered defining separate levels for gff3 QC errors (such as gff3_ERROR). This can be done with the logging.addLevelName function, set up as in the following example.

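import logging

# Register a custom level just below DEBUG (10).
DEBUG_LEVELV_NUM = 9
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")

def debugv(self, message, *args, **kws):
    """Log 'message' at the custom DEBUGV level."""
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        # Yes, logger takes its '*args' as 'args'.
        self._log(DEBUG_LEVELV_NUM, message, args, **kws)

# Attach the helper so that every Logger instance gains a debugv() method.
logging.Logger.debugv = debugv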

(reference: https://newbedev.com/how-to-add-a-custom-loglevel-to-python-s-logging-facility)

However, we decided against it. The gff3 QC errors are already written to the QC report with a clear error message for each eCode, so defining new levels would add unnecessary complexity. In addition, the Python logging documentation advises against defining custom levels when developing a library. As a result, the error update for this issue will keep the current handling method.
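
For comparison, a minimal sketch of staying with the built-in levels: each flag type is mapped to an existing logging level, and the eCode goes into the message. The names `LEVEL_BY_FLAG` and `log_qc` are hypothetical. This is not how gff3_QC currently writes its report.

```python
import logging

# Hypothetical mapping from QC flag types to the standard library's built-in levels.
LEVEL_BY_FLAG = {
    "Error": logging.ERROR,
    "Warning": logging.WARNING,
    "Info": logging.INFO,
}


def log_qc(logger, code, flag, message):
    """Log one QC finding at the built-in level that matches its flag type."""
    logger.log(LEVEL_BY_FLAG[flag], "[%s] %s", code, message)


logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log_qc(logging.getLogger("gff3_QC_sketch"), "Emr0003", "Error", "Duplicate ID")
```

No call to logging.addLevelName is needed, so other code that shares the logging configuration sees only the standard levels.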

mpoelchau commented 2 years ago

Closed via #116 .