EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Mikado util stats error on NCBI gff3 #226

Closed bbista closed 5 years ago

bbista commented 5 years ago

I was trying to look at the stats for a gff3 file I downloaded off NCBI. I get this error message.

mikado util stats GCF_000241765.genomic.gff genomic.stats
/home/bbista/.local/lib/python3.6/site-packages/Mikado/configuration/configurator.py:529: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  scoring = yaml.load(scoring_file)
2019-10-02 19:04:43,336 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-02 19:04:43,336 - main - __init__.py:125 - ERROR - main - MainProcess - gene-LOC112059410
{}
Traceback (most recent call last):
  File "/home/bbista/.local/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
    args.func(args)
  File "/home/bbista/.local/lib/python3.6/site-packages/Mikado/subprograms/util/stats.py", line 711, in launch
    calculator()
  File "/home/bbista/.local/lib/python3.6/site-packages/Mikado/subprograms/util/stats.py", line 335, in __call__
    self.parse_input()
  File "/home/bbista/.local/lib/python3.6/site-packages/Mikado/subprograms/util/stats.py", line 324, in parse_input
    current_gene.add_exon(record)
  File "/home/bbista/.local/lib/python3.6/site-packages/Mikado/loci/reference_gene.py", line 165, in add_exon
    raise AssertionError("{}\n{}".format(parent, self.transcripts, row))
AssertionError: gene-LOC112059410
{}

Do you have any idea what is going wrong?

Best, Basanta

lucventurini commented 5 years ago

Dear @bbista, thank you for reporting this bug. I just inspected the GFF you mentioned (which I presume is from this folder: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/241/765/GCF_000241765.3_Chrysemys_picta_bellii-3.0.3/ ) and the problem stems from the fact that gene-LOC112059410 is a pseudogene without any transcript feature associated with it. That's kinda invalid under the GFF ontology, and Mikado was explicitly written not to accommodate such a case.
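A note for reading the error above: the assertion message comes from a format string with two placeholders but three arguments, so it prints the parent ID followed by the gene's transcripts collection, and the third argument (row) is silently dropped by str.format. The empty {} is therefore consistent with a gene that has no transcripts registered. Below is a minimal sketch reproducing only the message; the values are stand-ins copied from the log, not Mikado's real objects, and the row placeholder is hypothetical.

```python
# Stand-in values mirroring the log; these are NOT Mikado's real objects.
parent = "gene-LOC112059410"
transcripts = {}                    # assumed dict-like and empty, as the log suggests
row = "<the offending GFF line>"    # hypothetical placeholder for the exon record

# Same format string as in the traceback: two placeholders, three arguments.
# str.format ignores surplus positional arguments, so `row` never appears.
message = "{}\n{}".format(parent, transcripts, row)
print(message)
```

So the message identifies which parent the exon pointed to, but not the line that triggered it.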

Looking at the GFF file in more detail, there honestly seem to be a lot of similar problems, such as coding genes without mRNAs, or tRNAs without a gene parent. All of these break the gene ontology and Mikado's model of what a GFF should look like.
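To get a feel for how widespread these cases are in a given file, something like the following sketch could be used. It is not how Mikado parses GFF; it is a standalone check written under a few assumptions: the feature-type sets are illustrative rather than exhaustive, the function names are made up for this example, and the embedded records are invented test data, not lines from the NCBI file.

```python
import io
from collections import defaultdict

# Illustrative type sets (assumptions, not an exhaustive Sequence Ontology list).
GENE_TYPES = {"gene", "pseudogene"}
TRANSCRIPT_TYPES = {"mRNA", "transcript", "tRNA", "rRNA", "lnc_RNA", "ncRNA"}
SUBFEATURE_TYPES = {"exon", "CDS"}


def parse_attributes(field):
    """Turn 'ID=x;Parent=y,z' into {'ID': 'x', 'Parent': 'y,z'}."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_hierarchy(handle):
    """Return (genes lacking a transcript-level child, parentless transcripts)."""
    gene_ids = []
    child_types = defaultdict(set)   # parent ID -> feature types of its children
    orphans = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                 # skip anything that is not a 9-column record
        ftype = cols[2]
        attrs = parse_attributes(cols[8])
        fid = attrs.get("ID")
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        if ftype in GENE_TYPES and fid:
            gene_ids.append(fid)
        for parent in parents:
            child_types[parent].add(ftype)
        if ftype in TRANSCRIPT_TYPES and not parents:
            orphans.append(fid)
    # A gene is flagged when it has no children, or only exon/CDS children.
    flagged = [g for g in gene_ids
               if not (child_types.get(g, set()) - SUBFEATURE_TYPES)]
    return flagged, orphans


# Invented records: one well-formed gene, one pseudogene whose exon points
# straight at the gene, and one tRNA with no Parent attribute.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene-A",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=rna-A;Parent=gene-A",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=exon-A;Parent=rna-A",
    "chr1\tsrc\tpseudogene\t200\t300\t.\t+\t.\tID=gene-B",
    "chr1\tsrc\texon\t200\t300\t.\t+\t.\tID=exon-B;Parent=gene-B",
    "chr1\tsrc\ttRNA\t400\t470\t.\t-\t.\tID=rna-C",
])

genes_without_transcripts, orphan_transcripts = check_hierarchy(io.StringIO(EXAMPLE))
print(genes_without_transcripts)
print(orphan_transcripts)
```

On a real file, an open file handle can be passed in place of the io.StringIO object. Treating exon and CDS as the only sub-transcript types is a simplification; a file with other sub-features would need that set extended.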

I also tried using the GTF, cleaning it up first with GffRead, but to no avail. The only solutions are

lucventurini commented 5 years ago

Dear @bbista, I have started fixing the problems you found.

With the latest commit, mikado util stats is now able to parse the file appropriately. I will now work on making mikado compare compatible as well.

Changes will be reflected in Mikado2 (and live in Mikado 2.0rc6).

Kind regards

lucventurini commented 5 years ago

Current status: Mikado now supports this problematic GFF in util stats, util convert, compare, and prepare. The only utility left to fix before this issue can be closed is mikado util grep.