daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
282 stars 76 forks source link

create_db() does not parse directives from GFF files starting in v0.11.0 #213

Open dtdoering opened 1 year ago

dtdoering commented 1 year ago

I am trying to read in a GFF that doesn't adhere to spec so I can use gffutils to fix it and then write out a corrected file. I've gotten everything working, except that the GFF header directive(s) don't appear in the output file. It appears they are not being parsed upon creation of the database.

Interestingly, this only happens when a FeatureDB is created from a file, not from e.g. a dedent()ed string as in the existing parser_test.py.

This behavior is exhibited in v0.11.1 and v0.11.0, but not v0.10.1 (thus, a workaround is to downgrade to v0.10.1).

To reproduce:

Use gffutils > v0.10.1.

Create a test file test.gff with the following:

##gff-version 3
.   .   .   .   .   .   .   .
.   .   .   .   .   .   .   .
.   .   .   .   .   .   .   .
.   .   .   .   .   .   .   .

Example code:

import gffutils
db = gffutils_create_db('test.gff', dbfn='test.db', force=True)
print(len(db.directives))

Expected output:

1

Observed output:

0

System/environment info:

OS: GNU/Linux

Python: 3.8.5 (still happens with 3.8.13)

Conda environment:

name: gffutils-py3_8_5
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - argcomplete=3.0.5=pyhd8ed1ab_0
  - argh=0.27.2=pyhd8ed1ab_0
  - biopython=1.81=py38h1de0b5d_0
  - ca-certificates=2022.12.7=ha878542_0
  - gffutils=0.11.1=pyh7cba7a3_0
  - ld_impl_linux-64=2.40=h41732ed_0
  - libblas=3.9.0=16_linux64_openblas
  - libcblas=3.9.0=16_linux64_openblas
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=12.2.0=h65d4601_19
  - libgfortran-ng=12.2.0=h69a702a_19
  - libgfortran5=12.2.0=h337968e_19
  - libgomp=12.2.0=h65d4601_19
  - liblapack=3.9.0=16_linux64_openblas
  - libopenblas=0.3.21=pthreads_h78a6416_3
  - libsqlite=3.40.0=h753d276_0
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libzlib=1.2.13=h166bdaf_4
  - ncurses=6.3=h27087fc_1
  - numpy=1.24.2=py38h10c12cc_0
  - openssl=1.1.1t=h0b41bf4_0
  - packaging=23.0=pyhd8ed1ab_0
  - pip=23.0.1=pyhd8ed1ab_0
  - pyfaidx=0.7.2.1=pyh7cba7a3_1
  - python=3.8.5=h1103e12_9_cpython
  - python_abi=3.8=3_cp38
  - pyvcf3=1.0.3=pyhdfd78af_0
  - readline=8.2=h8228510_1
  - setuptools=67.6.1=pyhd8ed1ab_0
  - simplejson=3.18.4=py38h1de0b5d_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.40.0=h4ff8645_0
  - tk=8.6.12=h27826a3_0
  - wheel=0.40.0=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - zlib=1.2.13=h166bdaf_4
prefix: /home/dtdoering__lbl.gov/.conda/envs/gffutils-py3_8_5
daler commented 1 year ago

@dtdoering I'm unable to reproduce in a test -- see the new test at https://github.com/daler/gffutils/pull/221/commits/a4b443b1eb8f9c17d2b6664f9c63f53b8b1cb0a9 and passing tests here.

Are you able to reproduce this with the latest in the v0.12rc branch? Or, if you don't get around to it for a little while, with v0.12 once it's released?

dtdoering commented 9 months ago

Hey @daler, getting back around to this now!

I pasted your test_issue_213 function into a new script, and am getting the same results as you on the latest v0.12 tag (i.e. no issue).

However, I must have done a poor job with my initial description of the issue because I am now only able to reproduce the issue again by supplying a dialect argument in the create_db calls in your test function. This is something I had been/am doing in my real-world code when i first encountered the issue, so I'm not sure how I was reproducing the issue without that argument. My apologies! 😅

At any rate, by supplying dialect={'fmt': 'gff3'} to the create_db calls in your test function, I am now indeed getting AssertionErrors, whereas I don't get them if I omit the dialect argument and let it figure out the dialect.

dtdoering commented 9 months ago

For the sake of completeness, here is the script (gffutils_debug.py) that I'm running your test with:

#!/usr/bin/env python

import sys
import os
import tempfile
from textwrap import dedent

gffutils_git_path = os.path.join(os.environ.get('HOME'), 'sft', 'gffutils')

sys.path.insert(1, gffutils_git_path)

import gffutils

print(f"gffutils version: {gffutils.__version__}")

def test_issue_213():
    # GFF header directives seem to be not parsed when building a db from
    # a file, even though it seems to work fine from a string.
    data = dedent(
        """
    ##gff-version 3
    .   .   .   .   .   .   .   .
    .   .   .   .   .   .   .   .
    .   .   .   .   .   .   .   .
    .   .   .   .   .   .   .   .
    """
    )

    # Ensure directives are parsed from DataIterator
    it = gffutils.iterators.DataIterator(data, from_string=True)
    assert it.directives == ["gff-version 3"]

    # Ensure they're parsed into the db from a string
    db = gffutils.create_db(data, dbfn=":memory:", from_string=True, verbose=False, dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives

    # Ensure they're parsed into the db from a file
    tmp = tempfile.NamedTemporaryFile(delete=False).name
    with open(tmp, "w") as fout:
        fout.write(data + "\n")
    db = gffutils.create_db(tmp, ":memory:", dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives
    assert len(db.directives) == 1

    # Ensure they're parsed into the db from a file, and going to a file (to
    # exactly replicate example in #213)
    db = gffutils.create_db(tmp, dbfn='issue_213.db', force=True, dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives
    assert len(db.directives) == 1

test_issue_213()

Then I run it with:

~/gffutils_debug.py