NAL-i5K / GFF3toolkit

Python programs for processing GFF3 files

Add Support for GZIP/BGZIP Compressed Files #105

Closed ctcncgr closed 4 years ago

ctcncgr commented 4 years ago

Hey All,

I really like the tool and prefer its runtime behavior over tools like GenomeTools, which terminates after the first incongruity is found.

One thing: many FASTA and GFF3 files are compressed and indexed with bgzip, whose block compression lets them be served in various browsers and also conserves disk space. It would be awesome if the tool could read these files directly when they are provided. The tool wouldn't need to know whether a file is block-compressed, since bgzip output is valid gzip and any gzip reader can decompress it.
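To illustrate the compatibility point, here is a minimal sketch showing that Python's standard `gzip` module reads a file made of several gzip members written back to back. This is only an approximation of bgzip output: real BGZF blocks also carry extra header fields and an end-of-file marker block, which this example does not produce. The helper names `write_multi_member_gzip` and `read_text` and the sample GFF3 lines are made up for the illustration.

```python
import gzip
import os
import tempfile


def write_multi_member_gzip(path, chunks):
    """Write each chunk as its own gzip member, one after another.

    Assumption: this only mimics the block layout of bgzip; it is not real BGZF.
    """
    with open(path, "wb") as out:
        for chunk in chunks:
            out.write(gzip.compress(chunk.encode("utf-8")))


def read_text(path):
    """Read the whole file as text; gzip.open handles concatenated members."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


# Illustrative GFF3 content, split so each line lands in its own gzip member.
chunks = [
    "##gff-version 3\n",
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1\n",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.gff3.gz")
    write_multi_member_gzip(path, chunks)
    recovered = read_text(path)
```

Because the reader never inspects the block boundaries, the same `gzip.open(path, 'rt')` call should work for plain gzip and for bgzip files alike.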

Below is a diff of the changes I made so that gff3_QC runs on compressed inputs, uncompressed inputs, or a mix of both.

--- a/gff3tool/lib/check_gene_parent/find_wrongly_split_gene_parent.pl
+++ b/gff3tool/lib/check_gene_parent/find_wrongly_split_gene_parent.pl
@@ -19,7 +19,11 @@ my %id2owner = ();
 my $line = 0;
 my $typeflag = 0;
 print "Reading the gff file: $gff...\n";
-open FI, "$gff" or die "[Error] Cannot open $gff.";
+if ( $gff =~ /\.gz$/ ){  # gzip support
+    open FI, "<:gzip", $gff or die "[Error] Cannot open $gff.";
+} else {
+    open FI, "$gff" or die "[Error] Cannot open $gff.";
+}
--- a/gff3tool/lib/gff3/gff3.py
+++ b/gff3tool/lib/gff3/gff3.py
@@ -16,6 +16,7 @@ try:
 except ImportError:
     from urllib.parse import quote, unquote
 import re
+import gzip
 import string
 import logging
 import gff3tool.lib.ERROR as ERROR
@@ -69,7 +70,10 @@ def fasta_file_to_dict(fasta_file, id=True, header=False, seq=False):
     """
     fasta_file_f = fasta_file
     if isinstance(fasta_file, str):
-        fasta_file_f = open(fasta_file, 'r')
+        if fasta_file.endswith('.gz'):
+            fasta_file_f = gzip.open(fasta_file, 'rt')  # gzip support
+        else:
+            fasta_file_f = open(fasta_file, 'r')

     fasta_dict = OrderedDict()
     keys = ['id', 'header', 'seq']
@@ -528,7 +532,10 @@ class Gff3(object):

         gff_fp = gff_file
         if isinstance(gff_file, str):
-            gff_fp = open(gff_file, 'r')
+            if gff_file.endswith('.gz'):
+                gff_fp = gzip.open(gff_file, 'rt')  # gzip support
+            else:
+                gff_fp = open(gff_file, 'r')
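The same open-or-gzip.open branch appears in both Python hunks, so it could be factored into one helper. The sketch below is not part of the patch: `open_text_maybe_gzip` is a hypothetical name, and it checks the two-byte gzip magic number instead of the `.gz` suffix, which is a swapped-in design choice so that compressed files with other extensions are also recognized.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip (and bgzip) file


def open_text_maybe_gzip(path):
    """Hypothetical helper: return a text-mode handle for a plain or gzip file.

    Detection is by magic number, not by file extension (an assumption that
    differs from the '.gz' suffix check used in the diff).
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")


# Small demonstration with made-up file names and a made-up GFF3 record.
record = "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1\n"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.gff3")
    gz_path = os.path.join(tmp, "compressed.gff3")  # gzip data without a .gz suffix

    with open(plain_path, "w") as handle:
        handle.write(record)
    with gzip.open(gz_path, "wt") as handle:
        handle.write(record)

    with open_text_maybe_gzip(plain_path) as handle:
        plain_text = handle.read()
    with open_text_maybe_gzip(gz_path) as handle:
        gz_text = handle.read()
```

The suffix check in the diff is simpler and avoids an extra read of the file; the magic-number check trades that for robustness against misnamed files.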

Anyway, really like the tool and the general idea of the various classes of incongruity with the sequence ontology. Looking forward to further development.

Thank you

mpoelchau commented 4 years ago

Thanks for the contribution, @ctcncgr, and sorry about the slow reply!

This looks great - would you mind submitting a PR, and we'll test it?

ctcncgr commented 4 years ago

Sure, can do. I'll submit a PR tomorrow.

ShangYuChiang commented 4 years ago

Tested. Thanks for the pointers, @ctcncgr. Closing this; feel free to reopen if you notice anything that could be improved.