The function GFF.parse has the option base_dict, a dictionary of SeqRecord objects to which GFF entries are added during parsing. If base_dict is an OrderedDict (or a plain dict, which preserves insertion order since Python 3.7), the input order is lost, because the function parse_in_parts sorts the keys before yielding the records:
def parse_in_parts(self, gff_files, base_dict=None, limit_info=None,
        target_lines=None):
    """Parse a region of a GFF file specified, returning info as generated.
    target_lines -- The number of lines in the file which should be used
    for each partial parse. This should be determined based on available
    memory.
    """
    for results in self.parse_simple(gff_files, limit_info, target_lines):
        if base_dict is None:
            cur_dict = dict()
        else:
            cur_dict = copy.deepcopy(base_dict)
        cur_dict = self._results_to_features(cur_dict, results)
        all_ids = list(cur_dict.keys())
        all_ids.sort()
        for cur_id in all_ids:
            yield cur_dict[cur_id]
The statement all_ids.sort() reorders the keys, so the records are yielded in sorted ID order rather than in the order of base_dict. Is this sorting necessary? If so, would it be possible to add an option, e.g. preserve_order, to GFF.parse so that it can be switched off?
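To illustrate the request, here is a minimal standalone sketch of how such a flag could behave. It does not touch the library itself: iter_records is a hypothetical helper that mirrors only the last four lines of parse_in_parts, preserve_order is the proposed (not existing) option, and plain strings stand in for SeqRecord objects.

```python
def iter_records(cur_dict, preserve_order=False):
    """Yield the values of cur_dict, sorted by key unless preserve_order is set.

    Hypothetical helper: it imitates the tail of parse_in_parts and is not
    part of the GFF parser's API.
    """
    all_ids = list(cur_dict.keys())
    if not preserve_order:
        all_ids.sort()  # current behaviour: lexicographic order of the IDs
    for cur_id in all_ids:
        yield cur_dict[cur_id]


# Strings stand in for SeqRecord objects; keys are in the desired input order.
base = {"chr2": "record-chr2", "chr10": "record-chr10", "chr1": "record-chr1"}

sorted_out = list(iter_records(base))
ordered_out = list(iter_records(base, preserve_order=True))
```

With the default, the records come out in the lexicographic order of their IDs (chr1, chr10, chr2). With preserve_order=True they keep the insertion order of the dictionary. Keeping False as the default would leave existing callers unaffected.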
We are using this function in a package for generating annotated genome assembly files; see https://github.com/NBISweden/EMBLmyGFF3/issues/83.
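Until something like this exists, a possible caller-side workaround is to re-sort the parsed records by the position of their IDs in base_dict. The sketch below assumes that all records have been collected into a list and that each record exposes an id attribute, as SeqRecord does. restore_order and the Record stand-in are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord; only the .id attribute is used here.
Record = namedtuple("Record", ["id"])


def restore_order(records, base_ids):
    """Return records sorted by the position of their id in base_ids.

    Records whose id is not in base_ids are placed at the end, keeping
    their relative order (sorted() is stable).
    """
    rank = {rec_id: i for i, rec_id in enumerate(base_ids)}
    return sorted(records, key=lambda rec: rank.get(rec.id, len(rank)))


base_ids = ["chr2", "chr10", "chr1"]  # e.g. list(base_dict.keys())
parsed = [Record("chr1"), Record("chr10"), Record("chr2"), Record("extra")]
restored = restore_order(parsed, base_ids)
restored_ids = [rec.id for rec in restored]
```

This costs an extra sort on the caller's side and needs all records in memory at once, so it is less suitable when target_lines is used to parse the file in parts.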