Open david4096 opened 8 years ago
The GFF3 specification is unclear on whether one can have multiple attributes with the same name. For some attributes, it indicates that they can be multi-valued, with comma separators.
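To make the two readings concrete, here is a minimal sketch of a lenient column-9 parser that accepts both comma-separated values and repeated attribute names, merging them into one list per name. It is not the code in `ga4gh/gff3Parser.py` (which raises on duplicates, as the traceback in this thread shows); `parse_attrs` is a hypothetical name and the sample attribute values are made up for illustration.

```python
from urllib.parse import unquote


def parse_attrs(column9):
    """Parse a GFF3 attributes column into {name: [values]}.

    Hypothetical helper: repeated names are merged instead of rejected.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing or doubled semicolon
        name, sep, raw = field.partition("=")
        if not sep:
            raise ValueError("attribute without '=': %r" % field)
        # Split on commas first, then percent-decode each value.
        values = [unquote(v) for v in raw.split(",")]
        attrs.setdefault(unquote(name), []).extend(values)
    return attrs


# Illustrative input only, not taken from the RefSeq file.
example = parse_attrs("ID=aln1;idty=0.98;idty=0.95;Dbxref=GeneID:1,HGNC:5")
```

Whether merging is the right behavior depends on how the specification is read; a strict parser could just as reasonably keep raising.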
In this case, the multiple idty attributes are in the match record types, which will not be imported when creating sequence annotations.
Many a GFF file has been preprocessed with awk.
David Steinberg notifications@github.com writes:
RefSeq genes cause the failure below in the generate_gff3_db.py script.
(serverenv)azureuser@ga4gh-david:~/server$ time python scripts/generate_gff3_db.py -i ref_GRCh37.p13_top_level.gff3 -o ref_GRCh37.p13_top_level.db
Running GFF3 parser...
Traceback (most recent call last):
  File "scripts/generate_gff3_db.py", line 134, in <module>
    main()
  File "/home/azureuser/server/scripts/utils.py", line 34, in wrapper
    result = func(*args, **kwargs)
  File "scripts/generate_gff3_db.py", line 130, in main
    g2d.run()
  File "scripts/generate_gff3_db.py", line 72, in run
    gff3Data = gff3.Gff3Parser(self.gff3File).parse()
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 335, in parse
    self._parseLine(gff3Set, line[0:-1])
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 324, in _parseLine
    self._parseRecord(gff3Set, line)
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 303, in _parseRecord
    self._parseAttrs(row[8]))
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 281, in _parseAttrs
    self.fileName, self.lineNumber)
ga4gh.gff3Parser.GFF3Exception: ref_GRCh37.p13_top_level.gff3:1050195: duplicated attribute name: idty

real 4m53.859s
user 4m45.524s
sys 0m8.308s
@diekhans what's your suggested solution in our case?
Pre-filter to discard the match records, since they are the ones with the problem and we don't want them anyway. Maybe just add a pre-filter function attribute to the parser.
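A rough sketch of what such a pre-filter could look like, written as a standalone generator rather than as a change to `Gff3Parser` itself. The names `prefilter`, `is_match_record`, and `MATCH_TYPES` are hypothetical, and the set of type names is an assumption about which column-3 values count as match records in this file.

```python
# Assumed column-3 (type) values to discard; adjust to the file at hand.
MATCH_TYPES = {"match", "cDNA_match"}


def is_match_record(line):
    """Return True for a 9-column GFF3 record whose type is a match type."""
    cols = line.rstrip("\n").split("\t")
    return len(cols) == 9 and cols[2] in MATCH_TYPES


def prefilter(lines, discard=is_match_record):
    """Yield only the lines that the discard predicate does not reject."""
    for line in lines:
        if not discard(line):
            yield line


# Illustrative records only, not taken from the RefSeq file.
sample = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t100\t.\t+\t.\tID=gene0\n",
    "chr1\tRefSeq\tcDNA_match\t1\t50\t.\t+\t.\tID=aln0;idty=1;idty=1\n",
]
kept = list(prefilter(sample))
```

Directives and comment lines pass through untouched because they never have nine tab-separated columns. Wiring a predicate like this into the parser as an optional attribute would keep the default behavior strict.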