ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

RefSeq gff3 parsing #1062

Open david4096 opened 8 years ago

david4096 commented 8 years ago

Running the generate_gff3_db.py script on the RefSeq genes GFF3 file causes the failure below.

(serverenv)azureuser@ga4gh-david:~/server$ time python scripts/generate_gff3_db.py -i ref_GRCh37.p13_top_level.gff3 -o ref_GRCh37.p13_top_level.db
Running GFF3 parser...
Traceback (most recent call last):
  File "scripts/generate_gff3_db.py", line 134, in <module>
    main()
  File "/home/azureuser/server/scripts/utils.py", line 34, in wrapper
    result = func(*args, **kwargs)
  File "scripts/generate_gff3_db.py", line 130, in main
    g2d.run()
  File "scripts/generate_gff3_db.py", line 72, in run
    gff3Data = gff3.Gff3Parser(self.gff3File).parse()
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 335, in parse
    self._parseLine(gff3Set, line[0:-1])
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 324, in _parseLine
    self._parseRecord(gff3Set, line)
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 303, in _parseRecord
    self._parseAttrs(row[8]))
  File "/home/azureuser/server/ga4gh/gff3Parser.py", line 281, in _parseAttrs
    self.fileName, self.lineNumber)
ga4gh.gff3Parser.GFF3Exception: ref_GRCh37.p13_top_level.gff3:1050195: duplicated attribute name: idty

real    4m53.859s
user    4m45.524s
sys     0m8.308s
diekhans commented 8 years ago

The GFF3 specification is unclear on whether a record can have multiple attributes with the same name. For some attributes, it indicates that they can be multi-valued, with the values separated by commas.
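To make the two forms concrete, here is a minimal sketch of a lenient attribute parser that treats a repeated name the same way as a comma-separated value list. It is not the code in ga4gh/gff3Parser.py, which raises on duplicates as the traceback shows; the function name `parse_attributes` and the merge-into-a-list behaviour are assumptions made for illustration.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column into {name: [values]}.

    Hypothetical helper: a repeated attribute name is merged into the
    same list as comma-separated values instead of raising an error.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing or doubled ';'
        name, sep, value = field.partition("=")
        if not sep:
            raise ValueError("attribute without '=': %r" % field)
        # Split on ',' before unescaping, since an escaped comma (%2C)
        # is part of a value rather than a separator.
        values = [unquote(v) for v in value.split(",")]
        attrs.setdefault(unquote(name), []).extend(values)
    return attrs
```

With this sketch, `idty=1;idty=0.95` and `idty=1,0.95` both produce the same two-element list under the key `idty`. Whether that is the right interpretation of a duplicated name is exactly what the specification leaves open.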

In this case, the duplicated idty attributes are in the match record types, which will not be imported when creating sequence annotations.

Many a GFF file has been preprocessed with awk.


macieksmuga commented 8 years ago

@diekhans what's your suggested solution in our case?

diekhans commented 8 years ago

Pre-filter to discard the match records, since they are the ones with the problem and we don't want them anyway. Maybe just add a pre-filter function attribute to the parser.
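A minimal sketch of what such a pre-filter could look like, written as a standalone generator rather than as a change to Gff3Parser. The names `is_match_record`, `filter_gff3_lines` and `MATCH_TYPES` are hypothetical, and the set of feature types to discard is an assumption that should be checked against the actual file.

```python
# Assumption: feature types (GFF3 column 3) to discard; adjust for your file.
MATCH_TYPES = frozenset(["match", "cDNA_match"])


def is_match_record(line, match_types=MATCH_TYPES):
    """Return True if a GFF3 line is a feature record of a match type."""
    if not line.strip() or line.startswith("#"):
        return False  # keep blank lines, directives and comments
    columns = line.rstrip("\n").split("\t")
    # Column 3 (index 2) of a GFF3 record holds the feature type.
    return len(columns) >= 3 and columns[2] in match_types


def filter_gff3_lines(lines, discard=is_match_record):
    """Yield only the lines that the pre-filter does not discard."""
    for line in lines:
        if not discard(line):
            yield line
```

The filtered lines could be written to a temporary file and passed to generate_gff3_db.py, which is the same idea as preprocessing with awk. Wiring a `discard` callable into the parser itself, as suggested above, would avoid the extra pass over the file.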