biolink / ontobio

python library for working with ontologies and ontology associations
https://ontobio.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

GPAD output from ontobio has evidence codes instead of ECO class IDs #201

Open dougli1sqrd opened 6 years ago

dougli1sqrd commented 6 years ago
WB  WBGene00011392  involved_in GO:0010466  PMID:9726255|WB_REF:WBPaper00003188 IDA         20090318    WB

This is an example of a GPAD line from wb.gpad.

The evidence column contains the evidence code IDA; it should be an ECO class ID.
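As a rough illustration of the expected behaviour, here is a minimal Python sketch that swaps a GAF evidence code for its ECO class ID in a GPAD line. It is not ontobio's implementation: `EVIDENCE_TO_ECO`, `EVIDENCE_COLUMN`, and `evidence_code_to_eco` are hypothetical names, the mapping lists only two entries (IDA and IBA), and the empty columns in the sample line are assumed, since the pasted line above has lost its tab layout.

```python
# Illustrative, partial mapping from GAF evidence codes to ECO class IDs.
# Only two entries are shown; this is NOT ontobio's own mapping table.
EVIDENCE_TO_ECO = {
    "IDA": "ECO:0000314",
    "IBA": "ECO:0000318",
}

# Assumption: GPAD 1.x layout, where the sixth tab-separated column
# (index 5) holds the evidence.
EVIDENCE_COLUMN = 5


def evidence_code_to_eco(gpad_line: str) -> str:
    """Return the line with a GAF evidence code replaced by its ECO ID."""
    if gpad_line.startswith("!"):
        # Header/comment lines are passed through unchanged.
        return gpad_line
    fields = gpad_line.rstrip("\n").split("\t")
    code = fields[EVIDENCE_COLUMN]
    if not code.startswith("ECO:"):
        # Raises KeyError for codes missing from the partial table above.
        fields[EVIDENCE_COLUMN] = EVIDENCE_TO_ECO[code]
    return "\t".join(fields)


# Sample line rebuilt from the example above; empty columns are assumed.
line = "\t".join([
    "WB", "WBGene00011392", "involved_in", "GO:0010466",
    "PMID:9726255|WB_REF:WBPaper00003188", "IDA",
    "", "", "20090318", "WB", "", "",
])
fixed = evidence_code_to_eco(line)
print(fixed.split("\t")[EVIDENCE_COLUMN])
```

Raising on an unmapped code, rather than writing it through, is a deliberate choice in this sketch so that a missing mapping cannot silently produce an invalid GPAD line.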

dougli1sqrd commented 6 years ago

https://github.com/biolink/ontobio/pull/202

dougli1sqrd commented 6 years ago

The number of distinct gene IDs (column 2) is very close between the GAF and the GPAD:

edouglass@Erics-MBP:~/lbl/geneontology/pipeline[testpypi_master ?]$ curl -L http://skyhook.berkeleybop.org/testpypi_master/annotations/wb.gaf.gz | gzip -dcf | cut -f 2 | sort | uniq | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2043k  100 2043k    0     0   109k      0  0:00:18  0:00:18 --:--:-- 66616
   14090
edouglass@Erics-MBP:~/lbl/geneontology/pipeline[testpypi_master ?]$ curl -L http://skyhook.berkeleybop.org/testpypi_master/annotations/wb.gpad.gz | gzip -dcf | cut -f 2 | sort | uniq | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1493k  100 1493k    0     0   107k      0  0:00:13  0:00:13 --:--:--  193k
   14079

Also, as for IBA vs. ECO:0000318:

edouglass@Erics-MBP:~/lbl/geneontology/pipeline[testpypi_master ?]$ curl -L http://skyhook.berkeleybop.org/testpypi_master/annotations/wb.gpad.gz | gzip -dcf | grep ECO:0000318 | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1493k  100 1493k    0     0   351k      0  0:00:04  0:00:04 --:--:--  363k
   24087

@kltm what do you think?

kltm commented 6 years ago

Looks good from over here:

sjcarbon@moiraine:/tmp$:) zcat wb.gpad.gz | grep -v "^!" | cut -f 2 | sort | uniq | wc
  14078   14078  210682
sjcarbon@moiraine:/tmp$:) zcat wb.gaf.gz | grep -v "^!" | cut -f 2 | sort | uniq | wc
  14078   14078  210682
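For reference, the counting pipelines in this thread can be sketched in Python. The only difference between the two variants is whether header lines starting with `!` are dropped first; because `cut -f 2` prints a line unchanged when it contains no tab, each distinct header line adds one to the count, which may account for the small gap between the first pair of numbers. `count_distinct_ids` is a hypothetical helper and the sample data is made up for illustration.

```python
import gzip
import io


def count_distinct_ids(stream, skip_headers=True):
    """Count distinct column-2 values, like `cut -f 2 | sort | uniq | wc -l`."""
    ids = set()
    for line in stream:
        line = line.rstrip("\n")
        if skip_headers and line.startswith("!"):
            # Equivalent of `grep -v "^!"`.
            continue
        fields = line.split("\t")
        # `cut -f 2` prints the whole line when it contains no delimiter.
        ids.add(fields[1] if len(fields) > 1 else line)
    return len(ids)


# Made-up, gzip-compressed sample standing in for a downloaded wb.gpad.gz.
sample = (
    "!gpa-version: 1.1\n"
    "!generated-by: example\n"
    "WB\tWBGene00000001\tinvolved_in\tGO:0000001\n"
    "WB\tWBGene00000001\tenables\tGO:0000002\n"
    "WB\tWBGene00000002\tinvolved_in\tGO:0000001\n"
)
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    without_headers = count_distinct_ids(fh)

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    with_headers = count_distinct_ids(fh, skip_headers=False)

print(without_headers, with_headers)
```

On this sample the two header lines inflate the unfiltered count by two, so comparing files is only meaningful once headers are filtered out of both.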