biothings / A BioThings API for human variant annotations

docm field formatting #166

Open colleenXu opened 1 year ago

colleenXu commented 1 year ago

When looking at docm data, the `pubmed_id` value is sometimes a list represented as a single string. It appears to be ", "-delimited (both a comma AND a space).

It would be easier to use if it were represented as a list of strings, e.g.: [ "12460918", "23833300", "12068308", "12460919", "21483012", "19010912", "22649091", "19238210" ]

One example:

```
{
  "_id": "chr7:g.140481393T>C",
  "_score": 1,
  "docm": {
    "aa_change": "p.Y472C",
    "all_domains": "pfam_Ser-Thr/Tyr_kinase_cat_dom,pfam_Prot_kinase_dom,pfam_Raf-like_ras-bd,pfam_Prot_Kinase_C-like_PE/DAG-bd,superfamily_Kinase-like_dom,smart_Raf-like_ras-bd,smart_Prot_Kinase_C-like_PE/DAG-bd,smart_Ser/Thr_dual-sp_kinase_dom,smart_Tyr_kinase_cat_dom,pfscan_Raf-like_ras-bd,pfscan_Prot_Kinase_C-like_PE/DAG-bd,pfscan_Prot_kinase_dom,prints_Ser-Thr/Tyr_kinase_cat_dom,prints_DAG/PE-bd",
    "alt": "C",
    "c_position": "c.1415",
    "chrom": 7,
    "default_gene_name": "BRAF",
    "deletion_substructures": "-",
    "disease": "LC",
    "doid": "DOID:1324",
    "domain": "pfam_Ser-Thr/Tyr_kinase_cat_dom,pfam_Prot_kinase_dom,superfamily_Kinase-like_dom,smart_Ser/Thr_dual-sp_kinase_dom,smart_Tyr_kinase_cat_dom,pfscan_Prot_kinase_dom",
    "ensembl_gene_id": "ENSG00000157764",
    "genename": "BRAF",
    "genename_source": "HGNC",
    "hg19": {
      "end": 140481393,
      "start": 140481393
    },
    "primary": 1,
    "pubmed_id": "12460918, 23833300, 12068308, 12460919, 21483012, 19010912, 22649091, 19238210",
    "ref": "T",
    "source": "MyCancerGenome",
    "strand": -1,
    "transcript_error": "no_errors",
    "transcript_name": "ENST00000288602",
    "transcript_source": "ensembl",
    "transcript_species": "human",
    "transcript_status": "known",
    "transcript_version": "74_37",
    "trv_type": "missense",
    "type": "SNP",
    "ucsc_cons": 1,
    "url": ""
  }
}
```
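In the meantime, a client-side workaround could look roughly like the sketch below. It is only an illustration of the requested output shape, not how the docm parser works or should be changed. The helper name `split_pubmed_ids` is made up for this example, and it assumes the value is either already a list, a single ID, or a comma-delimited string like the one above.

```python
def split_pubmed_ids(value):
    """Normalize a docm pubmed_id value to a list of ID strings.

    Hypothetical helper: assumes the value is None, a list/tuple,
    a single ID, or a comma-delimited string such as "123, 456".
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        # Already a list: just coerce each item to a clean string.
        parts = [str(v) for v in value]
    else:
        # Split on the comma and strip the space that follows it,
        # so both "," and ", " delimiters are handled.
        parts = str(value).split(",")
    # Drop empty pieces left by stray or trailing commas.
    return [p.strip() for p in parts if p.strip()]


doc = {"docm": {"pubmed_id": "12460918, 23833300, 12068308"}}
ids = split_pubmed_ids(doc["docm"]["pubmed_id"])
```

Because it splits on the comma and strips whitespace afterwards, the same helper would also work on strings that use a bare "," with no space.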

EDIT: there are other fields that are also a little tricky to parse: