Closed pnrobinson closed 4 years ago
@pnrobinson What should happen with annotations such as the third one in this set:
1618904 672 692 skeletal deformities Disease MESH:D009139
1618904 733 753 respiratory distress Disease MESH:D012128
1618904 561 578 chondrodysplastic Disease
1618904 1686 1703 chondrodysplasias Disease MESH:D010009
1618904 1031 1043 hypertrophic Disease MESH:D006984
which has start and end offsets, the term to be replaced, and the type of concept identified, but no concept identifier? I could certainly replace chondrodysplastic with Disease but I don't think that will result in greater clarity. I have not done any exploration of how frequently this sort of thing occurs. This instance is from the sample file which is only a very small percentage of the total corpus. Perhaps I should just ignore any concept annotation that lacks a unique ID for the concept?
@pnrobinson Another question that has come up: should I include the concept category (one of DNAMutation, CellLine, Chemical, Disease, Gene, ProteinMutation, Species) in the replacement text? For example, in the first annotation above should I replace skeletal deformities with MESH:D009139 or with something like [Disease MESH:D009139]? Both Chemicals and Diseases are identified by MeSH descriptors so it would be hard to distinguish the categories if you are just reading over the abstract after substituting the MeSH ids for English words. It might be a lot easier to keep track of what is going on if I preserve the category of the concept along with its identifier.
Ignoring items without a concept ID seems best!
Also, let's just replace with the concept ID, i.e., MESH:D009139 rather than [Disease MESH:D009139]. This is because downstream the word2vec will tokenise on whitespace and so all of the concepts would have a neighbor such as Disease which would add distance between the original concept and its neighbors.
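For illustration, here is a minimal sketch of the "ignore items without a concept ID" rule. It assumes the offset file is tab-separated with six fields (PMID, start, end, mention, category, concept id). The names `Annotation` and `parse_offset_line` are hypothetical and not taken from the project code. A lone `-` is also treated as missing, since a later example in this thread shows that placeholder in the id field.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    pmid: str
    start: int
    end: int
    mention: str
    category: str
    concept_id: str


def parse_offset_line(line: str) -> Optional[Annotation]:
    """Parse one offset line; return None when it has no usable concept ID."""
    # Assumption: fields are tab-separated, so mentions may contain spaces.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return None  # the concept id column is absent
    pmid, start, end, mention, category, concept_id = fields[:6]
    concept_id = concept_id.strip()
    if concept_id in ("", "-"):
        return None  # empty or placeholder id: skip this annotation
    return Annotation(pmid, int(start), int(end), mention, category, concept_id)
```

With this, the `chondrodysplastic` line above is dropped and the other four annotations are kept. The replacement text would be the bare `concept_id`, without the category.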
@pnrobinson @LCCarmody could you please help me out with the concept ids that do not have a prefix in the pubtator offset file? Here are some examples of the four concept categories (Gene, Species, ProteinMutation, and DNAMutation) that do not have MESH: or similar prefix in the concept id (which is the final field of these lines):
1313814 1416 1419 NGF Gene 310738
1373142 549 555 canine Species 9615
1315314 1323 1337 Thr 588 to Asn ProteinMutation p.T588N
25497491 10707 10711 A53T DNAMutation c.53A>T;RS#:104893877
I guess perhaps ProteinMutation and DNAMutation are so distinctive in their format that no prefix is needed to tell them apart? But to tell the Genes from the Species I have to preface the numeric id with something. Would Entrez: be appropriate for the Genes and NCBI: be appropriate for the Species? Or would you prefer something different? Or do you want me to copy the offset file exactly without adding anything to the concept ids?
The gene prefix should be NCBIGene:310738 The species prefix should be NCBITaxon:9615
Note that the first two prefixes are standard, and there is no standard prefix for the mutation data. For this reason, I think we should just not replace the ProteinMutation and DNAMutation. If it is easier to replace them with something, it would not hurt to use the prefix "HGVS", but this would not add or subtract anything to the accuracy.
I guess I am not entirely sure what the numbers are above (which I guess is the point), but I'd need more context to determine what they are for.
That said, for Genes, I would prefer GeneID:XXXX (or EntrezGeneId:XXXX) because it is more descriptive. Also, Gene ID: XXX is what is used, even if it is not appended (https://www.ncbi.nlm.nih.gov/gene/5594).
One thing, Entrez and NCBI are used somewhat interchangeably so if you gave me NCBI:XXXX, I would have no idea what you were talking about.
I am not sure who assigns the species numbers. The only time I've come across them are in Uniprot.
As for DNA mutation number, I think...it really is the transcript number + the mutation that makes it distinctive.
@pnrobinson are you saying I should ignore any pubtator concept info relating to ProteinMutation or DNAMutation, skip over it entirely? I can do that if you want, I just need to be clear on your instructions. Or did you mean I should replace the abstract's text with whatever Pubtator says is the (Protein- or DNA-mutation) concept id and not worry about any prefix?
I would suggest leaving the ProteinMutation or DNAMutation items as is (do not change the original text for these items).
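A rough sketch of the prefix rules as I understand them from the comments above: NCBIGene for Gene, NCBITaxon for Species, and mutation items left untouched. The function name `qualified_id` and the two constants are made up for this example.

```python
from typing import Optional

# Prefixes suggested in this thread for categories whose ids are bare numbers.
PREFIX_BY_CATEGORY = {"Gene": "NCBIGene", "Species": "NCBITaxon"}

# Categories whose original text should be left unchanged.
SKIPPED_CATEGORIES = {"ProteinMutation", "DNAMutation"}


def qualified_id(category: str, concept_id: str) -> Optional[str]:
    """Return the replacement id, or None when the mention must not be replaced."""
    if category in SKIPPED_CATEGORIES:
        return None
    prefix = PREFIX_BY_CATEGORY.get(category)
    if prefix is not None:
        return f"{prefix}:{concept_id}"
    # Ids such as MESH:D009139 already carry their own prefix.
    return concept_id
```

For the examples above this would give `NCBIGene:310738` for NGF and `NCBITaxon:9615` for canine, and would return `None` for `p.T588N`, so the text "Thr 588 to Asn" stays as is.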
@pnrobinson Turns out the offset file from which the sample is drawn, bioconcepts2pubtatorcentral.offset.gz, has concept categories that do not appear in the sample file. Here's the complete list of categories:
CellLine
Chemical
DNAMutation
Disease
DomainMotif
Gene
Genus
ProteinMutation
SNP
Species
Strain
Of these, DomainMotif, Genus, SNP, and Strain are new. I do not have any examples of these concept categories to show you right now, but I could dig some up if you want to see them. Or perhaps you can decide without examples whether these concept categories require replacement? Or shall I skip over them as I am skipping over DNAMutation and ProteinMutation?
I think that all of the other categories probably have a concept ID (DNAMutation and ProteinMutation do not), although I am not sure about Strain. Could you provide examples? In general though, if they show concept IDs from some terminology we should replace them.
@pnrobinson I was wrong about Genus and Strain, I was counting columns incorrectly and these do not appear in the concept category field. But DomainMotif and SNP do appear as concept categories, for example:
32267872 1968 1972 TEAD DomainMotif Focus:10090|7004;21679
32267872 4921 4925 TEAD DomainMotif Right:9606|7004;21679
32267872 8684 8688 TEAD DomainMotif Focus:10090|7004;21679
32267872 9577 9581 TEAD DomainMotif Left:10090|7004;21679
32267872 9873 9877 TEAD DomainMotif Focus:10090|7004;21679
32267872 9969 9973 TEAD DomainMotif Focus:10090|7004;21679
32267872 10303 10307 TEAD DomainMotif Focus:10090|7004;21679
32245207 39798 39806 Rs465100 SNP Rs465100
32245207 39857 39865 Rs479580 SNP Rs479580
32245207 39641 39649 Rs426380 SNP Rs426380
32273755 16417 16426 rs2228014 SNP rs2228014
32273755 17325 17334 rs2228014 SNP rs2228014
32273755 17482 17491 rs1801157 SNP rs1801157
32273755 17255 17264 rs1801157 SNP rs1801157
The final field on each line would be the string to substitute into the title or abstract of the article (analogous to the MeSH descriptor of a disease concept).
Please advise: do you want me to make the substitution, ignore these concepts, or perhaps concatenate the concept category with the final field before substituting?
I think we should substitute the SNPs as SNP:rs2228014. The "rs" prefix should always be completely lower case; it is strange to see Rs465100, so please transform this to rs465100.
I am less sure of what the numbers are for the domain motif. 10090 is the taxon id for mouse and 9606 for H. sapiens. But I do not recognize the other numbers. I would say though that it might be OK just to substitute DomainMotif:TEAD in cases like this. I do not think that we need to distinguish between species when we are talking about protein domains.
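A small sketch of the SNP rule, assuming every SNP id in the offset file is an rs number. The helper name `snp_token` is hypothetical. Ids that do not look like rs numbers return `None`, so they can be inspected rather than silently rewritten.

```python
import re
from typing import Optional

# An rs number: "rs" in any letter case followed by digits only.
_RS_PATTERN = re.compile(r"^[Rr][Ss](\d+)$")


def snp_token(concept_id: str) -> Optional[str]:
    """Normalise an rs id to lower-case 'rs' and add the SNP: prefix."""
    match = _RS_PATTERN.match(concept_id.strip())
    if match is None:
        return None  # not an rs number; leave the decision to the caller
    return f"SNP:rs{match.group(1)}"
```

So `Rs465100` would become `SNP:rs465100`, and `rs2228014` would become `SNP:rs2228014`.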
No problem for SNPs. What you are suggesting for DomainMotif is to take the original text that occurs in the abstract and preface it with the label DomainMotif:, bypassing entirely what Pubtator claims is the concept. This would make DomainMotif a special case, different from all the other concept categories. Is that what you meant? I took a look in the first 12 million lines of the pubtator offset file to see what text is associated with the DomainMotif category. Most of the DomainMotif occurrences are not in the abstract but in the full text of the article (the offset file includes those concepts also). These occurrences won't show up in my output file because we are only replacing concepts in the abstracts (at least so far). However, I did not look beyond the head of the offset file so perhaps later in the file there are DomainMotif instances in the abstract. The vast majority of DomainMotif occurrences are associated with the text 'TEAD' but there are some others:
31761383 63620 63623 SOD DomainMotif Focus:2698737|20655;24786
27697545 7299 7311 "interferon" DomainMotif Focus:9606|56832
which would give you DomainMotif:SOD and DomainMotif:"interferon" (not sure why this has quotation marks; I assume that's how it appears in the PubMed article). Please confirm this is what you want, thank you.
I do not understand what the numbers mean. For SOD, 20655 is the gene ID for mouse and 24786 is the gene ID for rat (https://pubchem.ncbi.nlm.nih.gov/gene/Sod1/mouse and https://pubchem.ncbi.nlm.nih.gov/gene/24786). I cannot find any reference to 2698737.
The abstract does not appear to have any mention of SOD (https://pubmed.ncbi.nlm.nih.gov/31761383/), so something seems wrong here.
For interferon, 9606 (https://pubmed.ncbi.nlm.nih.gov/27697545/), the abstract does not mention interferon, but the article does (https://www.ncbi.nlm.nih.gov/research/pubtator/?view=docsum&query=27697545). Again, something seems wrong here. I will mail the author.
Decision for now is to skip over DomainMotif concepts, pending clarification from Zhiyong Lu. Will include SNP concepts as outlined above.
Working with new bioconcepts2pubtatorcentral.offset file dated 13 August 2020, which appears to have eliminated most of the incorrect offsets in the previous version. Decided to skip over replacements of NCBITaxon:9606 (human species) because it obscures distinctions of gender and age (men, women, boys, girls, children, patients all get replaced by the same concept id).
@pnrobinson @vidarmehr The new and improved offset file seems much better than the old one, but still has some weirdness. I did a search for MeSH ids that are preceded not by whitespace but by [a-zA-Z]. This search returned 2855 lines from a total of 2628734 lines (about 0.1%) in pubmed_cr.tsv (the output file from concept replacement). That's not the only form of weirdness in the file, but it's a form that is easy to identify. Many of these anomalies seem related to punctuation such as hyphens and quotation marks. I include below a few examples of what goes wrong. For each I have included the title+abstract after concept replacement, the corresponding entry in bioconcepts2pubtatorcentral.offset, and my notes in italics.
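For reference, the search described above can be approximated as follows. This is only a sketch: the pattern assumes MeSH ids look like `MESH:` followed by one capital letter and digits, and the function name `count_glued_lines` is made up here.

```python
import re
from typing import Iterable

# A MeSH id immediately preceded by a letter instead of whitespace.
GLUED_MESH = re.compile(r"[a-zA-Z]MESH:[A-Z]\d+")


def count_glued_lines(lines: Iterable[str]) -> int:
    """Count lines containing at least one MeSH id glued to a preceding word."""
    return sum(1 for line in lines if GLUED_MESH.search(line))


# The proportion quoted above: 2855 of 2628734 lines.
fraction = 2855 / 2628734  # roughly 0.0011, i.e. about 0.1%
```

A pattern of this kind catches `repairMESH:D061325` and `inducedMESH:D003920`. It does not catch an id preceded by a quotation mark or hyphen, which is one reason this search finds only part of the weirdness.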
I will go back to padding each concept id with a space before and after, to at least separate the concept id from surrounding words. The extra spaces won't fix examples where the concept is misidentified, but at least they will eliminate gobbledygook such as repairMESH:D061325 or "MESH:D013256-inducedMESH:D003920.
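The padding idea, sketched under the assumption that each annotation has already been reduced to a `(start, end, concept_id)` tuple with offsets into the title-plus-abstract string. The function name `replace_with_padding` is hypothetical. Replacements are applied from the end of the text backwards so that earlier offsets remain valid, and any doubled spaces are collapsed afterwards.

```python
import re
from typing import Iterable, Tuple


def replace_with_padding(text: str, spans: Iterable[Tuple[int, int, str]]) -> str:
    """Replace each [start, end) span with its concept id, padded by spaces."""
    # Process the right-most span first so earlier offsets are not shifted.
    for start, end, concept_id in sorted(spans, reverse=True):
        text = text[:start] + f" {concept_id} " + text[end:]
    # Collapse runs of spaces created by the padding.
    return re.sub(r" {2,}", " ", text).strip()


title = ("Constitutional mismatch repair-deficiency syndrome: "
         "have we so far seen only the tip of an iceberg?")
padded = replace_with_padding(title, [(30, 50, "MESH:D061325")])
```

On the title of the first example this yields `... mismatch repair MESH:D061325 : have we ...` instead of `repairMESH:D061325:`. The concept is still the wrong one, but it is at least a separate token.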
18709565 2008 Constitutional mismatch repairMESH:D061325: have we so far seen only the tip of an iceberg? Heterozygous mutations in one of the mismatch repair (MMR) genes NCBIGene:4292, NCBIGene:4436, NCBIGene:2956 and NCBIGene:5395 cause the MESH:D009369 termed MESH:D003123 or MESH:D003123. During the past 10 years, some 35 reports have delineated the phenotype of NCBITaxon:9606 with biallelic inheritance of mutations in one of these MMR genes. The NCBITaxon:9606 suffer from a condition that is characterised by the development of MESH:D009369, mainly haematological MESH:D009369 and/or MESH:D001932, as well as early-onset MESH:D015179. Almost all NCBITaxon:9606 also show signs reminiscent of NCBIGene:4763, mainly cafe au lait spots. Alluding to the underlying mechanism, this condition may be termed as "constitutional mismatch repair-MESH:C565027". To give an overview of the current knowledge and its implications of this recessively inherited MESH:D009369 we summarise here the genetic, clinical and pathological findings of the so far 78 reported NCBITaxon:9606 of 46 families suffering from this syndrome.
18709565|t|Constitutional mismatch repair-deficiency syndrome: have we so far seen only the tip of an iceberg? 18709565|a|Heterozygous mutations in one of the mismatch repair (MMR) genes MLH1, MSH2, MSH6 and PMS2 cause the dominant adult cancer syndrome termed Lynch syndrome or hereditary non-polyposis colorectal cancer. During the past 10 years, some 35 reports have delineated the phenotype of patients with biallelic inheritance of mutations in one of these MMR genes. The patients suffer from a condition that is characterised by the development of childhood cancers, mainly haematological malignancies and/or brain tumours, as well as early-onset colorectal cancers. Almost all patients also show signs reminiscent of neurofibromatosis type 1, mainly cafe au lait spots. Alluding to the underlying mechanism, this condition may be termed as "constitutional mismatch repair-deficiency (CMMR-D) syndrome". To give an overview of the current knowledge and its implications of this recessively inherited cancer syndrome we summarise here the genetic, clinical and pathological findings of the so far 78 reported patients of 46 families suffering from this syndrome. 
18709565 30 50 -deficiency syndrome Disease MESH:D061325
18709565 165 169 MLH1 Gene 4292
18709565 171 175 MSH2 Gene 4436
18709565 177 181 MSH6 Gene 2956
18709565 186 190 PMS2 Gene 5395
18709565 201 231 dominant adult cancer syndrome Disease MESH:D009369
18709565 239 253 Lynch syndrome Disease MESH:D003123
18709565 257 299 hereditary non-polyposis colorectal cancer Disease MESH:D003123
18709565 376 384 patients Species 9606
18709565 456 464 patients Species 9606
18709565 533 550 childhood cancers Disease MESH:D009369
18709565 574 586 malignancies Disease MESH:D009369
18709565 594 607 brain tumours Disease MESH:D001932
18709565 632 650 colorectal cancers Disease MESH:D015179
18709565 663 671 patients Species 9606
18709565 703 727 neurofibromatosis type 1 Gene 4763
18709565 858 886 deficiency (CMMR-D) syndrome Disease MESH:C565027
18709565 985 1000 cancer syndrome Disease MESH:D009369
18709565 1093 1101 patients Species 9606
D061325 Hereditary Breast and Ovarian Cancer Syndrome
C565027 Complement Factor D Deficiency (seems to be immunologic deficiency)
no MeSH term for "constitutional mismatch repair-deficiency syndrome"
20186688 2010 Quantification of sequence exchange events between NCBIGene:5395 and NCBIGene:441194 provides a basis for improved mutation scanning of MESH:D003123 NCBITaxon:9606. Heterozygous mutations in NCBIGene:5395 are involved in MESH:D003123, whereas biallelic mutations are found in Constitutional mismatch repairMESH:D061325 NCBITaxon:9606. Mutation detection is complicated by the occurrence of sequence exchange events between the duplicated regions of NCBIGene:5395 and NCBIGene:441194. We investigated the frequency of such events with a nonspecific polymerase chain reaction (PCR) strategy, co-amplifying both NCBIGene:5395 and NCBIGene:441194 sequences. This allowed us to score ratios between gene and pseudogene-specific nucleotides at 29 PSV sites from exon 11 to the end of the gene. We found sequence transfer at all investigated PSVs from intron 12 to the 3' end of the gene in 4 to 52% of DNA samples. Overall, sequence exchange between NCBIGene:5395 and NCBIGene:441194 was observed in 69% (83/120) of individuals. We demonstrate that mutation scanning with NCBIGene:5395-specific PCR primers and MLPA probes, designed on PSVs, in the 3' duplicated region is unreliable, and present an RNA-based mutation detection strategy to improve reliability. Using this strategy, we found 19 different putative pathogenic NCBIGene:5395 mutations. Four of these (21%) are lying in the region with frequent sequence transfer and are missed or called incorrectly as homozygous with several PSV-based mutation detection methods.
20186688|t|Quantification of sequence exchange events between PMS2 and PMS2CL provides a basis for improved mutation scanning of Lynch syndrome patients. 20186688|a|Heterozygous mutations in PMS2 are involved in Lynch syndrome, whereas biallelic mutations are found in Constitutional mismatch repair-deficiency syndrome patients. Mutation detection is complicated by the occurrence of sequence exchange events between the duplicated regions of PMS2 and PMS2CL. We investigated the frequency of such events with a nonspecific polymerase chain reaction (PCR) strategy, co-amplifying both PMS2 and PMS2CL sequences. This allowed us to score ratios between gene and pseudogene-specific nucleotides at 29 PSV sites from exon 11 to the end of the gene. We found sequence transfer at all investigated PSVs from intron 12 to the 3' end of the gene in 4 to 52% of DNA samples. Overall, sequence exchange between PMS2 and PMS2CL was observed in 69% (83/120) of individuals. We demonstrate that mutation scanning with PMS2-specific PCR primers and MLPA probes, designed on PSVs, in the 3' duplicated region is unreliable, and present an RNA-based mutation detection strategy to improve reliability. Using this strategy, we found 19 different putative pathogenic PMS2 mutations. Four of these (21%) are lying in the region with frequent sequence transfer and are missed or called incorrectly as homozygous with several PSV-based mutation detection methods. 
20186688 51 55 PMS2 Gene 5395
20186688 60 66 PMS2CL Gene 441194
20186688 118 132 Lynch syndrome Disease MESH:D003123
20186688 133 141 patients Species 9606
20186688 169 173 PMS2 Gene 5395
20186688 190 204 Lynch syndrome Disease MESH:D003123
20186688 277 297 -deficiency syndrome Disease MESH:D061325
20186688 298 306 patients Species 9606
20186688 422 426 PMS2 Gene 5395
20186688 431 437 PMS2CL Gene 441194
20186688 564 568 PMS2 Gene 5395
20186688 573 579 PMS2CL Gene 441194
20186688 881 885 PMS2 Gene 5395
20186688 890 896 PMS2CL Gene 441194
20186688 985 989 PMS2 Gene 5395
20186688 1229 1233 PMS2 Gene 5395
D061325 Hereditary Breast and Ovarian Cancer Syndrome
12460054 2002 Severe MESH:D006943 after renal transplantation in a pediatric NCBITaxon:9606 with a mutation of the NCBIGene:6928 gene. After renal transplantation for MESH:D052177 of unknown origin, a 14-year-old NCBITaxon:9606, who was previously normoglycemic, had "MESH:D013256-inducedMESH:D003920, which was treated with NCBIGene:3630. Transplant failure from chronic rejection and subsequent transplant nephrectomy allowed discontinuation of corticosteroids, the gradual withdrawal of MESH:D007333. The recent description of MESH:D003928 and a strong paternal family history of early-onset MESH:D003920 prompted genetic screening of the NCBIGene:6928 gene. A novel heterozygous frameshift mutation in exon 1 was identified, adding to the 12 kindreds thus far described. This case highlights the unmasking of the MESH:C535520 in the immediate postoperative period after renal transplantation and emphasizes the pleiotropic manifestations of this important MESH:D030342.
12460054|t|Severe hyperglycemia after renal transplantation in a pediatric patient with a mutation of the hepatocyte nuclear factor-1beta gene. 12460054|a|After renal transplantation for congenital cystic kidney disease of unknown origin, a 14-year-old boy, who was previously normoglycemic, had "steroid-induced" diabetes mellitus, which was treated with insulin. Transplant failure from chronic rejection and subsequent transplant nephrectomy allowed discontinuation of corticosteroids, the gradual withdrawal of insulin and normoglycemia. The recent description of renal cysts and diabetes (RCAD) syndrome and a strong paternal family history of early-onset diabetes mellitus prompted genetic screening of the hepatocyte nuclear factor-1beta gene. A novel heterozygous frameshift mutation in exon 1 was identified, adding to the 12 kindreds thus far described. This case highlights the unmasking of the hyperglycemic component of the RCAD syndrome in the immediate postoperative period after renal transplantation and emphasizes the pleiotropic manifestations of this important genetic kidney disease.
12460054 7 20 hyperglycemia Disease MESH:D006943
12460054 64 71 patient Species 9606
12460054 95 126 hepatocyte nuclear factor-1beta Gene 6928
12460054 165 197 congenital cystic kidney disease Disease MESH:D052177
12460054 231 234 boy Species 9606
12460054 275 282 steroid Chemical MESH:D013256
12460054 290 309 " diabetes mellitus Disease MESH:D003920
12460054 334 341 insulin Gene 3630
12460054 493 518 insulin and normoglycemia Disease MESH:D007333
12460054 546 586 renal cysts and diabetes (RCAD) syndrome Disease MESH:D003928
12460054 639 656 diabetes mellitus Disease MESH:D003920
12460054 691 722 hepatocyte nuclear factor-1beta Gene 6928
12460054 884 928 hyperglycemic component of the RCAD syndrome Disease MESH:C535520
12460054 1059 1081 genetic kidney disease Disease MESH:D030342
26973108 2016 MESH:D004298 agonists rescue Abeta-induced LTP impairment by Src-family tyrosine kinases. Soluble forms of oligomeric amyloid beta (AbetaO) are involved in the loss of synaptic plasticity and memory, especially in early phases of MESH:D000544. Stimulation of dopamine D1/D5 receptors (D1R/D5R) is known to increase surface expression of synaptic alpha-amino-3-hydroxyl-5-methyl-4-isoxazoleMESH:D011422 subtype glutamate and N-methyl-D-aspartate subtype glutamate receptors and facilitates the induction of the late phase of long-term potentiation (LTP), probably via a related mechanism. In this study, we show that the D1/D5R agonist MESH:D015647 protects LTP of hippocampal NCBIGene:759 synapses from the deleterious action of oligomeric amyloid beta. Unexpectedly, the D1R/D5R-mediated recovery of LTP is independent of protein kinase A or phospholipase C pathways. Instead, we found that the inhibition of Src-family tyrosine kinases completely abolished the protective effects of D1R/D5R stimulation in a cellular model of learning and memory.
26973108|t|Dopamine agonists rescue Abeta-induced LTP impairment by Src-family tyrosine kinases. 26973108|a|Soluble forms of oligomeric amyloid beta (AbetaO) are involved in the loss of synaptic plasticity and memory, especially in early phases of Alzheimer's disease. Stimulation of dopamine D1/D5 receptors (D1R/D5R) is known to increase surface expression of synaptic alpha-amino-3-hydroxyl-5-methyl-4-isoxazole-propionate subtype glutamate and N-methyl-D-aspartate subtype glutamate receptors and facilitates the induction of the late phase of long-term potentiation (LTP), probably via a related mechanism. In this study, we show that the D1/D5R agonist SKF38393 protects LTP of hippocampal CA1 synapses from the deleterious action of oligomeric amyloid beta. Unexpectedly, the D1R/D5R-mediated recovery of LTP is independent of protein kinase A or phospholipase C pathways. Instead, we found that the inhibition of Src-family tyrosine kinases completely abolished the protective effects of D1R/D5R stimulation in a cellular model of learning and memory.
26973108 0 8 Dopamine Chemical MESH:D004298
26973108 25 30 Abeta Chemical -
26973108 226 245 Alzheimer's disease Disease MESH:D000544
26973108 392 403 -propionate Chemical MESH:D011422
26973108 637 645 SKF38393 Chemical MESH:D015647
26973108 674 677 CA1 Gene 759
Thanks, that sounds good!
Also, of course, the downstream analysis will split concepts such as MeSH:1234 into two words, MeSH and 1234. Therefore we should replace MeSH:1234 with MeSH_1234.
Not yet complete, but the concept replacement code now handles colons within concept ids (replaced with underscore), semicolons between ids (separated into distinct concepts), and extra whitespace (one space before and after each replacement). DNAMutation, ProteinMutation, and Species 9606 (human) are excluded from replacement.
The concept ids now have no punctuation at all (e.g. NCBIGene9355 or MESHD012516) to make sure the word2vec algorithm sees them as a single token and does not break them into multiple pieces.
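To make the final rules concrete, here is a minimal sketch of how a category and raw id might be turned into punctuation-free tokens. It is not the code from the pull request. The function name `to_tokens` and the constants are hypothetical, and it only covers the rules stated in the last two comments: split on semicolons, strip all punctuation, and exclude the mutation categories and human.

```python
import re
from typing import List

EXCLUDED_CATEGORIES = {"DNAMutation", "ProteinMutation"}
PREFIX_BY_CATEGORY = {"Gene": "NCBIGene", "Species": "NCBITaxon"}
HUMAN_TOKEN = "NCBITaxon9606"  # excluded: hides distinctions of gender and age


def to_tokens(category: str, concept_id: str) -> List[str]:
    """Turn one annotation's id field into zero or more single-word tokens."""
    if category in EXCLUDED_CATEGORIES:
        return []
    prefix = PREFIX_BY_CATEGORY.get(category, "")
    tokens = []
    for part in concept_id.split(";"):  # semicolons separate distinct concepts
        part = part.strip()
        if part in ("", "-"):
            continue
        # Drop every non-alphanumeric character so the id stays one token.
        token = re.sub(r"[^A-Za-z0-9]", "", prefix + part)
        if token != HUMAN_TOKEN:
            tokens.append(token)
    return tokens
```

For example, `to_tokens("Disease", "MESH:D012516")` gives `["MESHD012516"]` and `to_tokens("Gene", "9355")` gives `["NCBIGene9355"]`. Each token would then be written into the text with one space before and after.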
Completed with pull request #31
ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral
Add support to replace English text with the concepts using the offsets provided by PubTator.