MouYongli / MedKGC

MIT License

Meeting about NER #2

Closed JohannHalley closed 6 months ago

JohannHalley commented 7 months ago

NER

  1. TokenClassification https://huggingface.co/docs/transformers/tasks/token_classification

  2. Background report

  3. try NER

  4. view the code of BERN2

Models

run these models

Dataset

run the model, and check the output schema.

JohannHalley commented 7 months ago

today,

download NCBI Disease corpus: training, test, dev

snippet

10021369 Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor . The adenomatous polyposis coli ( APC ) tumour-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and cancer .

10051005 A common MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer . The frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . The risk of colorectal cancer was significantly higher ( p < 0 . 01 ) in males than females ( 0 . 63 v 0 . 30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers .

Input

Input of BERN2

preprocess

# Write input str to a .PubTator format file
with open(input_gnormplus, 'w', encoding='utf-8') as f:
    # only abstract
    f.write(f'{base_name}|t|\n')
    f.write(f'{base_name}|a|{text}\n\n')

plain text

Autophagy maintains tumour growth through circulating arginine. Autophagy captures intracellular components and delivers them to lysosomes, where they are degraded and recycled to sustain metabolism and to enable survival during starvation1-5. Acute, whole-body deletion of the essential autophagy gene Atg7 in adult mice causes a systemic metabolic defect that manifests as starvation intolerance and gradual loss of white adipose tissue, liver glycogen and muscle mass1. Cancer cells also benefit from autophagy.

The input string is written to a PubTator-format file. c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976 is a random string used as the document ID; the title field is left empty.

c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|t|
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|a|Autophagy maintains tumour growth through circulating arginine. Autophagy captures intracellular components and delivers them to lysosomes, where they are degraded and recycled to sustain metabolism and to enable survival during starvation1-5. Acute, whole-body deletion of the essential autophagy gene Atg7 in adult mice causes a systemic metabolic defect that manifests as starvation intolerance and gradual loss of white adipose tissue, liver glycogen and muscle mass1. Cancer cells also benefit from autophagy.
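A minimal sketch of how such an abstract-only input record could be built. The function name `to_pubtator_input` is hypothetical, and the use of a SHA-224 hex digest as the ID is an assumption: it matches the 56-hex-character length of the ID above, but the actual BERN2 code may derive it differently.

```python
import hashlib
from typing import Tuple


def to_pubtator_input(text: str) -> Tuple[str, str]:
    """Return (doc_id, record) for an abstract-only PubTator input record."""
    # Assumption: the ID is a SHA-224 hex digest of the text (56 hex characters).
    doc_id = hashlib.sha224(text.encode("utf-8")).hexdigest()
    # Each field must stay on a single line, so collapse newlines and extra spaces.
    flat = " ".join(text.split())
    # Empty title line, then the abstract line, then a blank line ending the record.
    record = f"{doc_id}|t|\n{doc_id}|a|{flat}\n\n"
    return doc_id, record


doc_id, record = to_pubtator_input(
    "Autophagy maintains tumour growth.\nCancer cells also benefit."
)
print(record)
```

The blank line at the end matters when several records are written to the same file, since it separates one document from the next.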

output

c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|t|
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|a|Autophagy maintains tumour growth through circulating arginine. Autophagy captures intracellular components and delivers them to lysosomes, where they are degraded and recycled to sustain metabolism and to enable survival during starvation1-5. Acute, whole-body deletion of the essential autophagy gene Atg7 in adult mice causes a systemic metabolic defect that manifests as starvation intolerance and gradual loss of white adipose tissue, liver glycogen and muscle mass1. Cancer cells also benefit from autophagy.
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 304 308 Atg7 Gene 10533
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 467 472 mass1 Gene 84059
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 318 322 mice Species 10090
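To check the output schema programmatically, a record like this can be parsed into a title, an abstract, and a list of annotations. This is only a sketch: `parse_pubtator` and the `doc1` sample are made up for illustration, and it assumes the annotation columns are tab-separated in the real file (the spaces shown in this thread are likely a copy-paste artifact).

```python
from typing import Any, Dict


def parse_pubtator(block: str) -> Dict[str, Any]:
    """Parse one PubTator record into id, title, abstract and annotations."""
    doc: Dict[str, Any] = {"id": None, "title": "", "abstract": "", "annotations": []}
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines only separate records
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            # "<id>|t|<title>" or "<id>|a|<abstract>"
            doc["id"] = head[0]
            doc["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        cols = line.split("\t")
        if len(cols) >= 5:
            # "<id> <start> <end> <mention> <type> [<identifier>]"
            doc["annotations"].append({
                "start": int(cols[1]),
                "end": int(cols[2]),
                "mention": cols[3],
                "type": cols[4],
                "identifier": cols[5] if len(cols) > 5 else "",
            })
    return doc


SAMPLE = (
    "doc1|t|\n"
    "doc1|a|Deletion of Atg7 in adult mice.\n"
    "doc1\t13\t17\tAtg7\tGene\t10533\n"
    "doc1\t27\t31\tmice\tSpecies\t10090\n"
)
doc = parse_pubtator(SAMPLE)
for ann in doc["annotations"]:
    print(ann["mention"], ann["type"], ann["identifier"])
```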

JohannHalley commented 6 months ago

NCBI devel

8808605|t|Somatic-cell selection is a major determinant of the blood-cell phenotype in heterozygotes for glucose-6-phosphate dehydrogenase mutations causing severe enzyme deficiency.
8808605|a|X-chromosome inactivation in mammals is regarded as an essentially random process, but the resulting somatic-cell mosaicism creates the opportunity for cell selection. In most people with red-blood-cell glucose-6-phosphate dehydrogenase (G6PD) deficiency, the enzyme-deficient phenotype is only moderately expressed in nucleated cells. However, in a small subset of hemizygous males who suffer from chronic nonspherocytic hemolytic anemia, the underlying mutations (designated class I) cause more-severe G6PD deficiency, and this might provide an opportunity for selection in heterozygous females during development. In order to test this possibility we have analyzed four heterozygotes for class I G6PD mutations two with G6PD Portici (1178G-- > A) and two with G6PD Bari (1187C-- > T). We found that in fractionated blood cell types (including erythroid, myeloid, and lymphoid cell lineages) there was a significant excess of G6PD-normal cells. The significant concordance that we have observed in the degree of imbalance in the different blood-cell lineages indicates that a selective mechanism is likely to operate at the level of pluripotent blood stem cells. Thus, it appears that severe G6PD deficiency affects adversely the proliferation or the survival of nucleated blood cells and that this phenotypic characteristic is critical during hematopoiesis..
8808605 154 171 enzyme deficiency DiseaseClass D008661
8808605 376 427 glucose-6-phosphate dehydrogenase (G6PD) deficiency SpecificDisease D005955
8808605 572 611 chronic nonspherocytic hemolytic anemia SpecificDisease D000746
8808605 677 692 G6PD deficiency SpecificDisease D005955
8808605 1368 1383 G6PD deficiency SpecificDisease D005955

9050866|t|The ataxia-telangiectasia gene product, a constitutively expressed nuclear protein that is not up-regulated following genome damage.
9050866|a|The product of the ataxia-telangiectasia gene (ATM) was identified by using an antiserum developed to a peptide corresponding to the deduced amino acid sequence. The ATM protein is a single, high-molecular weight protein predominantly confined to the nucleus of human fibroblasts, but is present in both nuclear and microsomal fractions from human lymphoblast cells and peripheral blood lymphocytes. ATM protein levels and localization remain constant throughout all stages of the cell cycle. Truncated ATM protein was not detected in lymphoblasts from ataxia-telangiectasia patients homozygous for mutations leading to premature protein termination. Exposure of normal human cells to gamma-irradiation and the radiomimetic drug neocarzinostatin had no effect on ATM protein levels, in contrast to a noted rise in p53 levels over the same time interval. These findings are consistent with a role for the ATM protein in ensuring the fidelity of DNA repair and cell cycle regulation following genome damage..
9050866 4 25 ataxia-telangiectasia Modifier D001260
9050866 152 173 ataxia-telangiectasia Modifier D001260
9050866 686 707 ataxia-telangiectasia Modifier D001260
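For token classification, character-offset annotations like these usually need to be turned into per-token tags. The sketch below is one possible conversion to BIO tags over whitespace tokens. `to_bio` is a hypothetical helper and the example sentence is shortened and made up; only the `4 25` offset of "ataxia-telangiectasia" is taken from the record above.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens with B-/I-/O labels from character spans."""
    rows = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                # The token covering the span start gets B-, later ones get I-.
                tag = ("B-" if m.start() <= start else "I-") + label
                break  # first matching span wins; overlapping spans are not handled
        rows.append((m.group(), tag))
    return rows


# Shortened, made-up text; offsets are relative to title + " " + abstract.
text = "The ataxia-telangiectasia gene product. Severe G6PD deficiency was found."
spans = [(4, 25, "Modifier"), (47, 62, "SpecificDisease")]
for token, tag in to_bio(text, spans):
    print(f"{token}\t{tag}")
```

Whitespace splitting leaves punctuation attached to words (for example "product."), so a real pipeline would need a proper tokenizer before tagging.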

JohannHalley commented 6 months ago

OVERLAPPING ENTITIES

BERN: Donghyeon Kim et al., “A Neural Named Entity Recognition and Multi-Type Normalization Tool for Biomedical Text Mining,” IEEE Access 7 (2019): 73729–40, https://doi.org/10.1109/ACCESS.2019.2920708.
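A small sketch for spotting overlapping entities in a set of character spans. It only detects overlaps and does not reproduce how BERN resolves them. `find_overlaps` is a hypothetical helper, and the `Gene` span for "G6PD" is an invented annotation placed inside the real `376 427` disease span from the NCBI record in this thread.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def find_overlaps(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return all pairs of spans that share at least one character."""
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start, so no later span can overlap a
            overlaps.append((a, b))
    return overlaps


spans = [
    (376, 427, "SpecificDisease"),  # glucose-6-phosphate dehydrogenase (G6PD) deficiency
    (411, 415, "Gene"),             # hypothetical gene annotation for "G6PD"
    (572, 611, "SpecificDisease"),  # chronic nonspherocytic hemolytic anemia
]
for a, b in find_overlaps(spans):
    print(a, "overlaps", b)
```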

JohannHalley commented 6 months ago

format

PubAnnotation JSON

MIMIC-CXR

{'text': 'FINAL REPORT EXAMINATION : CHEST ( PORTABLE AP ) INDICATION : year old woman with SAH / / Fever workup Fever workup IMPRESSION : Compared to chest radiographs . Patient has been extubated . Lungs are clear . Normal cardiomediastinal and hilar silhouettes and pleural surfaces .', 'entities': {'1': {'tokens': 'Lungs', 'label': 'ANAT-DP', 'start_ix': 36, 'end_ix': 36, 'relations': []}, '2': {'tokens': 'clear', 'label': 'OBS-DP', 'start_ix': 38, 'end_ix': 38, 'relations': [['located_at', '1']]}, '3': {'tokens': 'Normal', 'label': 'OBS-DP', 'start_ix': 40, 'end_ix': 40, 'relations': [['located_at', '4'], ['located_at', '5'], ['located_at', '7']]}, '4': {'tokens': 'cardiomediastinal', 'label': 'ANAT-DP', 'start_ix': 41, 'end_ix': 41, 'relations': []}, '5': {'tokens': 'hilar', 'label': 'ANAT-DP', 'start_ix': 43, 'end_ix': 43, 'relations': []}, '6': {'tokens': 'silhouettes', 'label': 'ANAT-DP', 'start_ix': 44, 'end_ix': 44, 'relations': [['modify', '4'], ['modify', '5']]}, '7': {'tokens': 'pleural', 'label': 'ANAT-DP', 'start_ix': 46, 'end_ix': 46, 'relations': []}, '8': {'tokens': 'surfaces', 'label': 'ANAT-DP', 'start_ix': 47, 'end_ix': 47, 'relations': [['modify', '7']]}}, 'data_source': 'MIMIC-CXR', 'data_split': 'train'}
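In this schema each entity stores its relations as `[relation_type, target_entity_id]` pairs, so they can be flattened into (source, relation, target) triples. A minimal sketch, using only the first two entities from the record above; `relation_triples` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple


def relation_triples(entities: Dict[str, Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Flatten per-entity relation lists into (source, relation, target) triples."""
    triples = []
    for entity in entities.values():
        for relation_type, target_id in entity["relations"]:
            # Relations point at other entities by their key in the same dict.
            target = entities[target_id]
            triples.append((entity["tokens"], relation_type, target["tokens"]))
    return triples


entities = {
    "1": {"tokens": "Lungs", "label": "ANAT-DP", "start_ix": 36, "end_ix": 36, "relations": []},
    "2": {"tokens": "clear", "label": "OBS-DP", "start_ix": 38, "end_ix": 38,
          "relations": [["located_at", "1"]]},
}
print(relation_triples(entities))
```

Note that `start_ix` and `end_ix` look like token indices rather than character offsets, so they are not directly comparable to the PubTator spans.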

PubTator format

c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|t|
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc|a|Autophagy maintains tumour growth through circulating arginine. Autophagy captures intracellular components and delivers them to lysosomes, where they are degraded and recycled to sustain metabolism and to enable survival during starvation1-5. Acute, whole-body deletion of the essential autophagy gene Atg7 in adult mice causes a systemic metabolic defect that manifests as starvation intolerance and gradual loss of white adipose tissue, liver glycogen and muscle mass1. Cancer cells also benefit from autophagy.
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 304 308 Atg7 Gene 10533
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 467 472 mass1 Gene 84059
c99e21c6fda4e43e006e4e925a9a4f04fe1f631c976b00ef7a4abecc 318 322 mice Species 10090
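The two formats carry much the same span information, so PubTator-style annotations can be mapped to a PubAnnotation-style JSON object with `text` and `denotations` fields. This is a sketch under assumptions: `to_pubannotation` is a hypothetical helper, the example text and offsets are made up, and the normalized identifiers (e.g. 10533) are dropped here for brevity.

```python
import json
from typing import Any, Dict, List, Tuple

Annotation = Tuple[int, int, str, str]  # (start, end, mention, type)


def to_pubannotation(text: str, annotations: List[Annotation]) -> Dict[str, Any]:
    """Build a PubAnnotation-style dict from character-offset annotations."""
    denotations = []
    for i, (start, end, mention, entity_type) in enumerate(annotations, start=1):
        # Guard against offset drift, e.g. offsets counted on title + abstract.
        if text[start:end] != mention:
            raise ValueError(f"offsets {start}-{end} do not match {mention!r}")
        denotations.append({
            "id": f"T{i}",
            "span": {"begin": start, "end": end},
            "obj": entity_type,
        })
    return {"text": text, "denotations": denotations}


example = to_pubannotation(
    "Deletion of Atg7 in adult mice.",
    [(12, 16, "Atg7", "Gene"), (26, 30, "mice", "Species")],
)
print(json.dumps(example, indent=2))
```

The offset check is worth keeping: PubTator offsets are counted over the title and abstract together, so they shift if only the abstract is passed in as `text`.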

JohannHalley commented 6 months ago

CoNLL_tokenizer

Donghyeon Kim et al., “A Neural Named Entity Recognition and Multi-Type Normalization Tool for Biomedical Text Mining,” IEEE Access 7 (2019): 73729–40, https://doi.org/10.1109/ACCESS.2019.2920708.

Words in a sentence are obtained using a tokenizer on a dataset with labels in CoNLL format [46] and then the sub-words of each word are obtained using the WordPiece tokenizer.
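The two-step tokenization described in the quote can be sketched as follows: words come with CoNLL-style labels, each word is split into sub-words by greedy longest-match WordPiece, and the word label is carried over to the sub-words. The toy vocabulary, the lowercasing, and the choice to label only the first sub-word (marking the rest with "X") are assumptions for illustration, not necessarily what BERN does.

```python
from typing import List, Set, Tuple


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation sub-word marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def align_labels(words: List[str], labels: List[str], vocab: Set[str]) -> List[Tuple[str, str]]:
    """Give the word label to the first sub-word and 'X' to the rest."""
    aligned = []
    for word, label in zip(words, labels):
        for i, piece in enumerate(wordpiece(word.lower(), vocab)):
            aligned.append((piece, label if i == 0 else "X"))
    return aligned


# Toy uncased vocabulary, made up for this example.
vocab = {"severe", "g", "##6", "##pd", "deficiency"}
words = ["Severe", "G6PD", "deficiency"]
labels = ["O", "B-Disease", "I-Disease"]
for piece, label in align_labels(words, labels, vocab):
    print(f"{piece}\t{label}")
```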