clulab / reach

Reach Biomedical Information Extraction

Aliases in special section #124

Open herongrove opened 8 years ago

herongrove commented 8 years ago

Some papers, e.g. PMC2063868 and PMC2034365, have a special section naming their abbreviations. These should be exploited to inform the alias relations throughout the paper.

enoriega commented 8 years ago

Yes, we can read these sections before the rest to build the map of abbreviations. Are these titles (Notes, Footnotes, etc.) attributes of a tag?

On 15/03/2016, at 11:36 a.m., Dane Bell notifications@github.com wrote:

Some papers, e.g. PMC2063868 and PMC2034365, have a special section naming their abbreviations. These should be exploited to inform the alias relations throughout the paper.

The relations are very explicit, so we will benefit from relaxing our conservative requirement that both members be of the same label, and could also use the regularity to capture named entities such as simple chemicals that aren't being caught currently. Is it possible to ensure that these sections are read first? What are the titles that are used to name these sections in the nxml? So far: Notes, Footnotes, Abbreviations used in this paper, List of abbreviations, and one with no title, but interestingly with a footnote id that explains its contents in the nxml, namely . This might also be exploited.
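
A minimal sketch of the "read these sections first" idea, assuming each paper has already been split into (title, text) pairs. The names `ABBREVIATION_TITLES`, `is_abbreviation_title`, and `abbreviation_sections_first` are hypothetical and are not part of Reach; the title list is only the set of titles mentioned so far in this thread.

```python
# Titles seen so far in this thread (lowercased); an assumption, not an exhaustive list.
ABBREVIATION_TITLES = (
    "notes",
    "footnotes",
    "abbreviations used in this paper",
    "list of abbreviations",
)


def is_abbreviation_title(title):
    """True if a section title matches one of the known abbreviation titles."""
    # Untitled sections arrive as None or ""; a trailing colon is ignored.
    normalized = (title or "").strip().rstrip(":").strip().lower()
    return normalized in ABBREVIATION_TITLES


def abbreviation_sections_first(sections):
    """Reorder (title, text) pairs so abbreviation sections come first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(sections, key=lambda sec: not is_abbreviation_title(sec[0]))


if __name__ == "__main__":
    paper = [
        ("Introduction", "..."),
        ("Footnotes", "..."),
        ("Results", "..."),
        ("List of abbreviations", "..."),
    ]
    for title, _ in abbreviation_sections_first(paper):
        print(title)
```

Exact title matching is deliberately strict here, since Notes and Footnotes sections will often hold material other than abbreviations; the untitled case mentioned above would need a separate check on the footnote id.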

herongrove commented 8 years ago

I've seen four strategies:

  1. For Notes and Footnotes, it seems that they have dedicated tags, <notes> in the former case and <fn> in the latter. As far as I've seen, <fn> is always within <fn-group> and sometimes within <notes> as in <notes><fn-group><fn><p>Footnote here</p></fn></fn-group></notes>. Sometimes this includes the <def-list>, <term>, and <def> tags, as in `The abbreviations used are: <def-list><def-item><term>FcϵR</term><def><p>Fcϵ receptor</p></def></def-item>...`
  2. There's also the tag <glossary>, as in <glossary><title>Abbreviations used in this paper</title><def-list><def-item><term>LAT</term><def><p>linker for activation of T cells</p></def></def-item></def-list></glossary>, see PMC2193052
  3. <def-list> is also used without the <glossary> or the <fn> tag, as in <def-list><title>Abbreviations used in this paper:</title><def-item><term>GAP</term><def><p>GTPase-activating protein</p></def>...
  4. Sometimes it's just a <sec> tag with an informative title including the lemma abbreviation or glossary, as in <sec><title>List of abbreviations</title><p>Rb: Retinoblastoma; Cdk: Cyclin-dependent kinase;...
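
The four strategies above can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration under stated assumptions, not how Reach does it: `extract_abbreviations` and `text_of` are hypothetical names, the input is assumed to be a well-formed nxml string without undefined entities, and the strategy 4 paragraphs are assumed to follow the "ABBR: expansion; ABBR: expansion" pattern shown above.

```python
import xml.etree.ElementTree as ET


def text_of(elem):
    """Flatten an element's text content and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def extract_abbreviations(nxml):
    """Return {abbreviation: expansion} for the markup patterns listed above."""
    root = ET.fromstring(nxml)
    aliases = {}

    # Strategies 1-3: a <def-item> pairs a <term> with a <def>, whether it
    # sits inside <fn>, inside <glossary>, or in a bare <def-list>.
    for item in root.iter("def-item"):
        term, definition = item.find("term"), item.find("def")
        if term is not None and definition is not None:
            aliases[text_of(term)] = text_of(definition)

    # Strategy 4: a <sec> whose title mentions abbreviations or a glossary,
    # with entries written as "ABBR: expansion" separated by semicolons.
    for sec in root.iter("sec"):
        title = sec.find("title")
        if title is None:
            continue
        lowered = text_of(title).lower()
        if "abbreviation" not in lowered and "glossar" not in lowered:
            continue
        for p in sec.iter("p"):
            for entry in text_of(p).split(";"):
                abbr, sep, expansion = entry.partition(":")
                if sep and abbr.strip() and expansion.strip():
                    # Keep any pair already found via <def-item>.
                    aliases.setdefault(abbr.strip(), expansion.strip())
    return aliases


if __name__ == "__main__":
    sample = (
        "<article>"
        "<glossary><title>Abbreviations used in this paper</title><def-list>"
        "<def-item><term>LAT</term><def><p>linker for activation of T cells</p></def></def-item>"
        "</def-list></glossary>"
        "<sec><title>List of abbreviations</title>"
        "<p>Rb: Retinoblastoma; Cdk: Cyclin-dependent kinase</p></sec>"
        "</article>"
    )
    print(extract_abbreviations(sample))
```

Searching for `<def-item>` anywhere in the tree covers the first three strategies at once, so only the free-text `<sec>` case needs its own parsing. Footnotes that list abbreviations as plain prose without `<def-list>` are not handled by this sketch.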