Content extraction - GitHub issues

ljgarcia commented 10 years ago

Hi all,

We talked in our last hangout about extracting the structure of the articles so that we can use it to specify where in the text an annotation occurs. However, we did not agree on which content we want to work with. As I mentioned, formulas, tables, figures, and footnotes are tricky, so paragraphs should be the minimum, and maybe table content as well. Will we use an XSLT to extract the content as plain text, that is, removing formatting elements such as italic, bold, etc.?

Cheers, lj

essepuntato commented 10 years ago

Hi all,

Alex, what kinds of structure would you like to start annotating? Are those introduced here enough? What about citations?

Have a nice day :-)

S.

ljgarcia commented 10 years ago

Hi Silvio,

Structure and annotations are two separate but related subjects. For PMC1087847 we have a section titled "Background" containing a paragraph in which we can identify the expression "amino acids", which is associated with the ChEBI term 33709.

From the structure XSLT we would come out with these triplets:

http://rdf.ncbi.nlm.nih.gov/pmc/PMC1087847
    dcterms:hasPart http://rdf.ncbi.nlm.nih.gov/section/pmc_resource/PMC1087847/Background

http://rdf.ncbi.nlm.nih.gov/section/pmc_resource/PMC1087847/Background
    dcterms:isPartOf http://rdf.ncbi.nlm.nih.gov/pmc/PMC1087847
    dcterms:title "Background"
    rdf:type doco:Section
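As a rough illustration of the structure triplets above, here is a minimal Python sketch that writes them out as N-Triples. It is not the structure XSLT itself. The function name `section_triples` and the rule for building the section URI from the title are assumptions made for the example; only the URIs and properties shown above come from this thread.

```python
from urllib.parse import quote

DCTERMS = "http://purl.org/dc/terms/"
DOCO = "http://purl.org/spar/doco/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://rdf.ncbi.nlm.nih.gov/"


def section_triples(pmc_id, title):
    """Return N-Triples lines linking an article and one of its sections."""
    article = f"{BASE}pmc/{pmc_id}"
    # Assumption: the section URI ends with the URL-encoded section title.
    section = f"{BASE}section/pmc_resource/{pmc_id}/{quote(title)}"
    # Escape backslashes and double quotes so the literal stays valid.
    literal = title.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{article}> <{DCTERMS}hasPart> <{section}> .",
        f"<{section}> <{DCTERMS}isPartOf> <{article}> .",
        f'<{section}> <{DCTERMS}title> "{literal}" .',
        f"<{section}> <{RDF_TYPE}> <{DOCO}Section> .",
    ]


if __name__ == "__main__":
    for line in section_triples("PMC1087847", "Background"):
        print(line)
```

The same function could be called for a subsection with its parent section as the subject of hasPart, but that variant is not shown here.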

The content XSLT will not be used to produce RDF, only to extract the text. From it we would get that the section with title "Background" has paragraph1 with plain text "...", paragraph2 with plain text "...", etc. If there are subsections, it gets a bit more complicated, as we want to know that the paragraph is in section "Background-SectionA", which is inside "Background". We also want to have that section-subsection relation in the RDF by means of isPartOf/hasPart.
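To make the content extraction concrete, here is a minimal sketch in Python rather than XSLT. It walks nested sections, numbers the paragraphs within each one, and drops inline formatting such as italic and bold. The sample XML and the function name `extract_paragraphs` are made up for illustration, and the element names (`sec`, `title`, `p`) are assumed to follow the usual PMC article markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from PMC1087847.
SAMPLE = """<body>
  <sec><title>Background</title>
    <p>Proteins are chains of <italic>amino acids</italic>.</p>
    <sec><title>SectionA</title>
      <p>A <bold>nested</bold> paragraph.</p>
    </sec>
  </sec>
</body>"""


def extract_paragraphs(parent, path=()):
    """Yield (section_path, paragraph_number, plain_text) for each paragraph."""
    for sec in parent.findall("sec"):
        title_el = sec.find("title")
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        sec_path = path + (title,)
        # Only direct <p> children belong to this section.
        for number, p in enumerate(sec.findall("p"), start=1):
            # itertext() drops the inline tags; split/join normalizes whitespace.
            text = " ".join("".join(p.itertext()).split())
            yield sec_path, number, text
        # Recurse so subsections carry the path of their parent sections.
        yield from extract_paragraphs(sec, sec_path)


if __name__ == "__main__":
    for sec_path, number, text in extract_paragraphs(ET.fromstring(SAMPLE)):
        print("-".join(sec_path), f"paragraph{number}", text)
```

Keeping the section path as a tuple means the section-subsection relation is still available when the isPartOf/hasPart statements are generated.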

Then we annotate the text and get these triplets for the "amino acids" example. Right now the triplets use the Annotation Ontology, but we would move to Open Annotation.

http://rdf.ncbi.nlm.nih.gov/annotation/pmc_resource/PMC1087847/ncbo-plus-an-id
    rdf:type aot:ExactQualifier
    ao:hasTopic chebi:33709
    ao:context
    rdf:type biotea:ElementSelector
    dcterms:references http://rdf.ncbi.nlm.nih.gov/section/pmc_resource/PMC1087847/Background
    ao:onresource http://rdf.ncbi.nlm.nih.gov/pmc/PMC1087847
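As a sketch of how an annotation could be tied back to the extracted structure, the snippet below searches plain-text paragraphs for an expression and reports the section, paragraph number, and character offset of each hit. This is plain string matching for illustration, not the annotator that produces the ChEBI mapping. The function name `find_mentions` and the `(section_path, paragraph_number, text)` input shape are assumptions.

```python
def find_mentions(paragraphs, term):
    """Yield (section_path, paragraph_number, offset) for each occurrence of term."""
    needle = term.lower()
    for section_path, number, text in paragraphs:
        # Case-insensitive match; offsets assume lowercasing keeps the
        # string length unchanged, which holds for ASCII text.
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            yield section_path, number, start
            start = haystack.find(needle, start + len(needle))


if __name__ == "__main__":
    # Hypothetical paragraph, not the actual text of PMC1087847.
    paragraphs = [(("Background",), 1, "Proteins are chains of amino acids.")]
    for section_path, number, offset in find_mentions(paragraphs, "amino acids"):
        print("-".join(section_path), f"paragraph{number}", offset)
```

The section path found this way is what the dcterms:references statement in the annotation would point to.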

Does it clarify the structure/annotation approach? Please let us know should you have further questions.

Thanks, Leyla Jael


Klortho / eutils-org

Content extraction #26