HW-SWeL / BMUSE

Bioschemas Mark Up Scraper and Extractor
https://app.swaggerhub.com/apis-docs/swel/BMUSE/
Apache License 2.0

rdfa doesn't scrape well #11

Open kcmcleod opened 5 years ago

kcmcleod commented 5 years ago

When properties are nested, the inner properties are pulled out to form their own triples, which leaves the value of the outer property looking rather messy. E.g., from https://www.uniprot.org/uniprot/Q62226 :

<div class="annotation" property="hasPart" typeof="CreativeWork">
<span property="text">Sonic hedgehog protein: The C-terminal part of the sonic hedgehog protein precursor displays an autoproteolysis and a cholesterol transferase activity (PubMed:
<a href="/citations/8824192">8824192</a>, PubMed:
<a href="/citations/7891723">7891723</a>). Both activities result in the cleavage of the full-length protein into two parts (ShhN and ShhC) followed by the covalent attachment of a cholesterol moiety to the C-terminal of the newly generated ShhN (PubMed:
<a href="/citations/8824192">8824192</a>). Both activities occur in the reticulum endoplasmic (PubMed:
<a href="/citations/21357747">21357747</a>). Once cleaved, ShhC is degraded in the endoplasmic reticulum (PubMed:
<a href="/citations/21357747">21357747</a>).
<span class="attribution ECO305">
<span class="attributionHeader ">1 Publication
<span class="showHideEvidence caret_grey displayThisInline"></span>
    </span>
    <span style="display:none" class="evidenceContainer">
<p class="attributionExplain">
<span class="context-help tooltipped-click html tipId-1">
<span style="display:none">
<span class="toolTipContent">&#xd; &lt;p>Manually curated information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article.&lt;/p>&#xd; &lt;p>&lt;a href="/manual/evidences#ECO:0000305">More...&lt;/a>&lt;/p>&#xd;
</span>
    </span>Manual assertion inferred by curator from
    <sup>i</sup>
    </span>
    </p>
    <ul>
        <li>
            <div class="Q62226#ref18 referenceAttribution">
                <div class="reference_header">Ref.18</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/21357747" typeof="ScholarlyArticle">
                        <strong property="name">"Processing and turnover of the Hedgehog protein in the endoplasmic reticulum."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Chen+X.%22&amp;sort=score" rel="nofollow">Chen X.</a>,
                        <a href="/uniprot/?query=author:%22Tukachinsky+H.%22&amp;sort=score" rel="nofollow">Tukachinsky H.</a>,
                        <a href="/uniprot/?query=author:%22Huang+C.H.%22&amp;sort=score" rel="nofollow">Huang C.H.</a>,
                        <a href="/uniprot/?query=author:%22Jao+C.%22&amp;sort=score" rel="nofollow">Jao C.</a>,
                        <a href="/uniprot/?query=author:%22Chu+Y.R.%22&amp;sort=score" rel="nofollow">Chu Y.R.</a>,
                        <a href="/uniprot/?query=author:%22Tang+H.Y.%22&amp;sort=score" rel="nofollow">Tang H.Y.</a>,
                        <a href="/uniprot/?query=author:%22Mueller+B.%22&amp;sort=score" rel="nofollow">Mueller B.</a>,
                        <a href="/uniprot/?query=author:%22Schulman+S.%22&amp;sort=score" rel="nofollow">Schulman S.</a>,
                        <a href="/uniprot/?query=author:%22Rapoport+T.A.%22&amp;sort=score" rel="nofollow">Rapoport T.A.</a>,
                        <a href="/uniprot/?query=author:%22Salic+A.%22&amp;sort=score" rel="nofollow">Salic A.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1083/jcb.201008090">J. Cell Biol. 192:825-838(2011)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/21357747">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/21357747">Europe PMC</a>] [
                        <a href="/citations/21357747">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> REVIEW, FUNCTION.
                    </div>
                </div>
            </div>
        </li>
    </ul>
    </span>
    </span>
    <span class="attribution ECO269">
<span class="attributionHeader ">3 Publications
<span class="showHideEvidence caret_grey displayThisInline"></span>
    </span>
    <span style="display:none" class="evidenceContainer">
<p class="attributionExplain">
<span class="context-help tooltipped-click html tipId-2">
<span style="display:none">
<span class="toolTipContent">&#xd; &lt;p>Manually curated information for which there is published experimental evidence.&lt;/p>&#xd;
&lt;p>&lt;a href="/manual/evidences#ECO:0000269">More...&lt;/a>&lt;/p>&#xd;
</span>
    </span>Manual assertion based on experiment in
    <sup>i</sup>
    </span>
    </p>
    <ul>
        <li>
            <div class="Q62226#ref6 referenceAttribution">
                <div class="reference_header">Ref.6</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/7891723" typeof="ScholarlyArticle">
                        <strong property="name">"Proteolytic processing yields two secreted forms of sonic hedgehog."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Bumcrot+D.A.%22&amp;sort=score" rel="nofollow">Bumcrot D.A.</a>,
                        <a href="/uniprot/?query=author:%22Takada+R.%22&amp;sort=score" rel="nofollow">Takada R.</a>,
                        <a href="/uniprot/?query=author:%22McMahon+A.P.%22&amp;sort=score" rel="nofollow">McMahon A.P.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1128/MCB.15.4.2294">Mol. Cell. Biol. 15:2294-2303(1995)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/7891723">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/7891723">Europe PMC</a>] [
                        <a href="/citations/7891723">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> PROTEOLYTIC PROCESSING, GLYCOSYLATION, SUBCELLULAR LOCATION.
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="Q62226#ref7 referenceAttribution">
                <div class="reference_header">Ref.7</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/7736596" typeof="ScholarlyArticle">
                        <strong property="name">"Floor plate and motor neuron induction by different concentrations of the amino-terminal cleavage product of sonic hedgehog autoproteolysis."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Roelink+H.%22&amp;sort=score" rel="nofollow">Roelink H.</a>,
                        <a href="/uniprot/?query=author:%22Porter+J.A.%22&amp;sort=score" rel="nofollow">Porter J.A.</a>,
                        <a href="/uniprot/?query=author:%22Chiang+C.%22&amp;sort=score" rel="nofollow">Chiang C.</a>,
                        <a href="/uniprot/?query=author:%22Tanabe+Y.%22&amp;sort=score" rel="nofollow">Tanabe Y.</a>,
                        <a href="/uniprot/?query=author:%22Chang+D.T.%22&amp;sort=score" rel="nofollow">Chang D.T.</a>,
                        <a href="/uniprot/?query=author:%22Beachy+P.A.%22&amp;sort=score" rel="nofollow">Beachy P.A.</a>,
                        <a href="/uniprot/?query=author:%22Jessell+T.M.%22&amp;sort=score" rel="nofollow">Jessell T.M.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1016/0092-8674(95)90397-6">Cell 81:445-455(1995)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/7736596">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/7736596">Europe PMC</a>] [
                        <a href="/citations/7736596">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
<strong>Cited for:</strong>
</span> FUNCTION, PROTEOLYTIC PROCESSING, AUTOCATALYTIC CLEAVAGE.
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="Q62226#ref8 referenceAttribution">
                <div class="reference_header">Ref.8</div>
                <div class="reference_content">
                    <div property="citation" resource="http://purl.uniprot.org/citations/8824192" typeof="ScholarlyArticle">
                        <strong property="name">"Cholesterol modification of hedgehog signaling proteins in animal development."</strong>
                        <br/>
                        <a href="/uniprot/?query=author:%22Porter+J.A.%22&amp;sort=score" rel="nofollow">Porter J.A.</a>,
                        <a href="/uniprot/?query=author:%22Young+K.E.%22&amp;sort=score" rel="nofollow">Young K.E.</a>,
                        <a href="/uniprot/?query=author:%22Beachy+P.A.%22&amp;sort=score" rel="nofollow">Beachy P.A.</a>
                        <br/>
                        <a href="http://dx.doi.org/10.1126/science.274.5285.255">Science 274:255-259(1996)</a> [
                        <a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/8824192">PubMed</a>] [
                        <a property="sameAs" href="https://europepmc.org/abstract/MED/8824192">Europe PMC</a>] [
                        <a href="/citations/8824192">Abstract</a>]
                    </div>
                    <div class="citedFor">
                        <span class="details">
                                                                <strong>Cited for:</strong>
                                                            </span> CHOLESTERYLATION AT GLY-198, FUNCTION.
                    </div>
                </div>
            </div>
        </li>
    </ul>
    </span>
    </span>
    </span>
</div>

The triple representing the text property (in the 2nd line) ends up as:

http://bioschemas.org/crawl/v1/28/www.uniprot.org/uniprot/Q62226/781026336 http://schema.org/text  "Sonic hedgehog protein: The C-terminal part of the sonic hedgehog protein precursor displays an autoproteolysis and a cholesterol transferase activity (PubMed:8824192, PubMed:7891723). Both activities result in the cleavage of the full-length protein into two parts (ShhN and ShhC) followed by the covalent attachment of a cholesterol moiety to the C-terminal of the newly generated ShhN (PubMed:8824192). Both activities occur in the reticulum endoplasmic (PubMed:21357747). Once cleaved, ShhC is degraded in the endoplasmic reticulum (PubMed:21357747).1 Publication <p>Manually curated information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article.</p> <p><a href="/manual/evidences#ECO:0000305">More...</a></p> Manual assertion inferred by curator fromi

              Ref.18

               "Processing and turnover of the Hedgehog protein in the endoplasmic reticulum."

               , 
               , 
               , 
               , 
               , 
               , 
               , 
               , 
               , 

                [
               ] [
               ] [
               ]

               Cited for: REVIEW, FUNCTION.

          3 Publications <p>Manually curated information for which there is published experimental evidence.</p> <p><a href="/manual/evidences#ECO:0000269">More...</a></p> Manual assertion based on experiment ini

              Ref.6

               "Proteolytic processing yields two secreted forms of sonic hedgehog."

               , 
               , 

                [
               ] [
               ] [
               ]

               Cited for: PROTEOLYTIC PROCESSING, GLYCOSYLATION, SUBCELLULAR LOCATION.

              Ref.7

               "Floor plate and motor neuron induction by different concentrations of the amino-terminal cleavage product of sonic hedgehog autoproteolysis."

               , 
               , 
               , 
               , 
               , 
               , 

                [
               ] [
               ] [
               ]

               Cited for: FUNCTION, PROTEOLYTIC PROCESSING, AUTOCATALYTIC CLEAVAGE.

              Ref.8

               "Cholesterol modification of hedgehog signaling proteins in animal development."

               , 
               , 

                [
               ] [
               ] [
               ]

               Cited for: CHOLESTERYLATION AT GLY-198, FUNCTION.

          "
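For context, as far as I understand RDFa 1.1, a plain literal is built by concatenating every descendant text node of the element carrying `property`, so everything nested inside the `text` span (attribution headers, tooltips, references) lands in the one value. Below is a minimal sketch of that flattening using only the Python standard library. The shortened snippet and the names `TextContent`, `text_content` and `collapse` are made up for illustration. This is not BMUSE or Any23 code, and it does not reproduce Any23 dropping the link text.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collects every text node in document order (like DOM textContent)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_content(html):
    """Return the concatenation of all text nodes in the fragment."""
    parser = TextContent()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def collapse(text):
    """Collapse runs of whitespace so the literal is easier to read."""
    return " ".join(text.split())


# Shortened, hypothetical version of the markup quoted above.
snippet = '''<span property="text">ShhC is degraded (PubMed:
<a href="/citations/21357747">21357747</a>).
<span class="attributionHeader">1 Publication</span>
</span>'''

raw = text_content(snippet)
print(repr(raw))
print(collapse(raw))
```

The nested "1 Publication" header ends up in the value even though it is not really part of the annotation text. Collapsing whitespace only makes the literal easier on the eye; it does not remove the unwanted nested content.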

Google SDT Tool

Keeps the text that Any23 removes; however, the result is still not easy to read and has odd bits in it. Better than Any23, though.

Screenshot 2019-05-06 at 15 10 24

Extruct

Behaves in the same way as Google.

kcmcleod commented 5 years ago

Similar effect with hasPart. This time the issue is that the markup creates nodes with no content.

This HTML:

<div class="annotation" property="hasPart" typeof="CreativeWork">Belongs to the 
  <a href="/uniprot/?query=family:%22hedgehog+family%22&amp;sort=score">hedgehog family</a>.
  <span class="attribution ECO305">
    <span class="attributionHeader tooltipped" title="Manual assertion inferred by 
    curator">Curated
    </span>
  </span>
</div>

Produces the following raw triples:

genid-2f27fdee3aaf4285a4db8253476df489-n61  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/CreativeWork .
http://purl.uniprot.org/uniprot/Q62226  http://schema.org/hasPart  genid-2f27fdee3aaf4285a4db8253476df489-n61 .

I convert to:

http://bioschemas.org/crawl/v1/30/www.uniprot.org/uniprot/Q62226/1168557303  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/CreativeWork .
http://purl.uniprot.org/uniprot/Q62226  http://schema.org/hasPart  http://bioschemas.org/crawl/v1/30/www.uniprot.org/uniprot/Q62226/1168557303 .

Thus we no longer have blank nodes, BUT we do have nodes with basically no information. On this single page there seem to be more than 10 instances of this. Ultimately this produces a very cluttered and not very useful page.
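One possible way to spot these nodes after scraping is to look for subjects whose only outgoing predicate is rdf:type. A rough sketch in plain Python, with triples held as (subject, predicate, object) string tuples. The helper names `type_only_nodes` and `prune` are hypothetical and not part of BMUSE, and whether such nodes should actually be dropped is a separate decision:

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def type_only_nodes(triples):
    """Return subjects that carry nothing except an rdf:type statement."""
    predicates = defaultdict(set)
    for subject, predicate, _ in triples:
        predicates[subject].add(predicate)
    return sorted(s for s, ps in predicates.items() if ps == {RDF_TYPE})


def prune(triples, nodes):
    """Drop every triple that has one of the given nodes as subject or object."""
    unwanted = set(nodes)
    return [t for t in triples if t[0] not in unwanted and t[2] not in unwanted]


# The two converted triples quoted above.
node = "http://bioschemas.org/crawl/v1/30/www.uniprot.org/uniprot/Q62226/1168557303"
triples = [
    (node, RDF_TYPE, "http://schema.org/CreativeWork"),
    ("http://purl.uniprot.org/uniprot/Q62226", "http://schema.org/hasPart", node),
]

empty = type_only_nodes(triples)
print(empty)
print(prune(triples, empty))
```

For the two triples above, the CreativeWork node is flagged, and pruning it leaves nothing behind, which shows how little information this markup carries.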

Google SDT Tool

Same result:

Screenshot 2019-05-06 at 14 42 30

kcmcleod commented 5 years ago

Difference between Any23 & Google

HTML source:

<div property="hasPart" class="annotation">
   <ul class="noNumbering subcellLocations">
      <li class="Nucleus">
         <h6>Nucleus</h6>
         <ul>
            <li>
               <a href="/locations/SL-0191">Nucleus </a><a class="icon icon-generic tooltipped" data-tippy="The nucleus is the most obvious organelle in any eukaryotic cell. It is a membrane-bound organelle surrounded by double membranes which contains most of the cell's genetic material. It communicates with the surrounding cytosol via numerous nuclear pores." data-icon="i"></a> 
               <span class="attribution ECO269">
                  <span class="attributionHeader ">1 Publication<span class="showHideEvidence caret_grey displayThisInline"></span></span>
                  <span style="display:none" class="evidenceContainer">
                     <p class="attributionExplain"><span class="context-help tooltipped-click html tipId-1">Manual assertion based on experiment in<sup>i</sup></span></p>
                     <ul>
                        <li>
                           <div class="Q8K330#ref1 referenceAttribution">
                              <div class="reference_header">Ref.1</div>
                              <div class="reference_content">
                                 <div property="citation" resource="http://purl.uniprot.org/citations/14531860" typeof="ScholarlyArticle"><strong property="name">"Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."</strong><br/><a href="/uniprot/?query=author:%22Ohta+Y.%22&amp;sort=score" rel="nofollow">Ohta Y.</a>, <a href="/uniprot/?query=author:%22Kousaka+K.%22&amp;sort=score" rel="nofollow">Kousaka K.</a>, <a href="/uniprot/?query=author:%22Nagata-Ohashi+K.%22&amp;sort=score" rel="nofollow">Nagata-Ohashi K.</a>, <a href="/uniprot/?query=author:%22Ohashi+K.%22&amp;sort=score" rel="nofollow">Ohashi K.</a>, <a href="/uniprot/?query=author:%22Muramoto+A.%22&amp;sort=score" rel="nofollow">Muramoto A.</a>, <a href="/uniprot/?query=author:%22Shima+Y.%22&amp;sort=score" rel="nofollow">Shima Y.</a>, <a href="/uniprot/?query=author:%22Niwa+R.%22&amp;sort=score" rel="nofollow">Niwa R.</a>, <a href="/uniprot/?query=author:%22Uemura+T.%22&amp;sort=score" rel="nofollow">Uemura T.</a>, <a href="/uniprot/?query=author:%22Mizuno+K.%22&amp;sort=score" rel="nofollow">Mizuno K.</a><br/><a href="http://dx.doi.org/10.1046/j.1365-2443.2003.00678.x">Genes Cells 8:811-824(2003)</a>  [<a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/14531860">PubMed</a>] [<a property="sameAs" href="https://europepmc.org/abstract/MED/14531860">Europe PMC</a>] [<a href="/citations/14531860">Abstract</a>]</div>
                                 <div class="citedFor"><span class="details"><strong>Cited for:</strong></span> NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.</div>
                              </div>
                           </div>
                        </li>
                     </ul>
                  </span>
               </span>
            </li>
         </ul>
      </li>
      <li class="Cytoskeleton">
         <h6>Cytoskeleton</h6>
         <ul>
            <li>
               <a href="/locations/SL-0090">cytoskeleton </a><a class="icon icon-generic tooltipped" data-tippy="The cytoskeleton is a dynamic three-dimensional structure that fills the cytoplasm of cells. The cytoskeleton is responsible for cell movement, cytokinesis, and the organization of the organelles or organelle-like structures within the cell. The major components of the cytoskeleton are the microfilaments (of actin), microtubules (of tubulin), the intermediate filament systems and a fourth group, the MinD-ParA group, that appears to be unique to bacteria." data-icon="i"></a> 
               <span class="attribution ECO269">
                  <span class="attributionHeader ">1 Publication<span class="showHideEvidence caret_grey displayThisInline"></span></span>
                  <span style="display:none" class="evidenceContainer">
                     <p class="attributionExplain"><span class="context-help tooltipped-click html tipId-1">Manual assertion based on experiment in<sup>i</sup></span></p>
                     <ul>
                        <li>
                           <div class="Q8K330#ref1 referenceAttribution">
                              <div class="reference_header">Ref.1</div>
                              <div class="reference_content">
                                 <div property="citation" resource="http://purl.uniprot.org/citations/14531860" typeof="ScholarlyArticle"><strong property="name">"Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."</strong><br/><a href="/uniprot/?query=author:%22Ohta+Y.%22&amp;sort=score" rel="nofollow">Ohta Y.</a>, <a href="/uniprot/?query=author:%22Kousaka+K.%22&amp;sort=score" rel="nofollow">Kousaka K.</a>, <a href="/uniprot/?query=author:%22Nagata-Ohashi+K.%22&amp;sort=score" rel="nofollow">Nagata-Ohashi K.</a>, <a href="/uniprot/?query=author:%22Ohashi+K.%22&amp;sort=score" rel="nofollow">Ohashi K.</a>, <a href="/uniprot/?query=author:%22Muramoto+A.%22&amp;sort=score" rel="nofollow">Muramoto A.</a>, <a href="/uniprot/?query=author:%22Shima+Y.%22&amp;sort=score" rel="nofollow">Shima Y.</a>, <a href="/uniprot/?query=author:%22Niwa+R.%22&amp;sort=score" rel="nofollow">Niwa R.</a>, <a href="/uniprot/?query=author:%22Uemura+T.%22&amp;sort=score" rel="nofollow">Uemura T.</a>, <a href="/uniprot/?query=author:%22Mizuno+K.%22&amp;sort=score" rel="nofollow">Mizuno K.</a><br/><a href="http://dx.doi.org/10.1046/j.1365-2443.2003.00678.x">Genes Cells 8:811-824(2003)</a>  [<a property="sameAs" href="https://www.ncbi.nlm.nih.gov/pubmed/14531860">PubMed</a>] [<a property="sameAs" href="https://europepmc.org/abstract/MED/14531860">Europe PMC</a>] [<a href="/citations/14531860">Abstract</a>]</div>
                                 <div class="citedFor"><span class="details"><strong>Cited for:</strong></span> NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.</div>
                              </div>
                           </div>
                        </li>
                     </ul>
                  </span>
               </span>
            </li>
         </ul>
      </li>
   </ul>
</div>

Output from Google:

Screenshot 2019-05-09 at 09 18 16

To view this on Google: https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fwww.uniprot.org%2Funiprot%2FQ8K330

Triple produced by Any23:

http://purl.uniprot.org/uniprot/Q8K330  http://schema.org/hasPart  

           Cytoskeleton

             cytoskeleton  1 PublicationManual assertion based on experiment ini

                    Ref.1

                     "Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."

                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 

                      [
                     ] [
                     ] [
                     ]

                     Cited for: NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.

           Nucleus

             Nucleus  1 PublicationManual assertion based on experiment ini

                    Ref.1

                     "Differential activities, subcellular distribution and tissue expression patterns of three members of Slingshot family phosphatases that dephosphorylate cofilin."

                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 
                     , 

                      [
                     ] [
                     ] [
                     ]

                     Cited for: NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), FUNCTION, SUBCELLULAR LOCATION, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, MUTAGENESIS OF CYS-410.

Notice that the order in the HTML is Nucleus then Cytoskeleton, which is the order Google has too. HOWEVER, Any23 reverses the order. Furthermore, notice how much of the text found by Google is not detected by Any23.

ALSO notice that much of the text inside the HTML is completely missing from both the Google and the Any23 output. E.g., the HTML says "The cytoskeleton is a dynamic three-dimensional structure that fills the cytoplasm of cells", but this is missing from both Google and Any23.
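A likely explanation, judging from the HTML source above: that sentence sits inside a `data-tippy` attribute rather than in a text node, and as far as I know RDFa literals are built from text nodes (or a `content` attribute), not from arbitrary `data-*` attributes. A small sketch with the Python standard library that separates the two. The class name `TextAndTooltips` and the shortened snippet are illustrative assumptions only:

```python
from html.parser import HTMLParser


class TextAndTooltips(HTMLParser):
    """Keeps text nodes and data-tippy attribute values in separate lists."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.tooltips = []

    def handle_starttag(self, tag, attrs):
        # Attribute values never show up as text nodes.
        for name, value in attrs:
            if name == "data-tippy" and value:
                self.tooltips.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Shortened version of the Cytoskeleton list item quoted above.
snippet = (
    '<li><a href="/locations/SL-0090">cytoskeleton </a>'
    '<a class="icon icon-generic tooltipped" '
    'data-tippy="The cytoskeleton is a dynamic three-dimensional structure '
    'that fills the cytoplasm of cells." data-icon="i"></a></li>'
)

parser = TextAndTooltips()
parser.feed(snippet)
parser.close()
print(parser.text)
print(parser.tooltips)
```

Only "cytoskeleton" is a text node; the description is reachable only by reading the attribute, so a scraper would need a site-specific rule to capture it.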