Norconex / committer-neo4j

Implementation of Norconex Committer for Neo4j.
https://opensource.norconex.com/committers/neo4j/
Apache License 2.0

Is it possible to store property in Relationships? #4

Closed LeMoussel closed 4 years ago

LeMoussel commented 4 years ago

In HTML, an anchor tag is effectively a relationship, and I want to store its class attribute (<a class="content-link" href="http://example.com">).

The relationships tag defines relationships between nodes. How can I set class as a property on the Neo4j relationship it creates?

sylvainroussy commented 4 years ago

Hi! No, it's not possible for now. I've tagged it for a future release.

LeMoussel commented 4 years ago

Hi Sylvain,

I'm French, so it would be easier to discuss this in French.

Any idea when the next release including this will be out? In the meantime, is a workaround possible? For example, a relationship between each pair of nodes, with the type attribute set to the value of each relationship property?

sylvainroussy commented 4 years ago

Since this site is mostly read by English speakers, I'll keep making the effort to write in English so that everyone can follow, even though my English is far from perfect.

The next release is 2.0, available soon, but without this enhancement. Unfortunately, I can't say right now when this feature will be added. Is it critical for you?

LeMoussel commented 4 years ago

Same here, my English is far from perfect too.

Yes. As part of an R&D project, this is essential for me. Given the power of Neo4j, I think this is an important feature. If need be, I can help with testing.

sylvainroussy commented 4 years ago

It is possible to get this behaviour with two processes (or with one, at the cost of a less readable configuration).

In the second configuration, you could configure the crawler with the following parts:

<importer>
  <preParseHandlers>
    <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
        selector="a"
        parser="html"/>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <restrictTo caseSensitive="false" field="document.reference">
        .*#.*
      </restrictTo>
      <dom selector="a" toField="link_class" extract="attr(class)"/>
      <dom selector="a" toField="link_url" extract="attr(href)"/>
      <dom selector="a" toField="link_target" extract="attr(target)"/>
      <dom selector="a" toField="link_text" extract="ownText"/>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
        onConflict="replace">
      <restrictTo caseSensitive="false" field="document.reference">
        .*#.*
      </restrictTo>
      <constant name="TYPE">LINK</constant>
    </tagger>
  </preParseHandlers>
  <postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter" onMatch="include">
      <regex>
        .*#.*
      </regex>
    </filter>
  </postParseHandlers>
</importer>

And for the relationships configuration:

<relationships>
  <relationship type="TO_PAGE" direction="OUTGOING" targetFindSyntax="MERGE">
    <sourcePropertyKey label="LINK">link_url</sourcePropertyKey>
    <targetPropertyKey label="Page">identity</targetPropertyKey>
  </relationship>
  <relationship type="FROM_PAGE" direction="OUTGOING" targetFindSyntax="MERGE">
    <sourcePropertyKey label="LINK">link_url</sourcePropertyKey>
    <targetPropertyKey label="Page">collector.referrer-reference</targetPropertyKey>
  </relationship>
</relationships>

The main idea is to split each document on every <a> tag, so that each link becomes its own LINK node carrying the extracted attributes. The result looks like: (:Page)<-[:FROM_PAGE]-(:LINK)-[:TO_PAGE]->(:Page)
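Before cleaning anything up, it can help to look at a few of these intermediate LINK nodes. The query below is only a sketch: it assumes the committer stores the tagged fields (link_class, link_text) as properties on the LINK nodes under the same names, and that Page nodes carry an identity property, as the relationships configuration suggests.

```cypher
// Sample the Page -> LINK -> Page pattern produced by the split.
// Property names are assumptions taken from the configuration above.
MATCH (p1:Page)<-[:FROM_PAGE]-(link:LINK)-[:TO_PAGE]->(p2:Page)
RETURN p1.identity     AS sourcePage,
       link.link_class AS linkClass,
       link.link_text  AS linkText,
       p2.identity     AS targetPage
LIMIT 10
```

If this returns no rows, the LINK nodes or one of the two relationships were probably not created, and the cleanup query would have nothing to match.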

Then, if you want to clean up your graph and remove the LINK nodes, you can run the following Cypher query:

MATCH (p1:Page)<-[rFrom:FROM_PAGE]-(link:LINK)-[rTo:TO_PAGE]->(p2:Page)
MERGE (p1)-[r:LINKED_TO]->(p2)
SET r += link
DETACH DELETE link
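One thing to be aware of: MERGE keeps at most one LINKED_TO relationship per pair of pages, so when a page links to the same target several times with different classes, the properties of the last LINK processed overwrite the earlier ones. If one relationship per anchor is wanted instead, a possible variant (a sketch, not tested against this committer's output) uses CREATE and copies the properties explicitly with properties():

```cypher
// Variant: one LINKED_TO relationship per LINK node instead of one per page pair.
MATCH (p1:Page)<-[:FROM_PAGE]-(link:LINK)-[:TO_PAGE]->(p2:Page)
CREATE (p1)-[r:LINKED_TO]->(p2)
// Copy every property of the LINK node onto the new relationship.
SET r = properties(link)
// Remove the intermediate node together with its FROM_PAGE / TO_PAGE relationships.
DETACH DELETE link
```

Unlike the MERGE version, this one is not idempotent: running it again after a re-crawl that recreates the LINK nodes would add duplicate LINKED_TO relationships.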

Does that help you?

LeMoussel commented 4 years ago

Thanks for your help. Very well explained. I will test.