fivesmallq / web-data-extractor

Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web format like HTML, XML and JSON.
https://fivesmallq.github.io/web-data-extractor
Apache License 2.0

Embeddable model support #22

Open fivesmallq opened 8 years ago

ptyagi108 commented 8 years ago

This works only when the type of Config is known at compile time; it may not work when the Config implementation is chosen at runtime. For example, if the field Config in class Activity is an interface or abstract superclass that is implemented or extended by two classes, say Config1 and Config2, and the implementing class is chosen at runtime based on some condition, this solution fails. Config1 and Config2 can also have additional properties that are not available in the superclass Config; those cannot be populated because the actual class is not known at compile time.

There should be some way to specify the actual implementing class (in this case Config1 or Config2), something like:

    List<Activity> activities = Extractors.on(base5Xml)
            .split(xpath("//ProcessDefinition/activity").removeNamespace())
            .extract("name", xpath("//activity/@name"))
            .extract("type", xpath("//activity/type/text()"))
            .extract("resourceType", xpath("//activity/resourceType/text()"))
            // proposed: an extractor that returns a bean instead of a String
            .extract("config", new EntityExtractor<Config>() {
                @Override
                public Config extract(String data) {
                    return Extractors.on(data)
                            .extract("encoding", xpath("//activity/config/encoding/text()"))
                            .extract("pollInterval", xpath("//activity/config/pollInterval/text()"))
                            .asBean(Config1.class);
                }
            })
            .asBeanList(Activity.class);

Here Config is an abstract class, and Config1 and Config2 extend Config as below:

    public abstract class Config {
        // common options
        protected String encoding;
    }

    public class Config1 extends Config {
        // consumer options
        private String pollInterval;
        private String createEvent;
        private String modifyEvent;
        private String deleteEvent;
        private String mode;
        private String sortby;
        private String sortorder;
    }

    public class Config2 extends Config {
        // producer options
        private String compressFile;
    }
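
Why the declared type is not enough can be shown without the library. The following is a minimal sketch, assuming a mapper that creates beans reflectively from a field's declared type (an assumption for illustration, not a statement about this library's internals). `BeanFactory` is a hypothetical helper, and the class fields are trimmed.

```java
import java.lang.reflect.Modifier;

// Trimmed copies of the hierarchy described in this comment.
abstract class Config {
    protected String encoding;
}

class Config1 extends Config {
    String pollInterval;
}

class Config2 extends Config {
    String compressFile;
}

// Hypothetical helper (not part of web-data-extractor): creates a bean from a Class object.
class BeanFactory {
    static <T> T create(Class<T> type) {
        // An abstract class or interface cannot be instantiated, so a declared
        // field type of Config carries too little information to build the bean.
        if (Modifier.isAbstract(type.getModifiers())) {
            throw new IllegalArgumentException(
                    type.getSimpleName() + " is abstract; a concrete subclass must be named");
        }
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}

class BeanFactoryDemo {
    public static void main(String[] args) {
        // Naming the concrete class works and exposes the subclass-only property.
        Config2 config = BeanFactory.create(Config2.class);
        config.compressFile = "None";
        System.out.println(config.getClass().getSimpleName());
    }
}
```

Calling `BeanFactory.create(Config.class)` in this sketch throws, which is the situation the comment describes: the concrete class has to come from somewhere other than the field declaration.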

XML

The XML contains two activities (activity 1 and activity 2). Activity 1 will be assigned Config2 and activity 2 will be assigned Config1, based on pd:resourceType:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pd:ProcessDefinition xmlns:pd="http://xmlns.tibco.com/bw/process/2003" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                      xmlns:pfx="http://www.tibco.com/namespaces/tnt/plugins/file">
    <pd:name>Processes/Simple Process.process</pd:name>
    <pd:startName>File Poller</pd:startName>
    <pd:startX>0</pd:startX>
    <pd:startY>0</pd:startY>
    <pd:returnBindings/>
    <pd:starter name="File Poller">
        <pd:type>com.tibco.plugin.file.FileEventSource</pd:type>
        <pd:resourceType>ae.activities.FileEventSourceResource</pd:resourceType>
        <pd:x>245</pd:x>
        <pd:y>96</pd:y>
        <config>
            <pollInterval>5</pollInterval>
            <createEvent>true</createEvent>
            <modifyEvent>true</modifyEvent>
            <deleteEvent>true</deleteEvent>
            <mode>files-and-directories</mode>
            <encoding>text</encoding>
            <sortby>File Name</sortby>
            <sortorder>descending</sortorder>
            <fileName>C:\Projects\SampleProject\Input\inputData.xml</fileName>
        </config>
        <pd:inputBindings/>
    </pd:starter>
    <pd:endName>End</pd:endName>
    <pd:endX>540</pd:endX>
    <pd:endY>97</pd:endY>
    <pd:errorSchemas/>
    <pd:processVariables/>
    <pd:targetNamespace>http://xmlns.example.com/1465977202414</pd:targetNamespace>
    <pd:activity name="activity 1">
        <pd:type>com.tibco.plugin.file.FileWriteActivity</pd:type>
        <pd:resourceType>ae.activities.FileWriteActivity</pd:resourceType>
        <pd:x>387</pd:x>
        <pd:y>104</pd:y>
        <config>
            <encoding>text</encoding>
            <compressFile>None</compressFile>
        </config>
        <pd:inputBindings>
            <pfx:WriteActivityInputTextClass>
                <fileName>
                    <xsl:value-of select="$_globalVariables/ns:GlobalVariables/GlobalVariables/OutputLocation"/>
                </fileName>
                <textContent>
                    <xsl:value-of select="$File-Poller/pfx:EventSourceOuputTextClass/fileContent/textContent"/>
                </textContent>
            </pfx:WriteActivityInputTextClass>
        </pd:inputBindings>
    </pd:activity>
    <pd:activity name="activity 2">
        <pd:type>com.tibco.plugin.file.FileEventSource</pd:type>
        <pd:resourceType>ae.activities.FileEventSourceResource</pd:resourceType>
        <pd:x>240</pd:x>
        <pd:y>90</pd:y>
        <config>
            <pollInterval>50</pollInterval>
            <createEvent>false</createEvent>
            <modifyEvent>true</modifyEvent>
            <deleteEvent>true</deleteEvent>
            <mode>files-and-directories</mode>
            <encoding>text</encoding>
            <sortby>File Name</sortby>
            <sortorder>descending</sortorder>
            <fileName>C:\Projects\SampleProject\output\outputData.xml</fileName>
        </config>
        <pd:inputBindings/>
    </pd:activity>
    <pd:transition>
        <pd:from>Output</pd:from>
        <pd:to>End</pd:to>
        <pd:lineType>Default</pd:lineType>
        <pd:lineColor>-16777216</pd:lineColor>
        <pd:conditionType>always</pd:conditionType>
    </pd:transition>
    <pd:transition>
        <pd:from>File Poller</pd:from>
        <pd:to>Output</pd:to>
        <pd:lineType>Default</pd:lineType>
        <pd:lineColor>-16777216</pd:lineColor>
        <pd:conditionType>always</pd:conditionType>
    </pd:transition>
</pd:ProcessDefinition>
```

Suppose I have this requirement:
(1) Config2 is assigned to activity 1 if resourceType is ae.activities.FileWriteActivity
(2) Config1 is assigned to activity 2 if resourceType is ae.activities.FileEventSourceResource

I would like to approach it something like this...

```java
// Pseudocode: the condition is meant to be evaluated per activity.
// EntityExtractor is the proposed bean-returning extractor; it is not in the library yet.

// Config2 is assigned to activity 1 because resourceType is ae.activities.FileWriteActivity
if (extract("resourceType", xpath("//activity/resourceType/text()")).asString()
        .equals("ae.activities.FileWriteActivity")) {
    List<Activity> activities = Extractors.on(base5Xml)
            .split(xpath("//ProcessDefinition/activity").removeNamespace())
            .extract("name", xpath("//activity/@name"))
            .extract("type", xpath("//activity/type/text()"))
            .extract("resourceType", xpath("//activity/resourceType/text()"))
            .extract("config", new EntityExtractor<Config>() {
                @Override
                public Config extract(String data) {
                    return Extractors.on(data)
                            .extract("encoding", xpath("//activity/config/encoding/text()"))
                            .extract("compressFile", xpath("//activity/config/compressFile/text()"))
                            .asBean(Config2.class);
                }
            })
            .asBeanList(Activity.class);
}
// Config1 is assigned to activity 2 because resourceType is ae.activities.FileEventSourceResource
else if (extract("resourceType", xpath("//activity/resourceType/text()")).asString()
        .equals("ae.activities.FileEventSourceResource")) {
    List<Activity> activities = Extractors.on(base5Xml)
            .split(xpath("//ProcessDefinition/activity").removeNamespace())
            .extract("name", xpath("//activity/@name"))
            .extract("type", xpath("//activity/type/text()"))
            .extract("resourceType", xpath("//activity/resourceType/text()"))
            .extract("config", new EntityExtractor<Config>() {
                @Override
                public Config extract(String data) {
                    return Extractors.on(data)
                            .extract("encoding", xpath("//activity/config/encoding/text()"))
                            .extract("pollInterval", xpath("//activity/config/pollInterval/text()"))
                            .asBean(Config1.class);
                }
            })
            .asBeanList(Activity.class);
}
```

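
The per-activity choice in the requirement comes down to a lookup from resourceType to a concrete Config class. A minimal plain-Java sketch of that lookup, independent of the library: `ConfigRegistry` and `ConfigRegistryDemo` are hypothetical names, and the Config classes are trimmed copies of the ones in this comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

abstract class Config {
    protected String encoding;
}

class Config1 extends Config {
    String pollInterval;
}

class Config2 extends Config {
    String compressFile;
}

// Hypothetical registry: maps a resourceType value to a factory for the matching subclass.
class ConfigRegistry {
    private final Map<String, Supplier<? extends Config>> factories = new HashMap<>();

    ConfigRegistry register(String resourceType, Supplier<? extends Config> factory) {
        factories.put(resourceType, factory);
        return this;
    }

    Config create(String resourceType) {
        Supplier<? extends Config> factory = factories.get(resourceType);
        if (factory == null) {
            throw new IllegalArgumentException("no Config registered for " + resourceType);
        }
        return factory.get();
    }
}

class ConfigRegistryDemo {
    // The two rules from the requirement, expressed as registrations.
    static ConfigRegistry defaults() {
        return new ConfigRegistry()
                .register("ae.activities.FileWriteActivity", Config2::new)
                .register("ae.activities.FileEventSourceResource", Config1::new);
    }

    public static void main(String[] args) {
        Config config = defaults().create("ae.activities.FileWriteActivity");
        System.out.println(config.getClass().getSimpleName());
    }
}
```

A bean-returning extractor like the proposed EntityExtractor could read resourceType from its input and consult such a registry, so a single extraction chain would cover both activities instead of one chain per branch.
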
fivesmallq commented 8 years ago

@ptyagi108 Maybe you can use a filter to handle this.

                .extract("config.pollInterval", xpath("//activity/config/pollInterval/text()"))
                // if pollInterval is null, set it to the default "5"
                .filter(value -> value == null ? "5" : value)
                .extract("config.compressFile", xpath("//activity/config/compressFile/text()"))

https://github.com/fivesmallq/web-data-extractor/blob/master/src/test/java/im/nll/data/extractor/ExtractorsTest.java#L513

Or you can set a default value on the config field?
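
Both suggestions can be sketched in plain Java. This is only an illustration: `UnaryOperator<String>` stands in for the library's filter type, and `Defaults` and `PollingConfig` are hypothetical names.

```java
import java.util.function.UnaryOperator;

class Defaults {
    // Option 1: a filter that substitutes a fallback when nothing was extracted.
    static UnaryOperator<String> orDefault(String fallback) {
        return value -> (value == null || value.isEmpty()) ? fallback : value;
    }
}

// Option 2: the default lives on the field and is kept unless a value is set.
class PollingConfig {
    private String pollInterval = "5";

    String getPollInterval() {
        return pollInterval;
    }

    void setPollInterval(String pollInterval) {
        // Ignore missing values so the field default survives.
        if (pollInterval != null) {
            this.pollInterval = pollInterval;
        }
    }
}

class DefaultsDemo {
    public static void main(String[] args) {
        System.out.println(Defaults.orDefault("5").apply(null));
        System.out.println(new PollingConfig().getPollInterval());
    }
}
```

Either option only fills in missing values for a field that already exists on the bean; it does not choose between subclasses.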

ptyagi108 commented 8 years ago

Please check: I have updated my comments. A filter may not work.

ptyagi108 commented 8 years ago

Please see my updated comments. The basic issue is how to handle polymorphism with this library.

fivesmallq commented 8 years ago

@ptyagi108 OK, I will think about it.

ptyagi108 commented 8 years ago

Currently I have made some changes to the library locally to get my case working. Please suggest how I can make this implementation better.

    public interface Extractor<T> {
        T extract(String data);
    }

im.nll.data.extractor.Extractors#extractBean

    private <T> T extractBean(String html, Class<T> clazz) {
        // only support String type
        if (clazz.equals(String.class)) {
            return (T) new String(html);
        }
        T entity = Reflect.on(clazz).create().get();
        for (Map.Entry<String, List<Extractor>> one : extractorsMap.entrySet()) {
            String name = one.getKey();
            List<Extractor> extractors = one.getValue();
            String result = html;
            boolean beanSet = false;
            for (Extractor extractor : extractors) {
                // call extract() once per extractor and reuse the result
                Object value = extractor.extract(result);
                if (!(value instanceof String)) {
                    // a non-String result is a nested bean: set it directly and
                    // continue with the remaining fields
                    Reflect.on(entity).set(name, value);
                    beanSet = true;
                    break;
                }
                result = (String) value;
            }
            if (beanSet) {
                continue;
            }
            result = filterBefore(result);
            result = filter(name, result);
            result = filterAfter(result);
            try {
                Reflect.on(entity).set(name, result);
            } catch (Exception e) {
                LOGGER.error("convert to bean error! can't set '{}' with '{}'", name, result, e);
            }
        }
        return entity;
    }

How to use it.

    @Test
    public void testToBeanListByXPath() throws Exception {
        List<Language> languages = Extractors.on(listHtml).split(xpath("//tr[@class='item']"))
                .extract("type", xpath("//td[1]/text()"))
                .extract("name", xpath("//td[2]/text()"))
                .extract("url", xpath("//td[3]/text()"))
                .extract("book", new Extractor<Book>() {
                    @Override
                    public Book extract(String data) {
                        return Extractors.on(data)
                                .extract("category", xpath("//td[2]/text()"))
                                .extract("author", xpath("//td[3]/text()"))
                                .asBean(Book.class);
                    }
                })
                .asBeanList(Language.class);
        Assert.assertNotNull(languages);
        Language second = languages.get(1);
        Assert.assertEquals(languages.size(), 3);
        Assert.assertEquals(second.getType(), "dynamic");
        Assert.assertEquals(second.getName(), "Ruby");
        Assert.assertEquals(second.getUrl(), "https://www.ruby-lang.org");
    }

    public class Language {
        private String type;
        private String name;
        private String url;
        private Book book;
    }

fivesmallq commented 8 years ago

@ptyagi108 Maybe I should add a method called extractBean to Extractors that sets the extracted bean on the field?

ptyagi108 commented 8 years ago

Yes, this would be a better solution.
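
One possible shape for such a method, as a plain-Java sketch only: the real signature was not settled in this thread, so `BeanSetter`, its methods, and the use of `Function` as the extractor type are all hypothetical. `Book` and `Language` are trimmed copies of the test classes.

```java
import java.lang.reflect.Field;
import java.util.function.Function;

class Book {
    String category;
}

class Language {
    String name;
    Book book;
}

// Hypothetical helper; it only illustrates "extract a bean and set it on a field".
class BeanSetter {
    // Sets a declared field by name using reflection.
    static void set(Object target, String fieldName, Object value) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            field.set(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot set field '" + fieldName + "'", e);
        }
    }

    // Runs a bean-producing extractor on the data and stores the bean on the target field.
    static <T> void extractBean(Object target, String fieldName, String data,
                                Function<String, T> extractor) {
        set(target, fieldName, extractor.apply(data));
    }
}

class BeanSetterDemo {
    public static void main(String[] args) {
        Language language = new Language();
        BeanSetter.extractBean(language, "book", "programming", data -> {
            Book book = new Book();
            book.category = data;
            return book;
        });
        System.out.println(language.book.category);
    }
}
```

Keeping bean extraction in its own method would leave the String-based extract and filter path untouched, which avoids the instanceof check inside the extractor loop.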