Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

Facebook crawler on events: Elasticsearch committer commits the data as one array/block #11

Closed Songi221 closed 7 years ago

Songi221 commented 7 years ago

I'm using the Norconex crawler on the Facebook Graph API /events/ endpoint and it crawls the data down, but when it commits to Elasticsearch, Kibana sees the data as one block, so it cannot "index" it.

As far as I know it should put each element in one by one, but instead it puts them in as big arrays of elements, and Kibana cannot identify the fields.

I attach an image to show it (screen shot 2017-04-24 at 12 15 05).

essiembre commented 7 years ago

Can you query Elasticsearch directly and find out if the documents are stored properly there? If the events are stored as one block, it may be because you received them all from Facebook in one block. In that case, you probably need to "split" the events into individual documents. This is done by implementing IDocumentSplitter. Some existing implementations (and config usage) can be found here.

Songi221 commented 7 years ago

You are right, the splitter isn't good. I tried to modify your FacebookDocumentSplitter (which works well, by the way), but if I comment out the isPosts check and change the parse values, it still commits everything as one document; with posts it commits each one separately. Can you help me figure out what I am doing wrong when I want to use it on events? Thanks in advance. Here is the modification I wanted to make:

package com.norconex.blog.facebook.crawler;

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.stream.XMLStreamException;

import org.apache.commons.configuration.XMLConfiguration;
import org.apache.commons.lang3.StringUtils;

import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import com.norconex.commons.lang.io.CachedStreamFactory;
import com.norconex.commons.lang.xml.EnhancedXMLStreamWriter;
import com.norconex.importer.doc.ImporterDocument;
import com.norconex.importer.doc.ImporterMetadata;
import com.norconex.importer.handler.ImporterHandlerException;
import com.norconex.importer.handler.splitter.AbstractDocumentSplitter;
import com.norconex.importer.handler.splitter.SplittableDocument;

public class FacebookDocumentSplitter extends AbstractDocumentSplitter {

    public FacebookDocumentSplitter() {
    }

    protected List<ImporterDocument> splitApplicableDocument(
            SplittableDocument parentDoc, OutputStream output,
            CachedStreamFactory streamFactory, boolean parsed)
            throws ImporterHandlerException {

        // First, we make sure only Facebook posts are split here by returning
        // null on non-post references.
       /* if (!FacebookUtils.isPosts(parentDoc.getReference())) {
            return null;
        } */

        List<ImporterDocument> postDocs = new ArrayList<>();
        ImporterMetadata parentMeta = parentDoc.getMetadata();

        JsonReader jsonReader = new JsonReader(parentDoc.getReader());
        jsonReader.setLenient(true);

        JsonObject json = null;
        try {
            json = (JsonObject) new JsonParser().parse(jsonReader);
        } catch (ClassCastException e) {
            throw new ImporterHandlerException("Cannot parse JSON input.", e);
        }

        // Each top-level "data" element is a single document/post.
        JsonArray postsData = json.get("data").getAsJsonArray();
        Iterator<JsonElement> it = postsData.iterator();
        while (it.hasNext()) {
            JsonObject postData = (JsonObject) it.next();
            try {
                ImporterDocument doc = createImportDocument(
                        postData, parentDoc.getReference(), parentMeta,
                        streamFactory);
                if (doc != null) {
                    postDocs.add(doc);
                }
            } catch (IOException e) {
                throw new ImporterHandlerException(e);
            }
        }
        return postDocs;
    }

    protected ImporterDocument createImportDocument(
            JsonObject json, String parentRef, ImporterMetadata parentMeta,
            CachedStreamFactory streamFactory) 
                    throws IOException {

        if (json == null) {
            return null;
        }

        // METADATA
        ImporterMetadata childMeta = new ImporterMetadata();

        // Optionally assign values from master JSON document
//        childMeta.load(parentMeta);

        // Parse and assign any values you need for a child
        //attending_count,start_time,end_time,category,name,place,description
        String id = getString(json, "id");
        childMeta.setString("id", id);
        childMeta.setString("attending_count", getString(json, "attending_count"));
        childMeta.setString("start_time", getString(json, "start_time"));
        childMeta.setString("end_time", getString(json, "end_time"));
        //childMeta.setString("category", getString(json, "category"));
        childMeta.setString("place", getString(json, "place"));
        childMeta.setString("name", getString(json, "name"));
        //childMeta.setString("description", getString(json, "description"));
        //childMeta.setString("created_time", getString(json, "created_time"));

        JsonObject from = (JsonObject) json.get("from");
        String fromName = getString(from, "name");
        childMeta.setString("from_name", fromName);
        childMeta.setString("from_id", getString(from, "id"));

        // Consider "message" as the document "content".
        String content = getString(json, "description");
        if (StringUtils.isBlank(content)) {
            return null;
        }

        // Create a unique reference for this child element.  Let's make it
        // the URL to access this post in a browser.
        // e.g. https://www.facebook.com/Disney/posts/10152684762485954
        String ref = "https://www.facebook.com/" + fromName + "/events/" 
                + StringUtils.substringAfter(id, "_");
        childMeta.setString(ImporterMetadata.DOC_REFERENCE, ref);

        // Set parent reference if you need it, and optionally remove 
        // the access token from it to keep it clean
//        String parentReference = parentRef.replaceFirst(
//                "(.*)&access_token=.*", "$1");
//        childMeta.setEmbeddedParentReference(parentReference);
//        childMeta.setEmbeddedReference(ref);

        // We gathered enough data for a single doc, return it
        return new ImporterDocument(
                ref, streamFactory.newInputStream(content), childMeta);
    }

    // Convenience method for getting a string for a JSON object.
    private String getString(JsonObject jsonObject, String memberName) {
        if (jsonObject == null) {
            return null;
        }
        JsonElement element = jsonObject.get(memberName);
        if (element != null) {
            return element.getAsString();
        }
        return null;
    }

    @Override
    protected void loadHandlerFromXML(XMLConfiguration xml) throws IOException {
        // nothing extra to load
    }

    @Override
    protected void saveHandlerToXML(EnhancedXMLStreamWriter writer)
            throws XMLStreamException {
        // nothing extra to save
    }
}
Songi221 commented 7 years ago

Can you point out my mistake, please, or show me an easy way to debug it? Thanks.

essiembre commented 7 years ago

Not sure what your issue is. You say it works if you keep if (!FacebookUtils.isPosts(parentDoc.getReference())) {? Why can't you keep it?

Songi221 commented 7 years ago

Commenting that out was just an idea; I thought that if it's an event, isPosts might return false.

If I give it a URL pointing to a page's posts, then it splits them well. But when I try it with the modified field setup (which would get the name, place, etc. from the events) on a page's events, it gives everything back as one document, not splitting anything.

Is there a way to debug it easily? Thanks.


essiembre commented 7 years ago

I see... for events, it should be the same thing. Look at these lines:

JsonArray postsData = json.get("data").getAsJsonArray();
Iterator<JsonElement> it = postsData.iterator();

It expects a JSON response with a data element in it, which is an array of posts. In your case, check whether you have the equivalent for an array of events. The postData in your case should contain an event. If they are not split, this is where you will see it. You can run it in debug mode if you use an IDE. Otherwise, you can add debugging statements to see what is happening.
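
For reference, the response shape those lines expect looks roughly like this (a sketch; the values are invented, and the event fields are the ones your own code reads):

    {
      "data": [
        { "id": "1311554328897174", "name": "Event A", "start_time": "2017-05-01T20:00:00+0200" },
        { "id": "1311554328897175", "name": "Event B", "start_time": "2017-05-02T20:00:00+0200" }
      ]
    }

If your events come back wrapped differently (for example nested one level deeper), json.get("data") will not return the array the loop expects.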

For instance, one thing you can do is print the JSON content received so you can analyse it and see how it differs from when you are using it with posts. Adding something like this could help:

System.out.println(IOUtils.toString(parentDoc.getReader()));

Songi221 commented 7 years ago

Okay, so I made progress. I now see it looks the same as the posts, but it has more attributes, and, importantly, Facebook stores events differently from posts. While posts are reachable this way: https://www.facebook.com/Disney/posts/10152684762485954, events don't have the page-id/events form; they are reachable like this: https://www.facebook.com/events/1311554328897174/

my parsing looks like this:

    String id = getString(json, "id");
    childMeta.setString("id", id);
    //childMeta.setString("attending_count", getString(json, "attending_count"));
    childMeta.setString("start_time", getString(json, "start_time"));
    childMeta.setString("end_time", getString(json, "end_time"));
    //childMeta.setString("category", getString(json, "category"));
    //childMeta.setString("place", getString(json, "place"));
    childMeta.setString("name", getString(json, "name"));
    childMeta.setString("description", getString(json, "description"));
    //childMeta.setString("created_time", getString(json, "created_time"));

    //JsonObject from = (JsonObject) json.get("from");
    String fromName = "budapestpark";//getString(from, "name");
   //childMeta.setString("from_name", fromName);
    //childMeta.setString("from_id", getString(from, "id"));

    // Consider "message" as the document "content".
    //String content = getString(json, "name");
    String content = getString(json, "name");
    if (StringUtils.isBlank(content)) {
        return null;
    }

    // Create a unique reference for this child element.  Let's make it
    // the URL to access this post in a browser.
    // e.g. https://www.facebook.com/Disney/posts/10152684762485954
    String ref = "https://www.facebook.com/events/" /*+ fromName + "/events/" */
            + /*StringUtils.substringAfter(id, "_")*/ id;
    childMeta.setString(ImporterMetadata.DOC_REFERENCE, ref);

    // Set parent reference if you need it, and optionally remove 
    // the access token from it to keep it clean

    // String parentReference = parentRef.replaceFirst(
    //         "(.*)&access_token=.*", "$1");
    // childMeta.setEmbeddedParentReference(parentReference);
    // childMeta.setEmbeddedReference(ref);

    // We gathered enough data for a single doc, return it
  //  LOG.info(content);
    return new ImporterDocument(
            ref, streamFactory.newInputStream(content), childMeta);

and now it is splitting the document into individual events; when I log it out I can see it, but I get this error message: Facebook_32_Posts.txt

namely that on the second loop it cannot parse the JSON. I think my problem is somewhere around the return part:

return new ImporterDocument( ref, streamFactory.newInputStream(content), childMeta);

I guess the second parameter is wrong, but in your FacebookPost case it held just the message, which is a string, so I don't see the problem there. I am uploading the whole splitter file so you can check it:

package com.norconex.blog.facebook.crawler;

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.stream.XMLStreamException;

import org.apache.commons.configuration.XMLConfiguration;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.io.IOUtils;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import com.norconex.commons.lang.io.CachedStreamFactory;
import com.norconex.commons.lang.xml.EnhancedXMLStreamWriter;
import com.norconex.importer.doc.ImporterDocument;
import com.norconex.importer.doc.ImporterMetadata;
import com.norconex.importer.handler.ImporterHandlerException;
import com.norconex.importer.handler.splitter.AbstractDocumentSplitter;
import com.norconex.importer.handler.splitter.SplittableDocument;

public class FacebookDocumentSplitter extends AbstractDocumentSplitter {

public FacebookDocumentSplitter() {
}
private static final Logger LOG = 
        LogManager.getLogger(FacebookDocumentSplitter.class);

protected List<ImporterDocument> splitApplicableDocument(
        SplittableDocument parentDoc, OutputStream output,
        CachedStreamFactory streamFactory, boolean parsed)
        throws ImporterHandlerException{

    // First, we make sure only Facebook posts are split here by returning
    // null on non-post references.
   /* if (!FacebookUtils.isPosts(parentDoc.getReference())) {
        return null;
    } */
    //LOG.info("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!");
    List<ImporterDocument> postDocs = new ArrayList<>();
    ImporterMetadata parentMeta = parentDoc.getMetadata();

    JsonReader jsonReader = new JsonReader(parentDoc.getReader());

    try {
       // LOG.info("jsonreader: " + IOUtils.toString(parentDoc.getReader()));
    }catch (Exception e){

    }
    jsonReader.setLenient(true);

    JsonObject json = null;
    try {
        json = (JsonObject) new JsonParser().parse(jsonReader);
    } catch (ClassCastException e) {
        throw new ImporterHandlerException("Cannot parse JSON input.", e);
    }

    // Each top-level "data" element is a single document/post.
    JsonArray postsData = json.get("data").getAsJsonArray();
    Iterator<JsonElement> it = postsData.iterator();

    //LOG.info(parentDoc.getReader());
    while (it.hasNext()) {
        JsonObject postData = (JsonObject) it.next();
        //LOG.info(postData.toString());
        try {
            ImporterDocument doc = createImportDocument(
                    postData, parentDoc.getReference(), parentMeta,
                    streamFactory);
            if (doc != null) {

                postDocs.add(doc);
            }
        } catch (IOException e) {
            throw new ImporterHandlerException(e);
        }
    }
    return postDocs;
}

protected ImporterDocument createImportDocument(
        JsonObject json, String parentRef, ImporterMetadata parentMeta,
        CachedStreamFactory streamFactory) 
                throws IOException {

    if (json == null) {
        return null;
    }

    // METADATA
    ImporterMetadata childMeta = new ImporterMetadata();

    // Optionally assign values from master JSON document

// childMeta.load(parentMeta);

    // Parse and assign any values you need for a child
    //attending_count,start_time,end_time,category,name,place,description
    String id = getString(json, "id");
    childMeta.setString("id", id);
    //childMeta.setString("attending_count", getString(json, "attending_count"));
    childMeta.setString("start_time", getString(json, "start_time"));
    childMeta.setString("end_time", getString(json, "end_time"));
    //childMeta.setString("category", getString(json, "category"));
    //childMeta.setString("place", getString(json, "place"));
    childMeta.setString("name", getString(json, "name"));
    childMeta.setString("description", getString(json, "description"));
    //childMeta.setString("created_time", getString(json, "created_time"));

    //JsonObject from = (JsonObject) json.get("from");
    String fromName = "budapestpark";//getString(from, "name");
   //childMeta.setString("from_name", fromName);
    //childMeta.setString("from_id", getString(from, "id"));

    // Consider "message" as the document "content".
    //String content = getString(json, "name");
    String content = getString(json, "name");
    if (StringUtils.isBlank(content)) {
        return null;
    }

    // Create a unique reference for this child element.  Let's make it
    // the URL to access this post in a browser.
    // e.g. https://www.facebook.com/Disney/posts/10152684762485954
    String ref = "https://www.facebook.com/events/" /*+ fromName + "/events/" */
            + /*StringUtils.substringAfter(id, "_")*/ id;
    childMeta.setString(ImporterMetadata.DOC_REFERENCE, ref);

    // Set parent reference if you need it, and optionally remove 
    // the access token from it to keep it clean

    // String parentReference = parentRef.replaceFirst(
    //         "(.*)&access_token=.*", "$1");
    // childMeta.setEmbeddedParentReference(parentReference);
    // childMeta.setEmbeddedReference(ref);

    // We gathered enough data for a single doc, return it
    LOG.info(content);
    return new ImporterDocument(
            ref, streamFactory.newInputStream(content), childMeta);
}

// Convenience method for getting a string for a JSON object.
private String getString(JsonObject jsonObject, String memberName) {
    if (jsonObject == null) {
        return null;
    }
    JsonElement element = jsonObject.get(memberName);
    if (element != null) {
        return element.getAsString();
    }
    return null;
}

@Override
protected void loadHandlerFromXML(XMLConfiguration xml) throws IOException {
    // nothing extra to load
}

@Override
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer)
        throws XMLStreamException {
    // nothing extra to save
}

}

essiembre commented 7 years ago

The relevant part of your error is:

Caused by: java.lang.ClassCastException: com.google.gson.JsonPrimitive cannot be cast to com.google.gson.JsonObject

It looks like you are trying to read a primitive as an object. Searching online for that error turned up a few solutions, such as: http://stackoverflow.com/questions/20777884/com-google-gson-jsonprimitive-cannot-be-cast-to-com-google-gson-jsonobject-error
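
A minimal, standalone sketch of that fix (the class name and sample payload are invented for illustration, this is not the actual splitter): guard each array element with isJsonObject() instead of the blind (JsonObject) it.next() cast, which throws ClassCastException on primitives:

```java
import java.util.ArrayList;
import java.util.List;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class SafeJsonDemo {

    // Collect the "name" of every object element in the top-level "data"
    // array, skipping primitive elements instead of casting them blindly.
    public static List<String> extractNames(String payload) {
        JsonObject json = new JsonParser().parse(payload).getAsJsonObject();
        List<String> names = new ArrayList<>();
        for (JsonElement el : json.get("data").getAsJsonArray()) {
            if (el.isJsonObject()) {   // the guard: safe to cast only now
                names.add(el.getAsJsonObject().get("name").getAsString());
            }
            // primitives (the cause of the ClassCastException) are skipped
        }
        return names;
    }

    public static void main(String[] args) {
        // Mixed array: one object plus a bare string primitive.
        String payload =
                "{\"data\":[{\"id\":\"1\",\"name\":\"Event A\"},\"oops\"]}";
        System.out.println(extractNames(payload));
    }
}
```

In the splitter loop, the same guard would replace the (JsonObject) it.next() cast so non-object elements are skipped rather than aborting the second pass.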

Feel free to contact Norconex if you feel you need more "hands-on" assistance.

essiembre commented 7 years ago

Closing due to lack of feedback. Feel free to re-open with more details if you have not resolved your issue.