apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Enable extension parsing for SitemapParser #749

Closed: jnioche closed this issue 4 years ago

jnioche commented 5 years ago

https://github.com/crawler-commons/crawler-commons/pull/218/ introduced support for sitemap extensions. These are not active by default and should be made configurable. The extension data found (if any) would be added to the outlink metadata and could therefore be used by a ParseFilter.

evanhalley commented 4 years ago

I'm curious about the implementation for this one. It looks like the sitemap extension attributes are regular Java objects (instead of a Map<String, String>, for example). This means we would essentially need a helper class that can convert an extension attributes object from POJO to Strings, which we could then store in the Metadata object attached to the Outlink. You could then access them by the keys as specified in the sitemap XML.

For example, using the key video:duration to get the video duration out of the metadata object.

Would an implementation like this be a solution for this issue?
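As a sketch of that helper idea, here is a minimal POJO-to-Strings conversion. The VideoAttributes class below is a hypothetical stand-in for a crawler-commons attribute POJO, and a plain Map stands in for StormCrawler's Metadata class; names and fields are assumptions for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtensionMetadataHelper {

    // Hypothetical stand-in for a crawler-commons extension attribute POJO.
    static class VideoAttributes {
        String title;
        Integer duration; // seconds

        VideoAttributes(String title, Integer duration) {
            this.title = title;
            this.duration = duration;
        }
    }

    // Flatten the POJO into String key/values, keyed like the sitemap XML tags.
    static Map<String, String> toMetadata(VideoAttributes v) {
        Map<String, String> md = new LinkedHashMap<>();
        if (v.title != null) {
            md.put("video:title", v.title);
        }
        if (v.duration != null) {
            md.put("video:duration", String.valueOf(v.duration));
        }
        return md;
    }
}
```

A consumer would then look up values with the same keys used in the sitemap, e.g. toMetadata(...).get("video:duration").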

sebastian-nagel commented 4 years ago

Hi @evanhalley, yes, and some of the attributes are themselves complex objects, e.g. "video:price".

Maybe it's better to split the implementation into two parts:

  1. add asMap() methods to the attribute objects in crawler-commons. That seems more maintainable than implementing it in storm-crawler. There are already toString() methods for the attributes. The asMap() method should return a Map<String, String[]> because some attributes are multi-valued, e.g. video:price, which can be specified in multiple currencies.
  2. in storm-crawler: add the code to enable the sitemap extensions via a configuration property and to add the attributes to the metadata. It may be worth discussing the following points:
    • enable all extensions, or make it configurable which extensions are enabled
    • (optionally) prefix the metadata keys - sitemap.video:price, cf. #776

What's your opinion?
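A sketch of what such an asMap() could look like for a multi-valued attribute. The Price type below is a hypothetical stand-in, not the real crawler-commons VideoAttributes structure; only the Map<String, String[]> return shape follows the suggestion above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AsMapSketch {

    // Hypothetical subset of a VideoAttributes-style POJO: one price per currency.
    static class Price {
        String currency;
        float value;

        Price(String currency, float value) {
            this.currency = currency;
            this.value = value;
        }
    }

    // Multi-valued attributes map to String[], e.g. one video:price per currency.
    static Map<String, String[]> asMap(Price[] prices) {
        Map<String, String[]> map = new LinkedHashMap<>();
        String[] vals = new String[prices.length];
        for (int i = 0; i < prices.length; i++) {
            vals[i] = prices[i].value + " " + prices[i].currency;
        }
        map.put("video:price", vals);
        return map;
    }
}
```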

jnioche commented 4 years ago

thanks @evanhalley and @sebastian-nagel

I had a first prototype dealing with images only, which looked like this:

    Map<Extension, ExtensionMetadata[]> attributes = smurl.getAttributes();
    if (attributes != null) {
        // handle image only for now
        ExtensionMetadata[] emds = attributes.getOrDefault(Extension.IMAGE,
                new ExtensionMetadata[] {});
        for (ExtensionMetadata emd : emds) {
            ImageAttributes e = (ImageAttributes) emd;
            String val = e.getCaption();
            if (val != null) {
                customKeyVal.add(Extension.IMAGE.name() + ".caption");
                customKeyVal.add(val);
            }
            val = e.getGeoLocation();
            if (val != null) {
                customKeyVal.add(Extension.IMAGE.name() + ".geolocation");
                customKeyVal.add(val);
            }
            val = e.getTitle();
            if (val != null) {
                customKeyVal.add(Extension.IMAGE.name() + ".title");
                customKeyVal.add(val);
            }
            URL u = e.getLicense();
            if (u != null) {
                customKeyVal.add(Extension.IMAGE.name() + ".license");
                customKeyVal.add(u.toExternalForm());
            }
            u = e.getLoc();
            if (u != null) {
                customKeyVal.add(Extension.IMAGE.name() + ".loc");
                customKeyVal.add(u.toExternalForm());
            }
        }
    }

and a config element to declare that we wanted the image extensions. The plan was to extend this to the other types in a similar way; however, as suggested by @sebastian-nagel, having a way of getting the Map from crawler-commons would be a lot nicer and would require less code in StormCrawler.

+1 to using a prefix as well.
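For illustration, the prefixing could be as simple as prepending "sitemap." to an extension-qualified key. The helper below is a sketch of one possible scheme; the exact key format is an assumption, cf. the discussion in #776:

```java
public class SitemapKeyPrefixer {

    static final String PREFIX = "sitemap.";

    // Build a metadata key like "sitemap.image.caption" from the extension
    // name and the attribute name (the key scheme here is illustrative only).
    static String key(String extension, String attribute) {
        return PREFIX + extension.toLowerCase(java.util.Locale.ROOT) + "." + attribute;
    }
}
```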

evanhalley commented 4 years ago

1) I like splitting it into two pieces, one in crawler-commons and the other in storm-crawler. It's more manageable and can benefit others who have a crawler-commons dependency.

2) I like the granular configurability, and it should be painless to implement (vs. an on/off switch for all extensions), especially if there is CPU overhead in doing extension parsing.
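A minimal sketch of that granular configuration: mapping a configured list of extension names onto the set of extensions to enable. The Extension enum below is a local stand-in for illustration, not the crawler-commons one, and the config shape is an assumption:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ExtensionConfig {

    // Local stand-in: crawler-commons defines its own Extension enum.
    enum Extension { IMAGE, VIDEO, NEWS, LINKS, MOBILE }

    // Turn a configured list like ["image", "VIDEO"] into the set of
    // extensions to enable; unknown names are silently skipped for brevity.
    static Set<Extension> enabledExtensions(List<String> configured) {
        Set<Extension> enabled = new LinkedHashSet<>();
        for (String name : configured) {
            try {
                enabled.add(Extension.valueOf(name.trim().toUpperCase(Locale.ROOT)));
            } catch (IllegalArgumentException e) {
                // skip unknown extension names
            }
        }
        return enabled;
    }
}
```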

And +1 for using the prefix as well.

I'll submit an issue to crawler-commons and start working on that piece.

jnioche commented 4 years ago

that would be great, thanks @evanhalley

jnioche commented 4 years ago

Hi @evanhalley, crawler-commons 1.1 should be released soon but there is a release candidate for it - just in case you fancy working on this issue ;-)

evanhalley commented 4 years ago

@jnioche sounds great, I'll continue my work on implementing this enhancement

jnioche commented 4 years ago

Depends on #807