Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Unparseable date #244

Closed V3RITAS closed 8 years ago

V3RITAS commented 8 years ago

I would like to parse the Last-Modified date so it fits the format expected by the Solr TrieDateField class:

YYYY-MM-DDThh:mm:ssZ

(https://cwiki.apache.org/confluence/display/solr/Working+with+Dates)

As far as I can see the Last-Modified date looks like this:

Mon, 14 Mar 2016 12:39:57 GMT

I'm trying to use the DateFormatTagger to transform the date:

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger" fromField="Last-Modified" fromFormat="EEE, dd MMM yyyy HH:mm:ss z" toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />

Unfortunately I get a ParseException like this:

WARN  [FormatUtil] Invalid date format for field Last-Modified.
java.text.ParseException: Unparseable date: "Mon, 14 Mar 2016 12:39:57 GMT"
        at java.text.DateFormat.parse(Unknown Source)
        at com.norconex.importer.util.FormatUtil.formatDateString(FormatUtil.java:83)
        at com.norconex.importer.handler.tagger.impl.DateFormatTagger.tagApplicableDocument(DateFormatTagger.java:98)
        at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:53)
        at com.norconex.importer.Importer.tagDocument(Importer.java:522)
        at com.norconex.importer.Importer.executeHandlers(Importer.java:350)
        at com.norconex.importer.Importer.importDocument(Importer.java:321)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
        at com.norconex.importer.Importer.importDocument(Importer.java:195)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:298)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

I'm pretty sure that the fromFormat must be correct (please prove me wrong) so I don't understand why the date can't be parsed.

Is it possible that this is a problem with the default locale on my system (I think it's German) as the Last-Modified date contains day and month names?

I also noticed that if I remove the DateFormatTagger and let the crawler write all data to my file system the Last-Modified date is saved like this (with backslashes before the colons):

Mon, 14 Mar 2016 12\:39\:57 GMT

Maybe I have to add the backslashes to the fromFormat string?
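A note on the backslashes: they look like standard java.util.Properties escaping rather than part of the field value. This is an assumption, since the thread never says which format the crawler uses when it writes to the file system, but Properties.store does escape colons in values and Properties.load strips the escapes again. A minimal sketch (the class and method names are made up for illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesEscapeDemo {

    // Serializes one key/value pair in java.util.Properties text format.
    static String store(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        StringWriter out = new StringWriter();
        try {
            props.store(out, null); // null means no custom comment line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Stores the pair, loads it back, and returns the value as read.
    static String roundTrip(String key, String value) {
        Properties loaded = new Properties();
        try {
            loaded.load(new StringReader(store(key, value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return loaded.getProperty(key);
    }

    public static void main(String[] args) {
        String value = "Mon, 14 Mar 2016 12:39:57 GMT";
        // The stored text shows the colons escaped with backslashes.
        System.out.println(store("Last-Modified", value));
        // After loading, the value is back to its original form.
        System.out.println(roundTrip("Last-Modified", value));
    }
}
```

If that assumption holds, the backslashes exist only in the file on disk, and the fromFormat string should not need them.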

OkkeKlein commented 8 years ago

Maybe try fromFormat="EEE, dd MMM yyyy HH:mm:ss 'GMT'"

V3RITAS commented 8 years ago

Thanks, but this doesn't work either. :-(

But I just did a test and changed it to:

fromFormat="'Mon', dd 'Mar' yyyy HH:mm:ss 'GMT'"

This one works fine with the date mentioned in my first post (Mon, 14 Mar 2016 12:39:57 GMT) so I guess this really has something to do with the day and month names and the locale.

Is there a way to manually define a locale for this tagger?
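The locale dependence described here can be reproduced outside the crawler with plain java.text.SimpleDateFormat. This is only a sketch: the stack trace shows java.text.DateFormat.parse being called, but how the tagger builds its formatter is not shown in the thread, and the class and method names below are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LocaleParseDemo {

    // Returns true if the value parses with the given pattern and locale.
    static boolean parses(String value, String pattern, Locale locale) {
        try {
            new SimpleDateFormat(pattern, locale).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "Mon, 14 Mar 2016 12:39:57 GMT";
        String pattern = "EEE, dd MMM yyyy HH:mm:ss z";
        // English day and month abbreviations match the en_US locale...
        System.out.println("en_US: " + parses(value, pattern, Locale.US));
        // ...but not the German ones, so parsing fails under de_DE.
        System.out.println("de_DE: " + parses(value, pattern, Locale.GERMANY));
    }
}
```

The EEE and MMM pattern letters are matched against the locale's own day and month names, which is why quoting 'Mon' and 'Mar' as literals sidesteps the problem for that one date.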

jetnet commented 8 years ago

A locale can be specified with JVM start parameters, e.g.: -Duser.country=US -Duser.language=en
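Those system properties set the JVM's initial default locale, which a SimpleDateFormat created without an explicit locale picks up. The sketch below imitates that by calling Locale.setDefault in code instead of passing the flags; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class DefaultLocaleDemo {

    // Temporarily switches the JVM default locale, tries to parse,
    // then restores the previous default.
    static boolean parsesUnderDefault(Locale defaultLocale, String value, String pattern) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(defaultLocale);
        try {
            // No locale argument: the formatter uses the JVM default.
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        } finally {
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        String value = "Mon, 14 Mar 2016 12:39:57 GMT";
        String pattern = "EEE, dd MMM yyyy HH:mm:ss z";
        System.out.println("default de_DE: " + parsesUnderDefault(Locale.GERMANY, value, pattern));
        System.out.println("default en_US: " + parsesUnderDefault(Locale.US, value, pattern));
    }
}
```

Changing the default affects every locale-sensitive operation in the JVM, not only date parsing, which is worth keeping in mind before adding the flags to a launch script.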

essiembre commented 8 years ago

I just released a new snapshot of the Importer module which adds the ability to supply a locale. Replace the norconex-importer-XXX.jar in your lib folder with the corresponding jar from the new snapshot and give it a try.

To achieve consistent behavior between platforms, the locale now always defaults to "en_US" for parsing/formatting dates (the most common locale for date fields in document metadata). So it will likely work out of the box for you now, but if you encounter dates written in other languages, have a look at the new DateFormatTagger configuration usage to find out how to explicitly set the locale.

Basically two new attributes are now supported: fromLocale and toLocale.
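Conceptually, a source locale and a target locale amount to parsing with one and formatting with the other. The sketch below illustrates that idea with SimpleDateFormat; it is not the Importer's actual code, and the reformat helper is hypothetical. It also formats in UTC, an assumption made here because the target pattern ends in a literal 'Z'.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateReformatSketch {

    // Hypothetical helper: parse with the source locale, format with the target one.
    static String reformat(String value, String fromFormat, Locale fromLocale,
            String toFormat, Locale toLocale) {
        Date date;
        try {
            date = new SimpleDateFormat(fromFormat, fromLocale).parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + value, e);
        }
        SimpleDateFormat out = new SimpleDateFormat(toFormat, toLocale);
        // Assumption of this sketch: the literal 'Z' suffix is only truthful
        // if the time is written in UTC, so force UTC on the output side.
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        return out.format(date);
    }

    public static void main(String[] args) {
        // Prints 2016-03-14T12:39:57Z
        System.out.println(reformat(
                "Mon, 14 Mar 2016 12:39:57 GMT",
                "EEE, dd MMM yyyy HH:mm:ss z", Locale.US,
                "yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US));
    }
}
```

Without the explicit UTC time zone, SimpleDateFormat would format in the JVM's default time zone while still appending the quoted 'Z', so the result could be off by the local offset.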

FYI, the ability to pass a locale has also been added to CurrentDateTagger.

Please confirm.

essiembre commented 8 years ago

Thanks @jetnet, that's also a good way, and I will consider permanently adding those to the launch scripts to guarantee consistent behavior across environments. For now, I think the additions I just made will cover this, and they allow different source and target locales for the rare cases where that is wanted.

V3RITAS commented 8 years ago

Hi Pascal,

Thank you very much, it works perfectly now! It's really amazing how fast you can provide an update for the importer. :-)

Also thanks @jetnet, this could be helpful in the future!

essiembre commented 8 years ago

FYI, I just made a new snapshot release of the HTTP Collector that has this fix now.