dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

Provide normalize cleaners #24

Closed ghost closed 8 years ago

ghost commented 8 years ago

This change request adds new cleaners and modifies existing cleaners and annotators to provide the capability to normalize certain types of annotation data. It includes the addition of an isNormalized flag in the Entity type to enable cleaners and annotators to record when data has been modified to meet a specific normalized format. This enables annotators, cleaners, and consumers further down the pipeline to decide whether to use normalized data or refer back to the original text in the document.
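The flag's role can be illustrated with a minimal, self-contained sketch. It does not use Baleen's real UIMA-generated `Entity` class: the `Entity` stand-in, its `normalized` field, and the `normalizeWhitespace` and `valueFor` helpers are all hypothetical names chosen for illustration. It only shows the idea described above: a cleaner records that it changed a value, and a later stage decides whether to use the normalized value or fall back to the original document text.

```java
public class NormalizedFlagSketch {

    /** Hypothetical stand-in for an entity annotation; NOT Baleen's real Entity type. */
    static final class Entity {
        final String coveredText; // the original text as it appears in the document
        String value;             // the value passed down the pipeline
        boolean normalized;       // illustrative equivalent of the flag described above

        Entity(String coveredText) {
            this.coveredText = coveredText;
            this.value = coveredText;
        }
    }

    /** Hypothetical cleaner step: collapse runs of whitespace and record whether anything changed. */
    static void normalizeWhitespace(Entity e) {
        String cleaned = e.value.trim().replaceAll("\\s+", " ");
        if (!cleaned.equals(e.value)) {
            e.value = cleaned;
            e.normalized = true; // only set when the value was actually modified
        }
    }

    /** Hypothetical consumer policy: use the normalized value, or refer back to the original text. */
    static String valueFor(Entity e, boolean preferOriginal) {
        return (e.normalized && preferOriginal) ? e.coveredText : e.value;
    }

    public static void main(String[] args) {
        Entity e = new Entity("Salisbury   Plain");
        normalizeWhitespace(e);
        System.out.println(valueFor(e, false)); // normalized value
        System.out.println(valueFor(e, true));  // original document text
    }
}
```

Keeping the original text alongside the flag means a consumer never has to guess whether a value was rewritten upstream.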

A summary of the file additions / changes is as follows:

Addition of isNormalised field

baleen-uima/src/main/resources/types/semantic_type_system.xml
baleen-uima/src/main/java/uk/gov/dstl/baleen/types/semantic/Entity_Type.java
baleen-uima/src/main/java/uk/gov/dstl/baleen/types/semantic/Entity.java
baleen-uima/src/test/java/uk/gov/dstl/baleen/types/BaleenAnnotationTest.java
baleen-consumers/src/test/java/uk/gov/dstl/baleen/consumers/MongoTest.java
baleen-consumers/src/test/java/uk/gov/dstl/baleen/consumers/LegacyMongoTest.java
baleen-consumers/src/test/java/uk/gov/dstl/baleen/consumers/LegacyElasticsearchTest.java
baleen-consumers/src/test/java/uk/gov/dstl/baleen/consumers/ElasticsearchTest.java

New Normalization Cleaners

baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/helpers/AbstractNormalizeEntities.java
baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeDates.java
baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeOSGB.java
baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeTimes.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeDatesTest.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeOSGBTest.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeTimesTest.java
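The file list suggests a shared abstract helper with one concrete cleaner per data type. The sketch below shows one way such a split could look, as a template method in plain Java. It is an assumption, not the code in this pull request: `AbstractNormalizer`, `TimeNormalizer`, the `Entity` stand-in, and the "HHmm to HH:mm" rule are all hypothetical and chosen only to keep the example small.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

public class NormalizerHelperSketch {

    /** Hypothetical stand-in for an entity annotation; NOT Baleen's real Entity type. */
    static final class Entity {
        final String type;
        String value;
        boolean normalized;

        Entity(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Hypothetical shared helper: subclasses say what to normalize and how. */
    abstract static class AbstractNormalizer {
        final void process(List<Entity> entities) {
            for (Entity e : entities) {
                if (!shouldNormalize(e)) {
                    continue;
                }
                String result = normalize(e.value);
                // Flag the entity only if a different, usable value was produced.
                if (result != null && !result.equals(e.value)) {
                    e.value = result;
                    e.normalized = true;
                }
            }
        }

        abstract boolean shouldNormalize(Entity e);

        /** Returns the normalized form, or null if the value cannot be parsed. */
        abstract String normalize(String value);
    }

    /** Illustrative rule only: rewrite 24-hour "HHmm" times as "HH:mm". */
    static final class TimeNormalizer extends AbstractNormalizer {
        private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("HHmm");
        private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("HH:mm");

        @Override
        boolean shouldNormalize(Entity e) {
            return "Time".equals(e.type);
        }

        @Override
        String normalize(String value) {
            try {
                return LocalTime.parse(value, IN).format(OUT);
            } catch (DateTimeParseException ex) {
                return null; // leave unparseable values untouched
            }
        }
    }

    public static void main(String[] args) {
        List<Entity> entities = Arrays.asList(new Entity("Time", "0915"), new Entity("Person", "0915"));
        new TimeNormalizer().process(entities);
        for (Entity e : entities) {
            System.out.println(e.type + " " + e.value + " normalized=" + e.normalized);
        }
    }
}
```

With this shape, each new cleaner only supplies a type filter and a conversion, while the flag-setting logic lives in one place.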

Modified Cleaners and Annotators to support normalization changes

baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeWhitespace.java
baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/regex/LatLon.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/LatLonDDRegexTest.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/LatLonDMSRegexTest.java
baleen-annotators/src/test/java/uk/gov/dstl/baleen/annotators/cleaners/NormalizeWhitespaceTest.java

jbaker-dstl commented 8 years ago

I've made a few small changes to the code in this pull request and opened a new pull request #25. Closing this pull request as it has been superseded.

ghost commented 8 years ago

James,

Thanks for checking the code and spotting those minor changes and the need for additional tests. My apologies for not spotting them myself.

Regards,

Ian


From: James Baker [mailto:notifications@github.com]
Sent: 29 April 2016 09:15
To: dstl/baleen
Cc: Sheppard, Ian; Author
Subject: Re: [dstl/baleen] Provide normalize cleaners (#24)

I've made a few small changes to the code in this pull request and opened a new pull request #25 (https://github.com/dstl/baleen/pull/25). Closing this pull request as it has been superseded.
