OpenSextant / Xponents

Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Apache License 2.0

Gazetteer 2.0 -- Python ETL #56

Closed: mubaldino closed this issue 3 years ago

mubaldino commented 4 years ago

Type of Feature:

Description of Feature

Use Python Pandas and SQLite to stage all data sources in order to support the Merged Gazetteer output.
The current Gazetteer project depends on Kettle v6 to v9 and Java 8. There is now an incompatibility between the project and a git checkout on Linux: the Kettle "spoon" script reports the error "Line 130, Column 69: Invalid Escape Sequence" but does not say which file or which phase of processing triggered it.

This is not worth fixing in Kettle and the Gazetteer project. It is much easier to reimplement.
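
For illustration only, the staging idea described in this issue (load each source with Pandas, append it to a shared SQLite table, then read the merged rows back out) might look roughly like the sketch below. The table name `placenames`, the column list, and the helpers `stage_source` and `merged_rows` are hypothetical and are not taken from the Xponents code base; the real gazetteer schema and scripts may differ.

```python
import sqlite3

import pandas as pd

# Hypothetical shared schema; the actual Xponents gazetteer columns may differ.
COLUMNS = ["place_id", "name", "lat", "lon", "feat_class", "source"]


def stage_source(conn, frame, source_id):
    """Align one source's rows to the shared columns and append them to SQLite.

    Columns missing from the source are created and left empty (stored as NULL).
    Returns the number of rows staged.
    """
    staged = frame.reindex(columns=COLUMNS[:-1]).copy()
    staged["source"] = source_id  # record which data set each row came from
    staged.to_sql("placenames", conn, if_exists="append", index=False)
    return len(staged)


def merged_rows(conn):
    """Read every staged row back as a single frame in a stable order."""
    query = "SELECT * FROM placenames ORDER BY source, place_id"
    return pd.read_sql_query(query, conn)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a persistent stage

    # Two made-up sources with slightly different columns.
    source_a = pd.DataFrame(
        {"place_id": ["A1"], "name": ["Boston"], "lat": [42.36],
         "lon": [-71.06], "feat_class": ["P"]}
    )
    source_b = pd.DataFrame(
        {"place_id": ["B1", "B2"], "name": ["Paris", "Lyon"],
         "lat": [48.86, 45.76], "lon": [2.35, 4.84]}
    )

    stage_source(conn, source_a, "source_a")
    stage_source(conn, source_b, "source_b")

    # The merged frame could then be written out as the merged gazetteer file.
    print(merged_rows(conn).to_string(index=False))
    conn.close()
```

Keeping every source in one SQLite table means the merged output reduces to a single query, which is one plausible reason this approach is easier to maintain than a multi-step Kettle transformation.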

mubaldino commented 3 years ago

In v3.3, the gazetteer scripts in ./solr/ were heavily refactored.

mubaldino commented 3 years ago

Release v3.4: SQLite-based gazetteer curation is far more fluid and manageable. It still relies on the Merged gazetteer file, but for now this is a simpler means of integrating multiple data sets and avoiding the complexity of Solr plugins/extensions.