data61 / gnaf

GNAF geocoder and more

Victorian addresses all over Australia #2

Closed rdengate closed 8 years ago

rdengate commented 8 years ago

I have plotted a list of dance clubs and calisthenics in Victoria (from Data.gov.au: https://data.sa.gov.au/data/dataset/dance-in-victoria-gfyl/resource/2bf45b00-3442-4fd0-b892-c5b1a3a22540) on Terria map.

Some addresses were geocoded to other states and territories (image: vic_datapoints).

In addition, the following addresses couldn't be found:

1/8 Blackburn Scout Hall,Blackburn,VIC,3130
Red Studio, Springers Leisure Centre,Keysborough,VIC,3173
Dandenong Workers Club,Dandenong,VIC,3175
4 Wilga Avenue,Selby,VIC,3159
Suite 2,Brighton,VIC,3186
Suite 1,Blackburn,VIC,3130
14 Wilson Avenue,Brunswick,VIC,3056
Studio 8,Hallam,VIC,3803
Fairway Receptions,Ardeer,VIC,3022
The Jessie Morris Community Hall, Corner Devon Road & Oak Street,Oak Park,VIC,3046
Studio 4/40 Green Street,Prahran,VIC,3181
Scout Hall,McKinnon,VIC,3204
1/21 Reserve Rd ,Melton,VIC
Masonic Hall,Berwick,VIC,3806
243 Bay Street,Brighton,VIC,3186
Preston Neighbourhood House,Preston,VIC,3072
South Street,Wodonga,VIC,3690
Gate 2, Centenary Avenue,Melton,VIC,3337
Theo's Church Hall,Surrey Hills,VIC,3127
Unit 5, 15-19 Vesper Drive,Narre Warren,VIC,3805
Scout Hall,Heyfield,VIC,3858
Glengarry Hall,Glengarry,VIC,3854
Guide Hall,Morwell,VIC,3840

neilbacon commented 8 years ago

It's been very fruitful working through these cases, thank you Rebecca.

Summary of discussion with Rebecca:

The data.gov link above no longer works; the data is attached: dance.csv.txt

For this data you'll get better results just using these fields in this order (with spaces between): "{Address} {Suburb} {State} {Postcode}" (the search rewards terms appearing in the correct order but still matches out of order). Also there's no point in trying {Address} if it doesn't contain a street address e.g. "PO BOX 23" or "Suite 1".
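As a rough illustration of the field ordering described above, here is a minimal Python sketch. The function names and the regular expression used to guess whether {Address} holds a street address are assumptions made for this example; they are not part of gnaf or its UI, and the heuristic is deliberately crude.

```python
import re

# Hypothetical heuristic: a street address contains a number followed by a word,
# e.g. "405 Camberwell". "PO BOX 23" and "Suite 1" do not match.
_STREET_ADDRESS = re.compile(r"\b\d+[A-Za-z]?\s+[A-Za-z]{2,}")


def looks_like_street_address(address):
    """Guess whether the Address field is worth including in the query."""
    return bool(address) and _STREET_ADDRESS.search(address) is not None


def build_query(address, suburb, state, postcode):
    """Join the fields in the order Address, Suburb, State, Postcode with spaces."""
    parts = [suburb, state, postcode]
    if looks_like_street_address(address):
        parts.insert(0, address)
    # Skip empty fields so the query has no stray spaces.
    return " ".join(p.strip() for p in parts if p and p.strip())


print(build_query("405 Camberwell Road", "Camberwell", "VIC", "3124"))
print(build_query("Suite 1", "Blackburn", "VIC", "3130"))
```

With this sketch, a row whose {Address} is only "Suite 1" or "PO BOX 23" is searched by suburb, state and postcode alone.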

I think this test was done with a comma between each field. The tokenization we're using in Elasticsearch does not break on commas (only on whitespace), so this would cause the bad results observed. In response I've changed the UI to substitute a space for each comma in the input.
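To see why the commas matter, the following Python sketch imitates whitespace-only tokenization with `str.split()`. It is only an approximation of the Elasticsearch analyzer (the real index configuration is not shown in this thread), and the function names are made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace only, roughly like a whitespace tokenizer."""
    return text.split()


def commas_to_spaces(text):
    """The substitution described above: replace each comma with a space."""
    return text.replace(",", " ")


raw = "4 Wilga Avenue,Selby,VIC,3159"

# Without the substitution, street type, suburb, state and postcode fuse into one token.
print(whitespace_tokens(raw))
# -> ['4', 'Wilga', 'Avenue,Selby,VIC,3159']

# With the substitution, every field becomes its own token.
print(whitespace_tokens(commas_to_spaces(raw)))
# -> ['4', 'Wilga', 'Avenue', 'Selby', 'VIC', '3159']
```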

The 2nd address in this data set, 405 CAMBERWELL ROAD, CAMBERWELL VIC 3124, is an interesting test case because the correct result was ranked 5th, behind results with other street numbers. There are inherent issues with scoring fuzzy matches: because fuzzy matches are often spelling mistakes with a low term frequency, they can be assigned a high score. To avoid this, Lucene assigns the correct score for exact matches but arbitrarily uses 1 for fuzzy matches, and this must be responsible for the exact match scoring lower in this case. (I'd hoped it would use the term frequency of the search term and discount the score according to the edit distance.) This has been fixed by using fuzzy matching to find candidates, then scoring them with exact matching.
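The two-stage idea (fuzzy matching to collect candidates, exact matching to score them) can be sketched in plain Python. This is a toy illustration, not the actual gnaf or Lucene implementation: the score here is simply the number of query terms that appear exactly, the function names are invented, and the neighbouring street numbers in the usage example are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_hit(term, tokens, max_edits=2, prefix_length=2):
    """True if some token shares the term's prefix and is within max_edits of it."""
    return any(t[:prefix_length] == term[:prefix_length]
               and edit_distance(term, t) <= max_edits
               for t in tokens)


def search(query, addresses):
    terms = query.upper().split()
    scored = []
    for addr in addresses:
        tokens = addr.upper().replace(",", " ").split()
        # Stage 1: fuzzy matching only decides whether the address is a candidate.
        if not any(fuzzy_hit(term, tokens) for term in terms):
            continue
        # Stage 2: the score counts exact term matches only.
        score = sum(1 for term in terms if term in tokens)
        scored.append((score, addr))
    scored.sort(key=lambda pair: -pair[0])  # stable sort, highest score first
    return [addr for _, addr in scored]


addresses = [
    "407 CAMBERWELL ROAD, CAMBERWELL VIC 3124",  # made-up neighbour
    "405 CAMBERWELL ROAD, CAMBERWELL VIC 3124",
    "45 CAMBERWELL ROAD, CAMBERWELL VIC 3124",   # made-up neighbour
]
print(search("405 Camberwell Road Camberwell VIC 3124", addresses)[0])
```

Because the fuzzy stage contributes nothing to the score, a near-miss street number such as 407 can still be returned as a candidate but cannot outrank the exact match.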

There is more we can do, but this is getting into the territory of the law of diminishing returns (so don't worry about the following unless interested - we'd need an evaluation data set to verify that any of this really does help) ...

The other fields are more likely to do harm than good, but in some cases (I suspect only rarely) it's possible that {Name} might be found in gnaf's BUILDING_NAME field, e.g. try "{Name} {Suburb} {State} {Postcode}" only if {Address} didn't work.

Another approach: since {Name} should only be matched against gnaf's BUILDING_NAME (or possibly ADDRESS_SITE_NAME), we can make a more specific query than the one used in the UI, such as:

curl -XPOST 'gnaf.it.csiro.au/es/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        {"match": {"buildingName": {"query": "Adrians Dance Classes", "fuzziness": 2, "prefix_length": 2}}},
        {"match": {"d61Address": {"query": "405 Camberwell Road Camberwell VIC 3124", "fuzziness": 2, "prefix_length": 2}}}
      ]
    }
  },
  "size": 10
}'

and we can take advantage of the specific fields for state and postcode with:

curl -XPOST 'gnaf.it.csiro.au/es/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        {"match": {"buildingName": {"query": "Adrians Dance Classes", "fuzziness": 2, "prefix_length": 2}}}
      ],
      "must": [
        {"match": {"d61Address": {"query": "405 Camberwell Road Camberwell VIC 3124", "fuzziness": 2, "prefix_length": 2}}},
        {"term": {"stateAbbreviation": "VIC"}},
        {"term": {"postcode": "3124"}}
      ]
    }
  },
  "size": 10
}'
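When running such queries over a whole CSV file, it may be easier to build the request body programmatically so that the JSON is always well-formed. The Python sketch below mirrors the query above; the field names (`buildingName`, `d61Address`, `stateAbbreviation`, `postcode`) are taken from it, while the function name `build_search_body` is hypothetical. It only prints the body and does not send the request.

```python
import json


def build_search_body(name, address, state, postcode, size=10):
    """Build a bool query: building name is optional, address/state/postcode are required."""
    fuzzy = {"fuzziness": 2, "prefix_length": 2}
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"buildingName": {"query": name, **fuzzy}}},
                ],
                "must": [
                    {"match": {"d61Address": {"query": address, **fuzzy}}},
                    {"term": {"stateAbbreviation": state}},
                    {"term": {"postcode": postcode}},
                ],
            }
        },
        "size": size,
    }


body = build_search_body(
    "Adrians Dance Classes",
    "405 Camberwell Road Camberwell VIC 3124",
    "VIC",
    "3124",
)
# json.dumps quotes every key, so the output can be passed to curl -d as-is.
print(json.dumps(body, indent=2))
```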

We could do the same for suburb, but we'd need to take into account gnaf's LOCALITY_VARIANT (for alternative or nearby names, which are already taken into account in the d61Address field).

As for the list of addresses not found, the first 3 contain organisation names and not street addresses, so we can't expect these to work well with an address database like gnaf. We'd have to add a database of organisation names and addresses to make these work. These will suffer from the same spurious matching and comma issues mentioned above. Nevertheless, gnaf does have some organisation names in the BUILDING_NAME field.

Your 1st example, "1/8 Blackburn Scout Hall,Blackburn,VIC,3130": searching for "Blackburn Scout Hall" or "Blackburn Scout Hall,Blackburn,VIC,3130" does get the correct result, "SCOUT HALL, 20 MCCRACKEN AVENUE, BLACKBURN SOUTH VIC 313", so in this case it is the erroneous "1/8" in the input that stops it being found.

Your 2nd example "Red Studio, Springers Leisure Centre,Keysborough,VIC,3173" works fine for me.

Your 3rd example "Dandenong Workers Club,Dandenong,VIC,3175" isn't in gnaf; some workers clubs are in the BUILDING_NAME field, but not Dandenong's.

Your 4th example "4 Wilga Avenue,Selby,VIC,3159" is fixed by changing commas to spaces as previously discussed.

Thanks again for the useful test cases!

rdengate commented 8 years ago

Thanks Neil!