DigitalCommons / mykomap-monolith

A web application for mapping initiatives in the Solidarity Economy

[CWM] Add the IFFCO membership data to the CWM. #47

Open lin-d-hop opened 2 weeks ago

lin-d-hop commented 2 weeks ago

Description

Add the IFFCO membership data to the CWM. This spreadsheet lists all the datasets and their source files

This will involve:

Acceptance Criteria

ColmDC commented 4 days ago

https://github.com/DigitalCommons/cwm-test-data/issues/1

wu-lee commented 3 days ago

This is a sub-issue of #54, I think...

wu-lee commented 2 days ago

Mapbox geocoded data now deployed here https://dev.data.solidarityeconomy.coop/iffco/

John recommended Mapbox as a better choice than GeoAPIfy, at least for this dataset.

It's pretty big, so takes hours to geocode!

ColmDC commented 2 days ago

Now that we have the data in the CWM we can see how many are way off!

https://dev.maps.coop/cwm/?datasetId=delhi

Select India in the Directory and it's easy to see about 70 spread across the rest of the world.

@John Can we check what confidence Mapbox gave for these ones? Is there a threshold where we can treat them as not having geocodes?

King-Mob commented 1 day ago

@ColmDC Mapbox only gives confidence levels when you give it a structured input and even then, only sometimes. I've tried for half an hour to get a confidence level for one of the IFFCO addresses without success.

Our data has states and districts though, as well as addresses. If you use these in the Mapbox query then it will at least bind the result to the district.

The way to do this in the code would be instead of everything being under the q parameter:

https://api.mapbox.com/search/geocode/v6/forward?q=55-17-2 TO 4, 5th Floor,VIJAYAWADA,ANDHRA PRADESH&access_token=${MAPBOX_API_KEY}

You would have this, where each part of the address goes in its own field:

https://api.mapbox.com/search/geocode/v6/forward?country=India&address_line1=55-17-2 TO 4, 5th Floor,&region=ANDHRA PRADESH&district=VIJAYAWADA&access_token=${MAPBOX_API_KEY}
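The two URLs above contain spaces and commas, so they would need URL-encoding before being sent. A minimal Python sketch of building both query styles follows. The function names are hypothetical, and the field names (`address_line1`, `region`, `district`, `country`) are simply copied from the example above rather than checked against Mapbox's documentation, so treat them as assumptions.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs in this comment.
BASE_URL = "https://api.mapbox.com/search/geocode/v6/forward"


def free_form_url(query: str, token: str) -> str:
    """Build the current style of request: the whole address under `q`."""
    return f"{BASE_URL}?{urlencode({'q': query, 'access_token': token})}"


def structured_url(address_line1: str, district: str, region: str,
                   country: str, token: str) -> str:
    """Build a request with one field per address part.

    The field names mirror the example URL in this thread; whether Mapbox
    accepts every one of them is an assumption here.
    """
    params = {
        "country": country,
        "address_line1": address_line1,
        "region": region,
        "district": district,
        "access_token": token,
    }
    # Leave out empty fields rather than sending blank parameters.
    params = {key: value for key, value in params.items() if value}
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(structured_url("55-17-2 TO 4, 5th Floor,", "VIJAYAWADA",
                         "ANDHRA PRADESH", "India", "YOUR_TOKEN"))
```

`urlencode` encodes spaces as `+` and commas as `%2C`, which avoids the ambiguity of pasting raw addresses into the query string.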

One thing I also noticed in the data is that some of the items are just "Village: villagename", e.g. "Village: PULIMERU". GeoAPIfy handles these better than Mapbox.

ColmDC commented 1 day ago

Set the Primary Activity for all of them to ICA10, i.e. Agriculture.

ColmDC commented 1 day ago

For clarity, are you suggesting there would be value in rerunning the Mapbox geocoder with the query restructured as illustrated above, @King-Mob?

King-Mob commented 1 day ago

@ColmDC yes I am

ColmDC commented 15 hours ago

Is IFFCO now using the improved geocoded data?

wu-lee commented 1 hour ago

The datafactory gets something it interprets as a confidence from Mapbox when it's using that for geocoding. Let me check... it uses a value called "relevance".

The docs here just say the results are "ordered by relevance" but don't mention it as an attribute. The geocoding is performed using the geocoder Gem, but the docs and source code for that don't say much about the "relevance" parameter except that I can see that it sorts results by it.

wu-lee commented 1 hour ago

Ok, found a page on Mapbox's site describing relevance:

https://docs.mapbox.com/help/getting-started/geocoding/

This suggests that relevance is a useful equivalent for confidence. We don't include the score value mentioned, but that seems to mainly be relevant for things which are commonly searched for, which may not add much to an Indian Agricultural Co-op address, as I think those will be rarely searched for. And our own geocoding might skew the results...

Result prioritization in forward geocoding

When a forward geocoding query (human-readable text like “San Francisco”) is submitted to the Mapbox Geocoding API, the geocoder applies filters on the backend to sort the results that match, or partially match, the searched text. These filters are textual relevance and prominence score.

Textual relevance

When the geocoder sorts and prioritizes results, the first filter that is applied is relevance. relevance is a value that indicates how well a feature in our dataset matches the query. This property is surfaced in the Geocoding API response, and is a numerical score from 0 (the result does not match the queried text at all) to 1 (the result matches the queried text most completely). You can use the relevance property to remove results that don't fully match the query.

In this example query for the address “515 15th St NW, Washington, DC 20004”, the expected result is first in the response with a relevance of 0.875. Other results in other cities, since they don’t match the search text as closely, have a relevance score of 0.2.

https://api.mapbox.com/geocoding/v5/mapbox.places/515%2015th%20St%20NW%2C%20Washington%2C%20DC%2020004.json?types=address&access_token=YOUR_MAPBOX_ACCESS_TOKEN

Prominence score

In the case that multiple features have the same relevance score, a second filter called score is applied. This is based on the popularity or prominence of a feature. For example, a search for “Paris” will equally match “Paris, France” and “Paris, Texas” — they’ll have the same relevance score. The score filter helps break this tie on the backend, and surfaces “Paris, France” first since it is a more popular feature:

https://api.mapbox.com/geocoding/v5/mapbox.places/paris.json?access_token=YOUR_MAPBOX_ACCESS_TOKEN

wu-lee commented 1 hour ago

Is IFFCO now using the improved geocoded data?

Just to answer this, it's currently using the Mapbox data, but not attempting to do structured queries as described, because that would require ditching all geocoding code being used currently and inserting custom mapbox queries.

If the reason for this is to get confidence values, we already get them. Looking at the confidence frequency histograms (see below), it's not totally obvious that GeoAPIfy is worse, although the two do have different shapes. [later] Actually, maybe GeoAPIfy is a bit worse, as it has several large spikes below 50%.

If the reason is to improve the accuracy, maybe it's worth it on a longer timescale; I'm not sure the current timescale allows it. If we do that, perhaps the relevance can be used to assess the improvement.
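On the earlier question of a threshold below which a result is treated as having no geocode, here is a rough Python sketch of what that could look like. The record shape, the function name and the cut-off of 50 are all hypothetical and are not taken from the datafactory code. Relevance is assumed to be stored as a percentage, as in the histograms in this comment.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class GeocodedRecord:
    """Hypothetical shape for one geocoded initiative."""
    identifier: str
    lat: Optional[float]
    lng: Optional[float]
    relevance: Optional[int]  # percentage 0..100, None if the geocoder gave none


def clear_weak_geocodes(records: List[GeocodedRecord],
                        min_relevance: int) -> List[GeocodedRecord]:
    """Return copies of the records, blanking coordinates for weak matches.

    A record with no relevance value at all is treated as a weak match.
    """
    cleaned = []
    for record in records:
        if record.relevance is None or record.relevance < min_relevance:
            cleaned.append(replace(record, lat=None, lng=None))
        else:
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    # Illustrative values only, not real IFFCO rows.
    sample = [
        GeocodedRecord("example-1", 16.5, 80.6, 66),
        GeocodedRecord("example-2", 51.5, -0.1, 31),
    ]
    for record in clear_weak_geocodes(sample, min_relevance=50):
        print(record)
```

The relevance value is kept on the blanked records so that the cut-off can be revisited later without geocoding again.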

Mapbox

relevance count frequency bar
(NULL) 5
30 3
31 14
33 1
34 2
35 2
36 1
38 1
39 2
40 4
41 1
42 5
43 3
44 2
45 8
46 14
47 5
48 33
49 27
50 35
51 43
52 47
53 105 -
54 57
55 222 --
56 459 ----
57 305 ---
58 635 ------
59 624 ------
60 1399 -------------
61 1021 ----------
62 1883 ------------------
63 1555 ---------------
64 2952 -----------------------------
65 620 ------
66 4425 --------------------------------------------
67 996 ---------
68 1516 ---------------
69 2846 ----------------------------
70 1988 -------------------
71 762 -------
72 2407 ------------------------
73 701 -------
74 2094 --------------------
75 822 --------
76 693 ------
77 591 -----
78 936 ---------
79 491 ----
80 525 -----
81 381 ---
82 277 --
83 484 ----
84 109 -
85 72
86 20
87 57
88 16
89 34
90 10
91 19
92 5
93 10
94 12
95 2
96 6
97 12
98 22
99 23
100 164 -

GeoAPIfy

confidence count frequency bar
(NULL) 3770 -------------------------------------
0 7967 -------------------------------------------------------------------------------
1 206 --
2 113 -
3 96
4 68
5 96
6 61
7 63
8 52
9 33
10 77
11 60
12 448 ----
13 37
14 78
15 31
16 81
17 75
18 154 -
19 93
20 79
21 61
22 116 -
23 37
24 27
25 12145 -------------------------------------------------------------------------------------------------------------------------
26 15
27 26
28 21
29 6
30 22
31 27
32 18
33 37
34 16
35 14
36 26
37 27
38 5
39 5
40 82
41 13
42 12
43 6
44 21
45 255 --
46 10
47 13
48 17
49 7
50 3123 -------------------------------
51 94
52 80
53 76
54 107 -
55 81
56 159 -
57 169 -
58 83
59 59
60 861 --------
61 106 -
62 116 -
63 208 --
64 100 -
65 112 -
66 312 ---
67 96
68 96
69 84
70 126 -
71 90
72 109 -
73 71
74 61
75 90
76 59
77 55
78 45
79 38
80 401 ----
81 163 -
82 34
83 36
84 17
85 26
86 20
87 10
88 21
89 7
90 381 ---
91 10
92 15
93 10
94 29
95 52
96 12
97 1
100 687 ------
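To compare the two histograms above with a number rather than by eye, one option is the share of records that fall below a chosen cut-off. The Python sketch below parses rows in the same "value count bar" layout. The function names and the cut-off are hypothetical, the sample is only a five-row excerpt of the GeoAPIfy table, and counting `(NULL)` rows as below the cut-off is an assumption.

```python
from typing import Dict, Optional

# A short excerpt of the GeoAPIfy table above, not the full data.
SAMPLE = """
confidence count frequency bar
(NULL) 3770 -------------------------------------
0 7967 -----------------------------------------
25 12145 ---------------------------------------
50 3123 -------------------------------
100 687 ------
"""


def parse_histogram(text: str) -> Dict[Optional[int], int]:
    """Map each score (None for the (NULL) row) to its record count."""
    counts: Dict[Optional[int], int] = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header row and anything without a numeric count.
        if len(parts) < 2 or not parts[1].isdigit():
            continue
        score = None if parts[0] == "(NULL)" else int(parts[0])
        counts[score] = counts.get(score, 0) + int(parts[1])
    return counts


def share_below(counts: Dict[Optional[int], int], cutoff: int) -> float:
    """Fraction of records scoring under the cut-off (missing scores included)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    below = sum(count for score, count in counts.items()
                if score is None or score < cutoff)
    return below / total


if __name__ == "__main__":
    histogram = parse_histogram(SAMPLE)
    print(f"{share_below(histogram, 50):.1%} of the excerpt is below 50")
```

Running the same function over both full tables would give one figure per geocoder for any cut-off under discussion.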

wu-lee commented 35 minutes ago

But then, having said that, I've just run the GeoAPIfy data through the merge and convert process so I can look visually at the distribution of the IFFCO pins. And GeoAPIfy looks better, as it doesn't put so many pins outside India.

This is with mapbox, i.e. the currently deployed data:

[image: map of IFFCO pins geocoded with Mapbox]

And this is with the GeoAPIfy geocoded IFFCO data:

[image: map of IFFCO pins geocoded with GeoAPIfy]

The one outlier was geocoded using:

VILL TANGRA GRAM PO SUNDERPUR P.S. BONGAON NORTH24-PARGANAS WEST BENGAL INDIA

...and it has a geoapify confidence of 0.
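Pins that land outside India could also be found without looking at the map, using a crude bounding-box check. This is a sketch under stated assumptions: the box corners are rough approximations chosen for illustration, the function names are hypothetical, and a box this coarse also covers parts of neighbouring countries, so it only catches results that are far off.

```python
from typing import Iterable, List, Optional, Tuple

# Rough bounding box around India: (min_lat, min_lng, max_lat, max_lng).
# Approximate values chosen for illustration, not an authoritative boundary.
INDIA_BBOX = (6.0, 68.0, 37.5, 97.5)

# (identifier, latitude, longitude); coordinates may be missing.
Pin = Tuple[str, Optional[float], Optional[float]]


def is_inside_bbox(lat: float, lng: float,
                   bbox: Tuple[float, float, float, float] = INDIA_BBOX) -> bool:
    """True when the point falls within the box, edges included."""
    min_lat, min_lng, max_lat, max_lng = bbox
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng


def find_outliers(pins: Iterable[Pin]) -> List[str]:
    """Return identifiers of pins with coordinates outside the box.

    Pins with no coordinates are skipped: they are ungeocoded, not misplaced.
    """
    outliers = []
    for identifier, lat, lng in pins:
        if lat is None or lng is None:
            continue
        if not is_inside_bbox(lat, lng):
            outliers.append(identifier)
    return outliers


if __name__ == "__main__":
    # Illustrative coordinates only, not real IFFCO rows.
    pins = [
        ("near-vijayawada", 16.5, 80.6),
        ("near-london", 51.5, -0.1),
        ("not-geocoded", None, None),
    ]
    print(find_outliers(pins))
```

Combined with the confidence value, this would allow a result like the one above (confidence 0 and outside the box) to be treated as not geocoded.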