DigitalCommons / mykomap

A web application for mapping initiatives in the Solidarity Economy

[CWM] Design extension of NCBA map to include Co-ops UK Data #249

Closed: ColmDC closed this 1 month ago

ColmDC commented 2 months ago

Track in clockify under 'Cooperative World Map'

Background

As a first stage of the Coop World Map project we need to demonstrate combining a couple of data sets into one map. This map will serve as a demonstration before commencing the full project.

We need a map which includes all the co-ops in:

The number of filters supported in the NCBA map is already unwieldy, so we can't simply append the data and add a few more filters for the unique Co-ops UK fields. So we need to explore what merge rules we need to incorporate the Co-ops UK data, how we approach filtering, and what data appears in the pop-up dialog. Thus this issue is to create a proposal for how the datasets merge.

Note: the conference at which this will be unveiled is on 11th June, so aim to get this done by the end of the week before.

Description

Merging the categories will be as follows:

Then we can use the DemoMergeMap as the basis for the filters and for how the data will be displayed in popups.

Acceptance Criteria

wu-lee commented 1 month ago

I interpret this as the following, am I roughly correct?

lin-d-hop commented 1 month ago

I guess I'm confused. As both sets of data are already in MykoMaps I'm surprised that we can't use the MM format to merge them together. In the linked maps they look like the data is already aligned.

I feel like it would help me get my head around these complexities that I don't understand by looking at the actual datasets. Is that possible?

wu-lee commented 1 month ago

You can look in the data directory at this branch of demo-merge-map, which is currently where the merging is done for the demo-merge-map-plus case.

https://github.com/DigitalCommons/demo-merge-map/tree/demo-merge-map-plus/data

In there:

The latter all include the "standard" schema headings as follows, but each has some extra fields with information that doesn't fit into these headers, which depend on the dataset. The "standard" schema is defined/documented here.

  1. Identifier
  2. Name
  3. Description
  4. Organisational Structure
  5. Primary Activity
  6. Activities
  7. Street Address
  8. Locality
  9. Region
  10. Postcode
  11. Country ID
  12. Territory ID
  13. Website
  14. Phone
  15. Email
  16. Twitter
  17. Facebook
  18. Companies House Number
  19. Qualifiers
  20. Membership Type
  21. Latitude
  22. Longitude
  23. Geo Container
  24. Geo Container Latitude
  25. Geo Container Longitude

For the nitty-gritty of how the merge is done you can look at generate-db2.sh, although it might be painful to look at. The process took some time and experimentation to build up. It works so long as the assumptions it makes about the data hold true.

wu-lee commented 1 month ago

BTW If you want to inspect the sqlite database in the data/ folder mentioned above, I suggest installing sqlitebrowser from https://sqlitebrowser.org/ and using that on the downloaded file dc-ica-ncba-cuk.db. The output CSV file is a dump of the map_data view.
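
If you'd rather script the inspection than use a GUI, here is a minimal Python sketch of dumping a view to CSV with the standard sqlite3 module. The view name map_data comes from this thread; the toy table and its single row are made up so the snippet runs without the real file, and this is not how generate-db2.sh itself produces the CSV.

```python
import csv
import io
import sqlite3


def dump_view_to_csv(conn, view, out):
    """Write every row of `view` to `out` as CSV, header first; return the row count."""
    # Identifiers can't be bound as SQL parameters, so quote the name defensively.
    quoted = '"' + view.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Toy stand-in for the real database (hypothetical row, real view name).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orgs (Identifier TEXT, Name TEXT)")
demo.execute("INSERT INTO orgs VALUES ('demo/1', 'Example Co-op')")
demo.execute("CREATE VIEW map_data AS SELECT Identifier, Name FROM orgs")

buffer = io.StringIO()
rows_written = dump_view_to_csv(demo, "map_data", buffer)
print(buffer.getvalue())
```

Against the downloaded file you would pass `sqlite3.connect("dc-ica-ncba-cuk.db")` instead of the in-memory connection, and an output file opened with `newline=""`.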

ColmDC commented 1 month ago

I've created a spreadsheet here

Does this cover the choices @wu-lee @lin-d-hop ?

ColmDC commented 1 month ago

I've created a spreadsheet here...

(Note that this document only looks at the fields exported from the database.)

There are a few issues which make the co-ops uk data distinct from the other sources.

1) There are several co-ops which use the same .coop domain for their website. We need a policy decision on how to address this first, and then work out how to deal with it technically. (Some notes here.)

2) We don't as yet automatically download and process the Co-ops UK data. This needs to be done for the latest available data, involving some manual steps.

lin-d-hop commented 1 month ago

  1. Would it help to first just make a set of the co-ops in both sets? To understand the scale of overlap and whether there are simple rules that will get us close.

  2. This is an additional acceptance criterion not mentioned in the original issue. I've added it.

Moving this to dev ready so that it can be progressed in a call with @wu-lee and @ms0ur1s and hopefully myself.

wu-lee commented 1 month ago

I've reviewed and tidied the generate-db2.sh script, and inlined the dump-csv.sh script, so there is now only one step to generating the CSV file, which is called dc-ica-ncba-cuk.csv

I've then made a first cut of incorporating the CUK data, and committed all this to the end of the demo-merge-map-add-cuk-data branch. @ms0ur1s - you'll need to copy the CSV file out into the right place for the map you're creating. (Or put your work on the end of the above branch)

Note, I resolve the problem of various CUK orgs which share the same .coop domain for unknown reasons by simply ignoring any links to DC organisations and treating them as separate organisations, not linked to any DC, ICA or NCBA organisation. This might not in fact be true, but determining which case applies will require human eyeball intervention.
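
To give the "human eyeball" step a starting list, something like the following Python sketch could flag which domains are shared and by whom. It is only an illustration: the organisation IDs and domains are invented, and the real detection in generate-db2.sh may work quite differently.

```python
from collections import defaultdict


def shared_domains(org_domains):
    """Map each domain used by more than one organisation to the sorted IDs using it."""
    users = defaultdict(set)
    for org_id, domains in org_domains.items():
        for domain in domains:
            # Normalise case and whitespace so "Shared.coop" and "shared.coop" match.
            users[domain.strip().lower()].add(org_id)
    return {d: sorted(ids) for d, ids in users.items() if len(ids) > 1}


# Hypothetical CUK organisations and their website domains.
example_orgs = {
    "cuk/A": ["shared.coop"],
    "cuk/B": ["Shared.coop", "b.coop"],
    "cuk/C": ["c.coop"],
}

flagged = shared_domains(example_orgs)
# Organisations to keep unlinked until someone has reviewed them by hand.
unlinked = sorted({org for ids in flagged.values() for org in ids})
print(flagged, unlinked)
```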

(The database created for this has the same name, and can be inspected as described above)

ColmDC commented 1 month ago

Does this use the latest (2024-04) coops uk data?

ColmDC commented 1 month ago

Is this ball happily in your court now @ms0ur1s ? When might we have a first draft to try out?

wu-lee commented 1 month ago

Does this use the latest (2024-04) coops uk data?

Not yet, it's 2024-01 - but it will when I publish that.

ms0ur1s commented 1 month ago

Is this ball happily in your court now @ms0ur1s ? When might we have a first draft to try out?

All looking good, I'm using the new csv and showing the relevant coops uk data. No filtering as yet, but I just need a quick chat with @wu-lee.

I'm not about tomorrow, so eta sometime Thursday.

ColmDC commented 1 month ago

I'm not about tomorrow, so eta sometime Thursday.

It would be good to have this as soon as possible on Thursday, as we need time for QA, to fix any issues that arise, and then to show it to DotCoop on Friday.

ms0ur1s commented 1 month ago

It would be good to have this as soon as possible on Thursday, as we need time for QA, to fix any issues that arise, and then to show it to DotCoop on Friday.

No worries @ColmDC, @wu-lee will publish a version of the map so far, either later this evening or first thing tomorrow.

wu-lee commented 1 month ago

I've published it here:

https://dev.maps.solidarityeconomy.coop/qa/dc-ica-ncba-cuk/

That uses the CUK data from January. The latest CUK data is still generating - it's taking hours, which seems long even for CUK.

wu-lee commented 1 month ago

CUK data finished building, after 5 hours. Deployed it now. From the file timestamps, it looks like it's just the final index.html which takes all that time!?

wu-lee commented 1 month ago

An attempt to list questions which might help point to some next steps. In no particular order.

Data integrity questions:

Cross-dataset correlation questions, for those organisations which appear in several:

Objective questions:

ColmDC commented 1 month ago

I've published it here:

https://dev.maps.solidarityeconomy.coop/qa/dc-ica-ncba-cuk/

That uses the CUK data from January. The latest CUK data is still generating - it's taking hours, which seems long even for CUK.

Great. Taking a look now.

ColmDC commented 1 month ago

CUK data finished building, after 5 hours. Deployed it now. From the file timestamps, it looks like it's just the final index.html which takes all that time!?

I imagine it will be easy to optimise the index generation at some point. But not now.

ColmDC commented 1 month ago

An attempt to list questions which might help point to some next steps. In no particular order.

That is a nice long list. I'm chasing DotCoop for priorities. My own candidates are...

  1. Add French vocab, as it's a Canadian conference.
  2. Do a quick profiling of upload time. Are there any candidates for easy speed ups there?
  3. Can we use the new version that displays the # entities filtered?
  4. Can we make country the primary directory filter category?
  5. Can we move memberships in the popup to the bottom.

wu-lee commented 1 month ago

Add French vocab, as it's a Canadian conference.

Ok, much of the translations are in the vocab files borrowed from elsewhere. I've added French translations for the ones which aren't (UI specific terms).

Do a quick profiling of upload time. Are there any candidates for easy speed ups there?

Deferring this, as I'm not sure if there are any quick wins here. It requires profiling, which in itself is a fiddly thing, and any optimisations I could add would potentially also add bugs.

Maybe the map can be on a page loaded ahead of time if it's a conference, and simply revealed?

Can we use the new version that displays the # entities filtered?

Yes. Although the version will still say "3.1.6" pending a conferral with @rogup about tagging a new release.

Can we make country the primary directory filter category?

Country according to which dataset? There's no aggregated field so far. Will check how correlated they are.

Can we move memberships in the popup to the bottom.

Currently these "member of" fields intentionally precede fields from that data set. So "Member of ICA? [ICA fields here]. Member of NCBA? [NCBA fields here] ..."

Are you happy to disconnect these things? Possibly this is not clear and needs clarification anyway?

New version deployed with 1st and 3rd change included, at the URL above.

ColmDC commented 1 month ago

Quick off the draw. I was actually noting candidates rather than making requests there.

  1. Nice to see the French. Hope they don't complain that it's not Québécois!
  2. I imagine optimisation is not in the tweak category.
  3. Looks like something config missing in the layout of the search results panel. Can you check you see it too.
  4. ...Country according to which dataset? There's no aggregated field so far. Will check how correlated they are. PLEASE
  5. ...Are you happy to disconnect these things? Possibly this is not clear and needs clarification anyway? NEEDS A BIT MORE REFLECTION, YES

wu-lee commented 1 month ago

Re. point 4 - country correlation. They seem pretty correlated, if my SQL is up to scratch. (Applied to the latest DB, which currently isn't published, but corresponds to the current standard.csv here, which includes the latest CUK data.)

/*
Find cases where the non-null country IDs of organisations differ.

First get a table mapping organisation IDs to their country ID, from all datasets.
i.e. select id, cid for all cases where cid is not null

Then group by ID and count those with more than one entry (because the ID has been listed against more than one Country)

Here we include the Name field as a convenience
*/

select distinct count(*) as c, group_concat(cid) as cids, id, name from (
  select `DC Country ID` as cid, Identifier as id, Name as name  from map_data where cid is not null
  union
  select `ICA Country ID` as cid, Identifier as id, Name as name  from map_data where cid is not null
  union
  select `NCBA Country ID` as cid, Identifier as id, Name as name  from map_data where cid is not null
  union
  select `CUK Country ID` as cid, Identifier as id, Name as name  from map_data where cid is not null
) group by id having c > 1

Gets:

| c | cids | id | name |
|---|------|----|------|
| 2 | GB,HK | demo/plus/dc/KPv5dW | Platform Cooperativism Consortium Greater China |
| 2 | GB,JE | demo/plus/dc/W9yYYW | Channel Islands Co-operative Society |
| 2 | GB,IR | demo/plus/ica/1465 | Iran Chamber of Cooperatives (ICC) |
| 2 | PR,US | demo/plus/ica/41 | Liga de Cooperativas de Puerto Rico (LIGACOOP) |

So only these four non-correlated cases?

Query for selecting those:

select * from map_data where Identifier in ('demo/plus/dc/KPv5dW', 'demo/plus/dc/W9yYYW', 'demo/plus/ica/1465', 'demo/plus/ica/41')

Result is a bit large to paste here.
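
For anyone wanting to convince themselves of the query's logic without the real database, here is a small Python sketch that runs an equivalent query against a toy in-memory map_data table. The three rows are invented; the column names are the ones used in the query above. I've spelled out the column names in the WHERE clauses and used count(*) in HAVING rather than relying on aliases, which should behave the same in SQLite but is a bit more portable.

```python
import sqlite3

DATASETS = ("DC", "ICA", "NCBA", "CUK")

# One SELECT per dataset; UNION drops duplicate (cid, id, name) rows, so an
# organisation only gets more than one row if its country IDs actually differ.
QUERY = (
    "select count(*) as c, group_concat(cid) as cids, id, name from (\n"
    + "\n  union\n".join(
        f"  select `{ds} Country ID` as cid, Identifier as id, Name as name "
        f"from map_data where `{ds} Country ID` is not null"
        for ds in DATASETS
    )
    + "\n) group by id having count(*) > 1"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table map_data (Identifier text, Name text, "DC Country ID" text, '
    '"ICA Country ID" text, "NCBA Country ID" text, "CUK Country ID" text)'
)
conn.executemany(
    "insert into map_data values (?, ?, ?, ?, ?, ?)",
    [
        ("demo/a", "Alpha Co-op", "GB", None, None, "GB"),  # datasets agree
        ("demo/b", "Beta Co-op", "GB", None, None, "JE"),   # DC and CUK differ
        ("demo/c", "Gamma Co-op", None, "PR", "US", None),  # ICA and NCBA differ
    ],
)

# group_concat order is not guaranteed, so sort the country IDs per organisation.
conflicts = {
    org_id: sorted(cids.split(","))
    for _count, cids, org_id, _name in conn.execute(QUERY)
}
print(conflicts)
```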

wu-lee commented 1 month ago

Ok, a) I will soon be publishing the merge database on this URL, and b) here is a selected view of the uncorrelated orgs with some extra fields:

| c | cids | id | name | DC Country ID | ICA Country ID | NCBA Country ID | CUK Country ID | DC Domains | ICA Website | NCBA Domain | CUK Website | DC Name | ICA Name | ICA Street Address | ICA Locality | ICA Territory ID | NCBA Name | CUK Name | CUK Street Address | CUK Locality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | GB,HK | demo/plus/dc/KPv5dW | Platform Cooperativism Consortium Greater China | HK | | | GB | platformhk.coop | | | http://www.platformhk.coop | Platform Cooperativism Consortium Greater China | | | | | | Hong Kong Platform Co-operative | Crown House 27 Old Gloucester St | London |
| 2 | GB,JE | demo/plus/dc/W9yYYW | Channel Islands Co-operative Society | GB | | | JE | ci.coop;cics.coop;cicstest.coop;channelislands.coop;ci-memberportal.coop | | | http://www.channelislands.coop/ | Channel Islands Co-operative Society | | | | | | The Channel Islands Co-operative Society | Co-operative House 57 Don Street | St Helier |
| 2 | GB,IR | demo/plus/ica/1465 | Iran Chamber of Cooperatives (ICC) | GB | IR | | | stories.coop;registry.coop;iran.coop;paulinegreen.coop;identity.coop;domain.coop;domains.coop;directory.coop;bobburlton.coop;directorio.coop;webuildtogether.coop | http://www.iran.coop | | | Domains.coop Limited | Iran Chamber of Cooperatives (ICC) | NO.83, Sepahbod Gharany Ave. PO Box 1583614111 | Tehran | IR | | | | |
| 2 | PR,US | demo/plus/ica/41 | Liga de Cooperativas de Puerto Rico (LIGACOOP) | PR | PR | US | | liga.coop | http://www.liga.coop | liga.coop | | Liga de Cooperativas de Puerto Rico | Liga de Cooperativas de Puerto Rico (LIGACOOP) | P.O. Box 360707 | San Juan | PR | Liga de Cooperativas de Puerto Rico | | | |

wu-lee commented 1 month ago

From that I think I conclude:

I'm not sure if we should patch these amendments into our data somehow, and/or just use the Country ID from (in order of preference): CUK; ICA; DC; NCBA.

wu-lee commented 1 month ago

I've gone for the latter for now, which seems adequate.

Data updated, see CSV / SQLite3 database published.

Map altered to include the new Country ID field, and use it for the directory. Published.
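
For reference, the fallback rule amounts to "take the first non-empty value in preference order". A minimal Python sketch of that idea follows; the helper name is made up and the real merge is done in SQL in generate-db2.sh, so treat this as an illustration only. The sample rows reuse the country IDs of the four uncorrelated orgs listed above.

```python
# Preference order stated above: CUK; ICA; DC; NCBA.
COUNTRY_PREFERENCE = (
    "CUK Country ID",
    "ICA Country ID",
    "DC Country ID",
    "NCBA Country ID",
)


def first_available(row, fields):
    """Return the first non-empty value among `fields` in `row`, or None if there is none."""
    for field in fields:
        value = row.get(field)
        if value is not None and value.strip():
            return value
    return None


# Country IDs of the four uncorrelated organisations from the table above.
rows = {
    "demo/plus/dc/KPv5dW": {"DC Country ID": "HK", "CUK Country ID": "GB"},
    "demo/plus/dc/W9yYYW": {"DC Country ID": "GB", "CUK Country ID": "JE"},
    "demo/plus/ica/1465": {"DC Country ID": "GB", "ICA Country ID": "IR"},
    "demo/plus/ica/41": {
        "DC Country ID": "PR",
        "ICA Country ID": "PR",
        "NCBA Country ID": "US",
    },
}

combined = {org: first_available(row, COUNTRY_PREFERENCE) for org, row in rows.items()}
print(combined)
```

Note that under this ordering the Platform Cooperativism Consortium Greater China entry resolves to GB (the CUK value) rather than HK.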

wu-lee commented 1 month ago

3. Looks like something config missing in the layout of the search results panel. Can you check you see it too.

Not sure if I understand what you mean here?

ColmDC commented 1 month ago

  3. Looks like something config missing in the layout of the search results panel. Can you check you see it too.

Image

ColmDC commented 1 month ago

  3. Looks like something config missing in the layout of the search results panel. Can you check you see it too.

After a forced cache refresh I am now struggling to repeat it.

Let's leave it. If it reappears I'll create a fresh ticket for it.

wu-lee commented 1 month ago

That looks different to what I see. I don't have bullets; also no icons at the top of the black panel.

Marcel and I were thinking we should have a quick call to discuss this and any other next steps? (See Element)

rogup commented 1 month ago

R.e. the changes not appearing without a hard refresh, I think we should implement this fix on Mykomap to solve the problem: https://github.com/DigitalCommons/land-explorer-front-end/pull/220/files

I'll make a ticket

wu-lee commented 1 month ago

Ok, there's an update published - same URL as above.

As discussed yesterday, @ms0ur1s and I have stripped down the fields to the minimum necessary, including removing the description and stripping off the duplicated base URI in front of the primary activity values. This brings the size of the CSV down from ~8.5 MB to ~2.2 MB, and loading does seem to be faster (although still on the order of 8-10s).
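
To illustrate the base URI stripping: every value in the column carries the same prefix, so dropping it saves one copy of the prefix per row. A rough Python sketch, where the base URI and the activity codes are placeholders rather than the real vocabulary values:

```python
# Hypothetical base URI; the real vocabulary URI is not shown in this thread.
BASE_URI = "https://example.org/vocab/activities/"


def strip_base_uri(value, base_uri=BASE_URI):
    """Drop `base_uri` from the front of `value`; leave other values untouched."""
    if value.startswith(base_uri):
        return value[len(base_uri):]
    return value


def bytes_saved(values, base_uri=BASE_URI):
    """Total UTF-8 bytes removed by stripping `base_uri` from each value."""
    return sum(
        len(v.encode("utf-8")) - len(strip_base_uri(v, base_uri).encode("utf-8"))
        for v in values
    )


# Made-up activity codes; the last one is already stripped.
values = [BASE_URI + "AM60", BASE_URI + "AM20", "AM10"]
stripped = [strip_base_uri(v) for v in values]
saved = bytes_saved(values)
print(stripped, saved)
```

Whatever consumes the CSV then needs to know the base URI in order to reconstruct the full term URIs.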

The main filterable categories are Country, and Primary Activity. The latter is new, composed from the Primary Activity fields from the various datasets by selecting the first available from ICA -> CUK -> DC (there's no NCBA value; CUK and DC both only have a hackily synthesised value).

Also filtered are the membership fields. The pop-up shows these.

The pop-up also shows the geocoded address field, which like the other location fields is now selected in preference from ICA -> CUK -> DC -> NCBA. (I don't remember the exact ordering if any from the meeting but my reasoning is that the NCBA addresses are the worst, ICA the best if available, CUK quite good but has a lot of organisational names in them which is bad, DC comprehensive but so-so and orgs still conflated with domain registrants).

The pop-up's size has been adjusted to avoid wrap and scroll, without being too big.

French language support has been added.

Mykomap 3.1.7 has been used so that the organisation totals are shown in the filter panel.

Website links are shown in the pop-up, these are selected from ICA -> NCBA -> CUK. (DC has no website, just domains, which often don't correspond to websites; likewise NCBA, which means it shouldn't be included in that list above, but it doesn't matter as they're always null). DC Domains are shown separately, and the NCBA domain is not, but if present will always match one of DC's.

wu-lee commented 1 month ago

@wu-lee Is it because the map has custom CSS that is overriding Mykomap's styling?

Do you mean the difference between what Colm saw and what I did? Possibly, but with some interplay with caching, as you say.

Also, r.e. the changes not being pulled without a hard refresh, I think we should implement this fix on Mykomap to solve the problem: https://github.com/DigitalCommons/land-explorer-front-end/pull/220/files

This is something I have already tried to fix, more than once and some time ago, but the usual solutions assume webpack is generating the HTML which includes the JS/CSS files, and inserting a digest in the name of the JS/CSS files it generates. (They have to correlate.) In this case the problem is that the HTML can't easily be regenerated by webpack, because it can in general have all sorts of custom bits and pieces in it we don't want to clobber. Whatever does it would have to find and munge the existing includes in the HTML source code of an arbitrary file or files (not necessarily just index.html), using a rule which allows it to spot the right filename(s) to alter. Also the HTML author would have to know about this and not break that rule by using a filename which confuses the munger.

Working around that just got a bit fraught with complication, and so far I haven't found a satisfactory way to do it vs the time it was taking to research / implement.

Does your solution for landexplorer get around this issue?
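
To make the "find and munge the existing includes" idea concrete, here is a rough Python sketch. It takes a different route from the webpack approach: instead of renaming the JS/CSS files it appends a `?v=<digest>` query string to matching src/href attributes, so the HTML is edited in place and everything else in it is left alone. The file names are hypothetical, this is not something Mykomap or land-explorer currently does as far as this thread shows, and a regex-based rewrite like this would still be fooled by unusual markup.

```python
import hashlib
import re


def content_digest(data, length=8):
    """Short hex digest of the asset content, used as the cache-busting token."""
    return hashlib.sha256(data).hexdigest()[:length]


def bust_cache(html, assets):
    """Append ?v=<digest> to src/href attributes that reference a known asset.

    `assets` maps the path exactly as written in the HTML to the file's bytes.
    Attributes pointing at anything else are left untouched.
    """
    def rewrite(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        path = url.split("?", 1)[0]  # drop any previous ?v=... token
        if path not in assets:
            return match.group(0)
        return f"{attr}={quote}{path}?v={content_digest(assets[path])}{quote}"

    return re.sub(r"""\b(src|href)=(["'])(.*?)\2""", rewrite, html)


# Hypothetical page and asset contents.
page = (
    '<script src="map-app.js"></script>'
    '<link href="style.css?v=old" rel="stylesheet">'
    '<a href="about.html">About</a>'
)
assets = {"map-app.js": b"console.log(1)", "style.css": b"body{}"}

updated = bust_cache(page, assets)
print(updated)
```

Running it again on its own output gives the same result, so it could be applied on every deploy. The HTML author would still need to stick to plain quoted src/href attributes for the includes, which is the kind of rule mentioned above.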

wu-lee commented 1 month ago

A comment on Mykomap speed with large datasets.

Reducing the size of the CSV seems not to be the main issue, as checking in the network tab of my browser I see a 2MB file takes only ~370ms to load, so an 8MB file (which is what it was before we stripped the superfluous fields and the descriptions out) would still only be about a second or so. (There are 12k rows in that currently - of course, this is still an order of magnitude or two lower than our stated ambitions.)

So 1/3 second to load the data, whereas the map spinner is still spinning after 8-10 seconds.
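
The back-of-envelope arithmetic, as a small Python sketch. It assumes download time scales linearly with file size, which is a simplification (it ignores latency, compression and caching):

```python
# Figures observed in the browser's network tab, as quoted above.
REF_SIZE_MB = 2.0
REF_SECONDS = 0.37


def estimated_download_seconds(size_mb):
    """Scale the observed download time linearly with file size (an assumption)."""
    return size_mb * REF_SECONDS / REF_SIZE_MB


estimate_8mb = estimated_download_seconds(8.0)

# Share of the 8-10 second spinner time accounted for by the 2MB download.
share_of_10s = REF_SECONDS / 10.0
share_of_8s = REF_SECONDS / 8.0

print(f"8MB estimate: {estimate_8mb:.2f}s")
print(f"download share of spinner time: {share_of_10s:.1%} to {share_of_8s:.1%}")
```

So even on the pre-trimming file size, the download would account for well under a fifth of the wait, and on the current file it is only a few percent.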

I had a quick go at profiling this. Which certainly generates a lot of information (see screenshot, showing 27% out of ~50% of the sampled time spent in addLayer, a Leaflet marker function), not all of which is easy to interpret, but the main thing I think I can establish is that a significant amount of time is spent in the parsing and creation of map markers. Creating markers involves creating DOM elements, which is known to be relatively slow and is what the marker clustering we use is designed to minimise. But it's still slow... that article on optimising Leaflet mentioned in one of the Element channels (this one) touches on this, and recommends using a different clustering plug-in. Other sources on the internet suggest using canvas-rendered maps, or WebGL (2, 3, 4, 5).

Screenshot_20240607_114551

ColmDC commented 1 month ago

V useful observations. Should we move them into one of the tickets Lynne has just created?

ColmDC commented 1 month ago

For the map for next week, does that mean we could probably add back in a few fields without having much impact? Candidates would be description, Org Type, Typology and Co-ops UK's own Sector - Simplified, High Level?

ColmDC commented 1 month ago

The title for the filter for the Co-ops UK data says "Co-ops UK Member", which is not right. Can we change it to "Co-ops UK Open Data - Données ouvertes de Co-ops UK"?

wu-lee commented 1 month ago

For the map for next week, does that mean we could probably add back in a few fields without having much impact? Candidates would be description, Org Type, Typology and Co-ops UK's own Sector - Simplified, High Level?

[Assuming "Org Type" means the ESSGLOBAL "Org Structure" category, and not one of CUK's categories]

The Description field is a field which was there before, and restoring it is literally reverting some code.

On the other hand, the two ESSGLOBAL fields were only present in the per-dataset forms, with the fields being originally sourced data only for ICA. In DC and CUK, these fields are just tentatively inferred, derived from the approximately-equivalent categories in those datasets. I will need to do the same trick of creating a new combined field for each case, using a fallback rule. Then I could include these fields sensibly, using a single filter drop-down for each, rather than one per dataset.

Finally, the latter field "Sector - simplified" is inherently CUK only, so just a reversion [edit: except that I see it wasn't there], but: it isn't represented as a proper vocab with IDs and translated phrases - it's just an "ad-hoc" vocab currently. Which means we can't put it in the filters currently, as the current Mykomap version doesn't support it. [edit - nor will it be localised in the pop-up]

wu-lee commented 1 month ago

Ok, an update:

The CSV file now takes about 0.5 seconds to download. The map still takes tens of seconds to render fully.

Will look into adding composite fields for the two ESSGLOBAL vocabs next.

https://dev.maps.solidarityeconomy.coop/qa/dc-ica-ncba-cuk/

wu-lee commented 1 month ago

Ok, I've now added combined versions of Organisational Structure, and Typology. Selecting the relevant fields from, in order of preference, ICA -> CUK -> DC datasets. (NCBA dataset doesn't have these fields)

These are shown both on the pop-up and the filters.

ColmDC commented 1 month ago

Nice. Okay. Closing this ticket as they will start doing demos today. Any new work on this should get a new ticket.