SimonbJohnson / quickX3

HXLDash. Create data visualisations quickly by leveraging the Humanitarian Exchange Language (HXL)
https://hxldash.com/
MIT License

Necessary steps to implement a new country without p-codes (like Brazil) #66

Closed · fititnt closed this issue 3 years ago

fititnt commented 4 years ago

TL;DR: this question is both about how to add the geometry to hxldash and about how to choose P-codes for countries that do not yet have P-codes from OCHA, but may already have well-established codes outside the humanitarian sector. Such P-codes are also more resilient than IDs based on the descending order of city names, or on the names themselves (which may be either in English or in the local language and contain characters like ç).


I just discovered hxldash.com yesterday while looking for visualizations for datasets following the HXL Standard. In our case, we're looking to eventually implement visualizations for some CPLP countries. hxldash already seems to support 2 CPLP countries: AGO/Angola and MOZ/Mozambique. The other two large CPLP countries for which hxldash does not yet have an optimized way to implement are Portugal and (nearly continental-sized) Brazil.

Looking at both the HXL Standard and the hxldash source code, it seems that the best way to document a location would be to use P-codes (I'm actually new to the HXL Standard, so I'm not sure whether already having P-codes is a requirement for using hxldash or just something that makes it easier and could be replaced by other, more verbose means; if there are other ways, we're also interested to know!).

As for this issue, I will first consider just Brazil, because in the very short term I already know where to get the geometries and am also aware of well-established internal codes for referring to places. For Portugal and the other CPLP countries I would be less confident.

Point one - the geometry optimized for the web

Question 1.1: is the format used topojson?

Question 1.2: what minimum metadata does the geometry need to be used in hxldash?

Using Brazil as a reference:

In particular for the city geometry, I would need to regenerate it from the updated shapefiles, since the gis-dataset-brasil repository is not up to date.

Point two - P-codes to use in hxldash

I suppose the first step in looking for P-codes would be to try to find already-established codes.

This is an example of codes that are already very explicitly defined:

Then we have Brazil. https://data.humdata.org/ actually has shapefiles (here: https://data.humdata.org/dataset/brazil-administrative-level-0-boundaries) from admin0 to admin3. But while these files have the names of the regions in Portuguese, they don't explicitly set a P-code.

At least using Brazil as a reference, for what data.humdata.org already has (BRA_admin1, BRA_admin2, BRA_admin3), the IBGE (https://www.ibge.gov.br/, https://en.wikipedia.org/wiki/Brazilian_Institute_of_Geography_and_Statistics) already has a unique identifier for each of these levels.

For a quick list of IBGE codes, see https://www.ibge.gov.br/explica/codigos-dos-municipios.php.

But, in short, as an example, the codes for a city, from the state level down to the other divisions, would be:

  • 31 is the code for the state of "Minas Gerais" (an alternative is "MG")
  • 3107 is the code for the mesorregião "Metropolitana de Belo Horizonte"
  • 31030 is the code for the microrregião of "Belo Horizonte"
  • 3106200 is the code for the (capital) city of "Belo Horizonte"

The IBGE codes are used in a huge number of internal systems in Brazil, so any change is likely to correspond to an actual change in region borders. Old codes tend not to be reused. Using the IBGE codes as a quick start for P-codes for Brazil also has the advantage that every shapefile released each year carries these same codes.

If using the https://data.humdata.org/dataset/brazil-administrative-level-0-boundaries approach, it seems that for #adm1 it uses `HASC_1: BR.MG` for Minas Gerais state, while for the city of Belo Horizonte `HASC_2` is empty and `ID_2: 1606` is just a sequential ID based on descending order (very likely to change just by adding/removing a city).


My initial proposal

So, just to finish my initial thought here: to generate the geometries for hxldash, it seems a good idea to reuse the IBGE codes with a country prefix (I'm not sure whether most P-codes use the 2-letter or the 3-letter ISO code). Assuming most P-codes use the 2-letter ISO code, this would mean, for this example, #adm1: BR31 and #adm2: BR3106200.

If other P-codes already use ISO3, or there is a trend towards ISO3, then in this example it would be #adm1: BRA31 and #adm2: BRA3106200.
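Just to make the proposal concrete, here is a purely illustrative sketch of what I mean by "IBGE code with a country prefix" (the function name and sample codes are mine, not anything that exists in hxldash):

```python
# Purely illustrative sketch of the proposed P-code construction:
# an ISO country prefix prepended to the IBGE code, nothing else changed.
def ibge_to_pcode(ibge_code: str, iso_prefix: str = "BR") -> str:
    """Build a P-code by prepending an ISO country prefix to an IBGE code."""
    return f"{iso_prefix}{ibge_code}"

# Examples based on Minas Gerais / Belo Horizonte:
assert ibge_to_pcode("31") == "BR31"            # adm1 (state)
assert ibge_to_pcode("3106200") == "BR3106200"  # adm2 (municipality)
assert ibge_to_pcode("31", "BRA") == "BRA31"    # if ISO3 were preferred instead
```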

If you are OK with this, I could try to help generate the geometry files and open a pull request. But how the "admin2Pcode" value is decided is definitely something I would ask for opinions on first, and I would give strong reasoning for using the IBGE codes after the country prefix.

SimonbJohnson commented 4 years ago

Hi Fititnt, answers below:

Question 1.1: is the format used topojson?

Yes, the format is topojson. If the topojson is above 1MB, I try to simplify the geometry to reduce the file size. Happy to help with any part of this process.

Question 1.2: what minimum metadata does the geometry need to be used in hxldash?

Metadata for the file is included here: https://github.com/SimonbJohnson/quickX3/blob/master/hxldash/static/libs/hxlbites/hxlBitesMap.js#L67

  • iso3: ISO3 code - BRA
  • iso2: ISO2 code - BR
  • use: which ISO code is used in the geometry file - BR
  • url: URL of the file - /static/geoms/topojson/{{country}}/{{level}}/geom.json
  • adjustment: redundant variable
  • code_att: P-code attribute - admin{{level}}Pcode
  • name_att: name attribute - admin{{level}}Name
  • levels: levels available for the country

Some of these variables have been standardised or become redundant. Previously I pulled live from external web services, but decided to pull the geom formatting into the project to have more control.

There are only two fields needed within the file: the admin name and the P-code. The fields take the headings admin{level number}Name and admin{level number}Pcode, e.g. admin1Name and admin1Pcode.
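For example, a quick way to sanity-check a candidate file for those two fields (a rough sketch only; the file path and admin level are placeholders, not anything hxldash ships):

```python
# Minimal sketch: check that every geometry in a topojson file carries the
# two properties hxldash needs at a given admin level.
import json

LEVEL = 2            # admin level to check (assumption for illustration)
PATH = "geom.json"   # hypothetical local topojson file

with open(PATH) as f:
    topo = json.load(f)

name_att = f"admin{LEVEL}Name"
code_att = f"admin{LEVEL}Pcode"

# Topojson stores features under objects -> <layer> -> geometries,
# each with a "properties" dict.
for layer in topo["objects"].values():
    for geom in layer.get("geometries", []):
        props = geom.get("properties", {})
        if name_att not in props or code_att not in props:
            print("missing fields:", props)
```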

Question 2: Your proposal is spot on and aligns with the current method for setting up a new country (adopting any in-place coding system and prepending the ISO2 code).

I have capacity to help with this to get it up and running quickly if you need. Thanks for looking into it!

fititnt commented 4 years ago

Perfect! I guess we will take at least 48 to 72 hours to get back to this topic. Let me explain.

We will also take the opportunity to start a draft of a dockerized hxldash for local testing/development at covid-taskforce-cplp/hxldash-docker, so it may make it easier for our peers to also take on some frontend tasks and quick fixes in the next weeks.

From previous experience with uwazi-docker (and the hxldash stack is very likely less complex than Uwazi, even if we have to ship the hxl-proxy together with it; for reference, Uwazi requires NodeJS, some OS dependencies, MongoDB and ElasticSearch), the Docker part may actually take less time than explaining to my peers how to install the HXL stack would, if they don't already know Python and haven't read all the documentation.

The only downside of an hxldash-docker is that it is definitely not optimized for low bandwidth. But at some point it may be possible to document a step where the user goes to a fast/unmetered internet connection, downloads and builds the Docker images, and then does near- or fully-offline editing.

Some automated full-stack testing may actually not be too far from near-offline development. Not that this can be done in a weekend, but it is actually possible. I'm not sure if something already exists to emulate the Google Spreadsheets API rather than simply caching some results, but for reference there are ways to even emulate the full S3 API with software like https://github.com/jubos/fake-s3. Just so you get the idea: if in the next weeks you could have some command that populates the database with fake but predictable data kept in the same repository (a command to run after python manage.py migrate on a clean install), it would be perfect for automating things.
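Just to illustrate the kind of command I mean (a rough sketch only; the app label, the Dashboard model and its fields are hypothetical, not the real hxldash code):

```python
# Hypothetical file: hxldash/management/commands/populate_sample_data.py
# Seeds the database with small, predictable sample data after
# `python manage.py migrate`, so the interface and tests have something to show.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Seed the database with small, predictable sample dashboards"

    def handle(self, *args, **options):
        # Imported lazily so the command fails with a clear message if the
        # assumed model does not exist in this codebase.
        from hxldash.models import Dashboard  # hypothetical model

        samples = [
            {"title": "BRA adm1 sample", "data_url": "sample_data/bra_adm1.csv"},
            {"title": "BRA adm2 sample", "data_url": "sample_data/bra_adm2.csv"},
        ]
        for sample in samples:
            obj, created = Dashboard.objects.get_or_create(**sample)
            self.stdout.write(f"{'created' if created else 'kept'}: {obj}")
```

It would then be run as `python manage.py populate_sample_data` right after a clean install.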

SimonbJohnson commented 4 years ago

This all sounds great, thank you so much. In regards to the HXL Proxy and the Google Sheets API, it should be reasonably easy to write an alternative offline approach. The HXL Proxy is mainly used to access online data sources and format the data. Most of its functionality is not utilised here, so it is possible for me to write a small equivalent to load CSVs for the offline environment. I will take a first look this weekend. The populate-database command could then have an offline/online version.

This also fits in with future ideas: we want to support drag and drop of data files.

fititnt commented 4 years ago

fititnt, 2 days ago: "Perfect! I guess we will take at least 48 to 72 hours to get back to this topic."

I started discussions with someone from our group to review the Brazilian P-codes (and also the most up-to-date shapefiles and IBGE codes). This may take more time, but may be good in the medium term, since it may get more people onto this topic.

SimonbJohnson, 1 hour ago: "This also fits in with future ideas: we want to support drag and drop of data files."

This actually is a big feature.

I think I will open a different issue just to discuss the part about a command, run after python manage.py migrate, that can actually add some sample data. There I may explain the simplest minimum viable product for adding sample data, which may be useful both for whoever is testing the interface (adding CSS, JS, testing small changes) and, later, as the base for automated testing.

Edit: by "This actually is a big feature" I mean that the CSV upload may take more time to build, and in the new issue I could explain that it is more about adding data just after installing the software.

Rydela commented 4 years ago

Hello @fititnt, just to note: I'm currently creating these P-codes for Brazil, with a lot of help from your guidance. I'm currently reviewing whether to go ahead and implement 3 levels, since a district level in between state and city may eventually be needed.

After that, I'll add Brazil to the hxldash repository with a topojson file to go along with it.

fititnt commented 4 years ago

@Rydela sounds great to me. It could take me some weeks to prepare someone to get involved. I'm OK with only doing a quick review of the topojson files (like taking some samples and doing a full review of those).

Some quick links

Shapefiles per level from IBGE

Note: all these shapefiles come with a database file that already has the IBGE codes, so there is a way to generate the files with this metadata; following the previous discussion, this means prefixing the IBGE codes with "BR". But beyond the conversion and the optimization of the final size, there is the problem of deciding which IBGE level of division maps to which adm1/2/3/4/+ level.

Rydela commented 4 years ago

Brazil UF/ADM1 is finished. I've also finished Municipio/ADM2, but the .topojson is still quite large. I'm worried that if I simplify the file any more the areas will lose a lot of their shape, but I may have to do that. The current size is 8.8MB, which is a lot larger than we normally want.

fititnt commented 4 years ago

Great. 8.8MB (as the raw file size) could still be improved.

Question: is "Municipio" (e.g. city) adm2 or adm3?

I'm asking just to be sure which adm level corresponds to the city level.

About size

I believe we still have room to reduce the size to a final file size of around 3.6MB (compressed: <700kB) or even 1.5MB (compressed: <350kB).
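As a rough way to experiment with how much simplification the shapes tolerate (this is not the pipeline used for the files referenced below, which were generated with the topojson CLI; the shapefile path is just the IBGE municipality file, assumed to be in geographic coordinates), something like this with geopandas works:

```python
# Minimal sketch: write out the municipality layer at a few simplification
# tolerances, so the resulting file sizes and shapes can be compared by eye.
import geopandas as gpd

gdf = gpd.read_file("Munic.shp")  # hypothetical local copy of the IBGE shapefile

for tolerance in (0.001, 0.005, 0.01):  # degrees, assuming lat/lon coordinates
    simplified = gdf.copy()
    simplified["geometry"] = simplified.geometry.simplify(
        tolerance, preserve_topology=True
    )
    out = f"municipio-simplified-{tolerance}.geojson"
    simplified.to_file(out, driver="GeoJSON")
    print(out, "written; convert to topojson and compare the size")
```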

Around 7 years ago I worked on a project related to a digital atlas of data about the solidarity economy. While the original files are likely to be outdated for at least a few of the 5,000+ cities, I remember testing the ideal file sizes with other people.

Less than or equal to this precision

I think that anything with higher precision than what was used for the file "municipio.json" in this folder, https://github.com/fititnt/gis-dataset-brasil/tree/master/municipio/topojson, is very unlikely to add value.

Here is one example that uses it (GitHub does not give a preview for files larger than 1 MB):

[Screenshot: 2020-05-04 00-01-56]

Image from http://atlas.sies.org.br/?q=are14&l=0&g=municipios (click on "Ver gráficos")

topojson ../shapefile/Munic.shp -o municipio.json --id-property=+GEOCODIGO --p name=NOME,uf=UF,codigo=+GEOCODIGO,regiao=REGIAO,meso=MESOREGIAO,micro=MICROREGIA

Not less than this precision

Just to give an example of a lower precision that is not worth using, here is this reference:

[Screenshot: 2020-05-04 00-11-42]

https://github.com/fititnt/gis-dataset-brasil/blob/master/municipio/topojson/municipio.min-sn-q250.json

topojson ../shapefile/Munic.shp -o municipio.min-sn-q250.json --id-property=+GEOCODIGO -q 250

fititnt commented 4 years ago

@SimonbJohnson @Rydela just to mention, I'm likely to eventually open an issue about documenting how to implement Brotli compression at the server level. I guess we're likely to save at least 10% compared to gzip alone.

I just haven't done it yet because I will do some additional research, or maybe test on some private client first to check some corner cases, but a project like HXL Dash is the perfect case for such an implementation.

But in the case of Brotli compression, this is likely to be documentation on extra configuration for NGINX or Apache, not a change to the HXLDash source code.
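Just to make the "at least 10%" estimate easy to check locally (a rough sketch, assuming `pip install brotli` and some local topojson file; the path is a placeholder):

```python
# Minimal sketch: compare raw, gzip and Brotli sizes of a topojson file
# before touching any server configuration.
import gzip

import brotli

with open("geom.json", "rb") as f:  # hypothetical topojson file
    raw = f.read()

gz = gzip.compress(raw, compresslevel=9)
br = brotli.compress(raw, quality=11)

print(f"raw:    {len(raw) / 1024:.0f} kB")
print(f"gzip:   {len(gz) / 1024:.0f} kB")
print(f"brotli: {len(br) / 1024:.0f} kB "
      f"({100 * (1 - len(br) / len(gz)):.1f}% smaller than gzip)")
```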

Rydela commented 4 years ago

@fititnt Municipio is currently ADM2, with plans to make distrito ADM3 if needed. Thank you for the spatial data links, I'll check them out.

SimonbJohnson commented 4 years ago

@fititnt - I've not heard of Brotli compression before.

It would be great to implement on the main server https://github.com/google/brotli

SimonbJohnson commented 4 years ago

@Rydela - I've updated the production server https://hxldash.com/. Can you check that Brazil ADM1 is working as expected? Cheers

fititnt commented 4 years ago

@SimonbJohnson sure! In the next days I will do a demo with BR ADM1 and maybe already leave a spreadsheet with ADM2.

Rydela commented 4 years ago

@SimonbJohnson I can confirm Brazil ADM1 is working as expected. Test dashboard below:

https://hxldash.com/view/329

[screenshot of the dashboard]

I will work on ADM2 next.

fititnt commented 4 years ago

@SimonbJohnson @Rydela just a quick update on Brotli: I'm upgrading some production servers for a private client, so I will run some tests to see if this can break anything. But depending on how the hxldash.com servers are configured, enabling it is likely not to require more complex configuration.

One example I'm not 100% sure about is how intermediate caching servers (something that a school, a company or an internet service provider could add to speed things up and save bandwidth) might, for example, decide to cache Brotli-encoded content and serve it even to a browser that does not support Brotli.


hxldash.com/view/329 seems perfect. Even the 'á' and 'ã' are working as expected.

fititnt commented 4 years ago

TL;DR: if using Apache with nothing in front of it that could cache, it should work out of the box. Apache would only send Brotli-encoded content if the end user's browser says that it accepts 'br' (e.g. Accept-Encoding: gzip, deflate, br for Chromium-based browsers).


Just a quick response until I come back with more testing: if the web server is Apache (the headers of hxldash.com show Server: Apache/2.4.29 (Ubuntu)) and the version is 2.4.26 or later, it is likely to be easier to implement.

At this very moment I can confirm that, without an extra compilation step, NGINX Open Source does not support Brotli out of the box (NGINX Plus does, but it is very expensive, even for some enterprise users). Also, Varnish (either the open source or the paid version) does not have any way to encode to Brotli. The good news is that well-behaved cache proxies, which since the old days when some browsers did not support gzip/deflate have had to differentiate the cache by encoding, should most of the time also work fine with the newer Brotli.

I'm even considering eventually adding Apache behind NGINX/Varnish just to avoid compilation for private clients that would otherwise just use NGINX. (Also, Apache supports mod_pagespeed without compilation, and this is a huge plus if you don't have a dedicated team just to deal with infrastructure, but this is another discussion; definitely mod_pagespeed can make additional space savings that are very optimized for the end user, at the cost of some CPU increase.)

Rydela commented 4 years ago

As requested by @SimonbJohnson in the recent pull request, here are adm2 levels being visualised in HXLDash:

[screenshot of the adm2 preview]

It's on my local server, so I did not submit/POST the creation of the dashboard, but you can see it works in the preview.

With adm1 & adm2 for Brazil complete, the last step is to add it to the list of countries that are available on the HXL Dash website. After that, perhaps this issue can be closed and I'll create documentation on how to add new countries.

SimonbJohnson commented 4 years ago

That's great. Thanks Ryan. I'll pull the changes onto the server this week.

fititnt commented 4 years ago

and I'll create documentation on how to add new countries.

Fantastic!

SimonbJohnson commented 4 years ago

I've updated the server. If someone could check BRA adm2 that would be great, as I don't have access to a test data set. I'll leave this issue open for Ryan to share the documentation on adding new countries.

Rydela commented 4 years ago

Hey @SimonbJohnson,

I've linked the test dashboard here: https://hxldash.com/view/341

Here is a screenshot:

[screenshot of the adm2 dashboard]

I'm working on writing up the documentation now.

SimonbJohnson commented 4 years ago

That works great, and the file is showing as around 3MB currently.

fititnt commented 3 years ago

Thanks everyone! It was merged months ago, but in addition to closing this issue, I will add some notes.

This issue was started because of @covid-taskforce-cplp, but part of us liked the HXL Standard so much that we eventually founded @HXL-CPLP as part of a hackathon (which we failed, HXL-CPLP/forum#16, but the work of HXL-CPLP was at least bootstrapped).

At this point, part of us from @HXL-CPLP (mostly me) are in direct contact with the international HXL community (including the people who helped define the HXL Standard), and even if not in the very short term, we're already trying to bring more people from CPLP to cooperate with the HXL ecosystem.

Eventually Covid-19 (and the main point of @covid-taskforce-cplp) is likely to be just history, but the part related to HXL (while initially it was just about finding some way to have nice dashboards) is actually evolving.

On preparedness

I think one strong lesson, which cannot be framed as a software issue, is "preparedness".

I mean, there was a huge open source community effort to help with the Covid-19 crisis here in Brazil, and a massive number of software developers willing to cooperate. As a reference:

The brasil.io / turicas/covid19-br

Just to explain my initial interest in HXL: the most successful project that got working (while maybe not getting the credit it deserved) was this one, mostly coordinated by @turicas.

Of all the collaborative work that happened around Covid-19 in Brazil, if something was really successful and was used by so many other projects, even when the local government failed to compile data, it was this one. Brasil.io used Google Spreadsheets to get human volunteers to update the data when automated crawlers for data mining did not yet exist.

I'm mentioning the Brasil.io success not because I was part of their team during this crisis, but because at some point in @covid-taskforce-cplp (and given that @EticaAI was already also working in Mozambique and Angola, and that The Humanitarian Data Exchange (https://data.humdata.org/) was a great place to get structured data, sometimes even more easily than in Brazil), even if starting to use HXL tools could not work in the very short term, in the worst-case scenario I at least knew that the concept of using live spreadsheets works.

Future

I'm not sure about the future, but even if things take more time than initially planned (and this is very common), looking at it from today, this is not necessarily negative.

For example, I know that the software developers in particular from @covid-taskforce-cplp (the nearly 200 members of that private Facebook group) may be tempted to measure results by code, and even if only part of them become direct members of @HXL-CPLP (mostly because at that point in the Covid pandemic a lot of people were already tired), the mere existence of @covid-taskforce-cplp was necessary for @HXL-CPLP. And at some point they will be reminded of this.

Anyway, I thank everyone involved, even those who do not appear publicly but helped to manage other people.

fititnt commented 3 years ago

Just a very, very nice update: very recently I tried to re-download the COD (the same one from the first link at https://data.humdata.org/dataset/brazil-administrative-level-0-boundaries), and the P-code BR3106200 (Belo Horizonte) is not only in HXL Dash, but also in the COD 😍!

So for now, even if it would still be possible to download the files from IBGE and redo all this work for @HXL-CPLP, we can actually simply use the HDX dataset directly! This makes things so much easier.
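For example, a quick check that the published COD already carries the expected P-codes (a rough sketch with libhxl-python; the resource URL and the exact hashtags below are placeholders, not the real ones from HDX):

```python
# Minimal sketch: load an HXL-tagged CSV resource and confirm that the
# expected Brazilian P-code, e.g. BR3106200 (Belo Horizonte), is present.
import hxl

# Hypothetical URL: replace with the actual CSV/XLSX resource from HDX.
URL = "https://data.humdata.org/path/to/bra-adminboundaries.csv"

dataset = hxl.data(URL)
for row in dataset.with_rows("#adm2+code=BR3106200"):
    print(row.get("#adm2+code"), row.get("#adm2+name"))
```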

I will just leave the screenshots here!

[Screenshots: 2020-12-06 00-59-36, 00-59-52, 01-01-04, 01-03-33, 01-20-44]

Something I may look at for 2021

Anyway, not for this year, but there are some cases where it may be pertinent for the geojson/shapefiles to be customizable, though I don't have a practical use case (yet), as they are less generic than IBGE admin1 and admin2.

The simplest use case is when you need to zoom into a city. IBGE codes can actually be used for city subdivisions, but for a country like Brazil (at least for HXLDash, maybe not for the shapefiles on HDX) it would mean loading a very large file for the entire country just to zoom into one region. So, for HXLDash, an option to explicitly specify the shapefiles would solve this.

The not-so-simple case (and this would require us from @HXL-CPLP to start getting in contact with local organizations) is conflict zones inside Brazil. Two that are somewhat internationally known are the ones related to the Amazon rainforest (indigenous areas) and the Rio de Janeiro favelas. In both cases it may actually not be that people from the government are uninterested in helping with the borders; they may simply not know the information at all, or not know how to update borders if they have to use full traditional GIS tools. But in both cases, it is more likely to get help from NGOs to agree on which codes to use as a reference, and then already start to have volunteers who know how to create the shapefiles, or something like them.

Anyway, in both cases the fact of starting to promote the use of HXL-like spreadsheets is already a huge step. Also, even without geometry files, the concept of people agreeing to use something like P-codes is already likely to improve interoperability between different groups. Another point is to start contacting software developers who were interested in the Covid-19 pandemic, as they're likely to be more prone to like the HXL idea; and since I know the type of questions they may ask to start applying it to typical data in Brazil (they may ask about "CPF", "CNPJ", "CEP"…), it makes sense to already make examples with such localized data (including ways to remove personal information based on hashtags before publishing).

2021 will be fun!