gawati / gawati-data

Gawati Data server is a component of the Gawati application, a legal data exchange platform.

Analytics: Extract Law metadata #13

Open kohsah opened 6 years ago

kohsah commented 6 years ago

There is a large amount of African legislation online on the ILO NATLEX portal. This is a UN agency website which has curated African legislation by subject and keyword.

For the Gawati Project (https://www.gawati.org) we are trying to build a repository of African legislation that can be searched in one place but is curated from different sources. ILO NATLEX is one such source.

ILO NATLEX site: http://www.ilo.org/dyn/natlex/natlex4.home?p_lang=en

We are interested only in African countries, and these can be found under country profiles:

http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en

For example, see Zimbabwe: http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=ZWE

Each of the items here leads to a document citation:

[screenshot: the subject classification list for Zimbabwe]

E.g. clicking "General provisions": http://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

[screenshot: the list of results under "General provisions"]

If I click "Special Economic Zones Act":

http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

This refers to a "Special Economic Zones Act" of the country; the link to the PDF is highlighted below in yellow.

[screenshot: the document detail page, with the PDF link highlighted in yellow]

There is important metadata here about the document:

Name, Country, Type, Official Date (Adopted On), ISN number, and the citation text, plus the PDF itself.

We need to extract this information into the official Akoma Ntoso XML format used by Gawati.
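For concreteness, a scraped record for the act above might be shaped like this (a sketch only: the name, ISN, and country come from the page and URL above, while the other values are placeholders to be read off the page):

record = {
    'name': 'Special Economic Zones Act',
    'country': 'ZWE',      # alpha-3 code from the URL
    'type': '...',         # document type shown on the page
    'adopted_on': '...',   # the "Adopted on" date
    'isn': 104410,         # p_isn parameter from the URL
    'citation': '...',     # citation text from the page
    'pdf_url': '...',      # the highlighted PDF link
}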

Steps to Take

1 Akoma Ntoso XML

Here are sample documents in Akoma Ntoso XML format: (a) has the XML documents, and (b) has the PDF documents that are described by (a).

(a) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_xml_sample-1.2.zip
(b) https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_pdf_sample-1.2.zip

2 Downloading Source Documents

For the ILO site, you first need to gather the raw data. For each African country (let's start with one country, Zimbabwe), download the source metadata (as shown earlier) and the associated PDF document.
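Here is a minimal sketch of the download step in Python, assuming the requests library; the function names and the detail-URL template are illustrative, based on the URL structure shown earlier:

import requests

# Detail-page URL template, following the structure of the Zimbabwe
# example above: p_isn identifies the document, p_country the country.
DETAIL_URL = ('http://www.ilo.org/dyn/natlex/natlex4.detail'
              '?p_lang=en&p_isn={isn}&p_country={country}')

def fetch_detail_html(isn, country):
    """Fetch the raw HTML of a NATLEX detail page for later parsing."""
    resp = requests.get(DETAIL_URL.format(isn=isn, country=country))
    resp.raise_for_status()
    return resp.text

def download_pdf(pdf_url, out_path):
    """Save the PDF linked from a detail page next to its metadata."""
    resp = requests.get(pdf_url)
    resp.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(resp.content)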

3 Processing Downloaded Data

The next step is to process the downloaded raw metadata and convert it to Akoma Ntoso format. The PDF file need not be converted, but it needs to be associated with the corresponding Akoma Ntoso document, as shown in "1 Akoma Ntoso XML" above.
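As a rough sketch of this step (standard library only; the element names below are placeholders, and the real structure must be copied from the samples in akn_xml_sample-1.2.zip):

import xml.etree.ElementTree as ET

def build_metadata_xml(record):
    # Placeholder element names; swap in the real Akoma Ntoso structure
    # from the sample XML documents.
    root = ET.Element('document')
    ET.SubElement(root, 'name').text = record['name']
    ET.SubElement(root, 'country', code=record['country'])
    ET.SubElement(root, 'type').text = record['type']
    ET.SubElement(root, 'adoptedOn').text = record['adopted_on']
    ET.SubElement(root, 'isn').text = str(record['isn'])
    ET.SubElement(root, 'citation').text = record['citation']
    return ET.tostring(root, encoding='unicode')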

kohsah commented 6 years ago

Proposed changes to metadata.xml v1 (see the attached v2 sample for comment suggestions). Also see the country code documents: you can use these to associate a document with a country code. The content is the same in both; the JSON file is perhaps easier to load and use from a Python script. You need to use the "alpha-2" code for a country.

(files renamed as .txt to allow upload)

country_codes.json.txt metadata_v2.xml.txt metadata.xml.txt country_codes.xml.txt

kohsah commented 6 years ago

@yash1802 we also need to determine the language of the document, but it's not stated in the metadata. However, the metadata has an abstract which seems to always be in the language of the document, so we can make a best guess.

I have used both of these libraries before for determining the language of text fragments; I remember optimaize being slightly better, but langdetect may be more suitable here.

https://github.com/Mimino666/langdetect https://github.com/optimaize/language-detector

So you need to detect the language of the abstract and use it to set the document metadata, something like this (if the document is French):

<language code="fra" />

There is a JSON file with language codes; we use the 'alpha-3b' syntax:

https://raw.githubusercontent.com/gawati/gawati-portal-ui/dev/src/configs/languageCodes.json
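A minimal sketch with langdetect, which returns ISO 639-1 (alpha-2) codes; the mapping to alpha-3b assumes each entry in languageCodes.json carries both codes, so check the key names against the actual file:

import json
from langdetect import detect  # returns ISO 639-1 codes, e.g. 'fr'

with open('languageCodes.json') as f:
    language_codes = json.load(f)

def alpha3b_for(abstract_text):
    """Guess the document language from its abstract and return the
    alpha-3b code for the <language> element, or None if unmapped."""
    alpha2 = detect(abstract_text)
    for entry in language_codes:
        if entry.get('alpha-2') == alpha2:
            return entry['alpha-3b']  # e.g. 'fra' for French
    return None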

kohsah commented 6 years ago

@yash1802 This should give you a list of African countries from the country codes JSON:

import json

# Load the country codes attached earlier (rename country_codes.json.txt
# back to country_codes.json first).
with open('country_codes.json') as f:
    data = json.load(f)

# '002' is the UN M49 region code for Africa.
african_countries = [
    country for country in data['countries']['country']
    if country['region-code'] == '002'
]

The URLs on the NATLEX by-country page (http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en) have the following structure:

http://..../natlex4.countrySubjects?p_lang=en&p_country=BDI
                                                        ^^^ <== alpha-3 country code

and we have the alpha-3 code for each country in the country codes JSON, so you can iterate over the african_countries list to build the starting-point URLs you need to fetch.
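For example (a sketch; this assumes each entry in african_countries exposes its alpha-3 code under an 'alpha-3' key, which should be checked against the JSON):

SUBJECTS_URL = ('http://www.ilo.org/dyn/natlex/natlex4.countrySubjects'
                '?p_lang=en&p_country={}')

start_urls = [SUBJECTS_URL.format(c['alpha-3']) for c in african_countries]
# e.g. http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=BDI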

kohsah commented 6 years ago

@yash1802 See this document for info on extracting the bibliography and a few other things: https://docs.google.com/document/d/1HsM7zGoulr3_dkxGQEwY9FONcrCfSJ3mtq7bKV4XUxw/edit?usp=sharing