Parsing of tables - Githubissues

plison commented 3 years ago

First of all, thanks for the great work on this dataset!

I've noticed in some court cases (such as Sklyar v. Russia) that tables present in the original document were ignored by the crawler, and are thus missing from the resulting dataset.

Would it be possible to include these tables in one way or another (for instance using a tab-separated format)? Or, if that turns out to be difficult, at least get a flag in the meta-data that would indicate that the original text included some tabular elements that were ignored by the parser?

aquemy commented 3 years ago

Hi @plison,

As you mentioned, the main problem here is that some steps use only textual information to generate BoW and several text embeddings. It would be nice to parse the table, add it to the meta-data or document list, and refer to it somehow in the document itself.

aquemy commented 3 years ago

Hi Pierre @plison,

Could you give me some feedback on the following approach/design, please? In particular, whether it would suit your needs.

I managed to use python-docx to locate precisely the table and extract the content in JSON format that can be directly loaded by pandas or numpy (stored as one entity per row):

Line 54 <docx.table.Table object at 0x7f5c210cf9d0>
[{'Period of detention\n': '\n10-15 February 2011\n', 'Unit no.\n\n': 'Quarantine section', 'Dormitory surface area in square metres\n\n': '50.2', 'Number of sleeping places\n': '24', 'Number of inmates assigned to the dormitory\n': '6-24', 'Number of washbasins and lavatories': '2 and 2'}, {'Period of detention\n': '\n15 February – 30\xa0May\xa02011', 'Unit no.\n\n': '\n17', 'Dormitory surface area in square metres\n\n': '219', 'Number of sleeping places\n': '109', 'Number of inmates assigned to the dormitory\n': '100-108', 'Number of washbasins and lavatories': '8 and 10'}, {'Period of detention\n': '\n30 May 2011 – February 2013', 'Unit no.\n\n': '7', 'Dormitory surface area in square metres\n\n': '213.2', 'Number of sleeping places\n': '106', 'Number of inmates assigned to the dormitory\n': '100-105', 'Number of washbasins and lavatories': '8 and 10'}]
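A minimal sketch of how such a one-entity-per-row structure could be produced with python-docx. It assumes the first row of each table holds the column headers; `table_to_records` and `extract_tables` are hypothetical helper names, not functions from the repository. Note that `document.tables` does not report where a table sits in the text, so this covers extraction only.

```python
def table_to_records(rows):
    """Convert rows of cell strings into one dict per data row.

    Assumption: the first row holds the column headers.
    """
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_tables(path):
    """Return one list of records per top-level table in a .docx file."""
    from docx import Document  # python-docx, imported lazily

    document = Document(path)
    return [
        table_to_records([[cell.text for cell in row.cells] for row in table.rows])
        for table in document.tables
    ]
```

Each list of records can then be passed to `pandas.DataFrame(records)`.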

The tables should at least be referenced in the judgment tree at the exact location where they appear. I can imagine it will be easier to work directly on the tables if they are stored somehow outside the tree itself.

Something like this:

"content": {
        "001-175680.docx": [
            {
                "content": "INTRODUCTION",
                "elements": [...],
                "section_name": "introduction"
            },
            {
                "content": "THE FACTS",
                "elements": [
                    ...
                    {
                        "content": "10.  As regards the conditions of the applicant’s detention in the IK-8 facility, the Government submitted information which can be summarised as follows:",
                        "elements": [
                            {
                                "type": "table",
                                "content": "table-1",
                                "elements": [
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
},
"metadata": {
    "001-175680.docx": {
        "table-1": {
            "type": "table",
            "content": {see above}
        }
    }
}
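To illustrate how a consumer might use this layout, here is a small sketch that walks the judgment tree and replaces each table reference with the table stored outside the tree. The function name `resolve_tables` is hypothetical, and the node layout ("type", "content", "elements") is assumed from the example above.

```python
def resolve_tables(node, tables):
    """Return a copy of a judgment-tree node with table references inlined.

    `tables` maps identifiers such as "table-1" to entries of the form
    {"type": "table", "content": [...]} (assumed layout).
    """
    resolved = dict(node)
    if node.get("type") == "table":
        # The node's "content" is an identifier; swap in the actual rows.
        resolved["content"] = tables[node["content"]]["content"]
    resolved["elements"] = [
        resolve_tables(child, tables) for child in node.get("elements", [])
    ]
    return resolved
```

The input tree is left unchanged, so the compact reference form can still be serialised afterwards.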

Minimal steps:

Modifying the preprocessing parser to iterate over the document XML rather than only over paragraphs, rebuilding Paragraph and Table objects based on each element's type.
Cleaning the cells of tabs, extra spaces, etc.
Modifying the JSON tree representation to include a "type" field (text/paragraph by default) to reference external meta-data (e.g. {"type": "table", "content": "table-1"})
Adding tables to the SQL database
Taking tables into account in NLP models (Bag-of-Words and so on)
Displaying tables in the web explorer
Adjusting the documentation and the functions that parse/recreate the document from the JSON description
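The first two steps can be sketched as follows. This version uses only the Python standard library and reads word/document.xml directly, so it is an illustration of the idea and not the python-docx-based implementation; the helper names (`clean`, `cell_text`, `iter_blocks`, `read_blocks`) are made up for the example.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def clean(text):
    """Collapse tabs, newlines, non-breaking and repeated spaces."""
    return " ".join(text.split())


def element_text(element):
    """Concatenate the text of all w:t runs below an element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def cell_text(cell):
    """Join the paragraphs of a table cell with spaces, then clean."""
    return clean(" ".join(element_text(p) for p in cell.iter(W + "p")))


def iter_blocks(document_xml):
    """Yield ("paragraph", text) or ("table", rows) in document order."""
    body = ET.fromstring(document_xml).find(W + "body")
    for child in body:
        if child.tag == W + "p":
            yield "paragraph", clean(element_text(child))
        elif child.tag == W + "tbl":
            rows = [
                [cell_text(cell) for cell in row.findall(W + "tc")]
                for row in child.findall(W + "tr")
            ]
            yield "table", rows


def read_blocks(path):
    """Read the main document part of a .docx file and list its blocks."""
    with zipfile.ZipFile(path) as archive:
        return list(iter_blocks(archive.read("word/document.xml")))
```

Because blocks come out in document order, a table can be attached to the paragraph that immediately precedes it, which is what the "type" reference in the tree needs.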

Extra-steps:

Checking across the whole dataset whether some tables are column-oriented rather than row-oriented and, if so, trying to detect them in order to store them in the proper orientation
Detecting the type for each column (e.g. date, float, etc.)
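A rough sketch of both extra steps, under simple assumptions: a column is typed "int" or "float" only if every non-empty cell parses as such, and anything else (including dates and ranges like "6-24") falls back to "text". Detecting the orientation itself is left open; `transpose` only shows the fix once a column-oriented table has been identified. All names here are hypothetical.

```python
def _all_parse(cast, values):
    """True if every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_column_type(values):
    """Guess "int", "float" or "text" for a column of cell strings."""
    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "text"
    if _all_parse(int, values):
        return "int"
    if _all_parse(float, values):
        return "float"
    return "text"


def infer_types(records):
    """Map each column name of a list of row dicts to its guessed type."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return {key: infer_column_type(vals) for key, vals in columns.items()}


def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*rows)]
```

Date detection would need a dedicated parser and is deliberately not attempted here.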

plison commented 3 years ago

Thanks for testing this! This looks like a neat solution. Another solution would be to adopt a markdown format to represent the table (https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables), which would have the advantage of being purely text-based like the rest of the document, while still providing a way to easily convert to more structured formats such as pandas data frames.

For my own needs, I must admit that I decided to simply skip documents including tables at the moment, but I may come back to it at a later stage.
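For reference, the markdown alternative could look roughly like this. The sketch assumes the table is already available as a list of row dicts with identical keys, as in the JSON output earlier in the thread; `records_to_markdown` is a hypothetical name.

```python
def records_to_markdown(records):
    """Render a list of row dicts as a pipe-delimited markdown table."""
    if not records:
        return ""
    headers = list(records[0])

    def line(cells):
        # Escape pipes so cell text cannot break the column layout.
        escaped = (str(cell).replace("|", "\\|") for cell in cells)
        return "| " + " | ".join(escaped) + " |"

    lines = [line(headers), line(["---"] * len(headers))]
    lines += [line([record.get(h, "") for h in headers]) for record in records]
    return "\n".join(lines)
```

The result stays plain text and can be embedded in the judgment body, at the cost of having to parse it again to recover structured rows.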

aquemy commented 3 years ago

I decided to use "attachement" instead of "metadata" as the tables are part of the judgment document itself. The field "metadata" will be used, among other things, for entities extracted from the text (#124 PoC in progress).

More info about the implementation is in PR #165.

echr-od / ECHR-OD_process

Parsing of tables #151