alpheios-project / tokenizer

Alpheios Tokenizer Service

review initial design #3

Closed balmas closed 4 years ago

balmas commented 4 years ago

@kirlat and @irina060981 please take a look at the proposed API and initial implementation for the tokenizer service in this repository (sorry, future requests for review will be via PR as usual)

The README contains instructions for running the service via Docker and executing requests to get the OpenAPI schema for the operations, requests, and responses, as well as sample requests for tokenizing TEI and plain text.
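For reference, the workflow described above might look like the following (the port and endpoint paths are taken from later comments in this thread; the image name and text file are assumptions):

```shell
# Build and run the service locally.
docker build -t tokenizer .
docker run -p 5000:5000 tokenizer

# Fetch the OpenAPI schema from the base endpoint.
curl http://localhost:5000/

# Tokenize a plain-text file; --data-binary keeps newlines intact.
curl -X POST --data-binary @text.txt "http://localhost:5000/tokenize/text?lang=lat"
```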

I'd like your feedback on both the API and the code before proceeding further.

The tests in tests/lib/spacy/test_processor.py show the different use cases for combinations of input text and query parameters.

The tests in tests/lib/tei/test_parser.py show a few basic scenarios of how TEI XML can be supported.

Here are the big things that remain to be done:

1) Support for more granular specification of element filters for TEI XML

2) Use of language-specific tokenization rules/models

3) Ingest of TEI XML from a DTS API URL

4) Parsing of document level metadata (although it's not entirely clear to me that this really belongs in the tokenization service)

5) Deciding where token and segment ids are assigned (client, tokenization service, other service)

But I think the basics are there to support the following interactions between the Alignment Editor and the Tokenization Service

[diagram: tokenization]

irina060981 commented 4 years ago

@balmas, I need a little more time to review and test - will publish my thoughts tomorrow

kirlat commented 4 years ago

Two questions:

  1. Would the GET / request return the API descriptions for all types of requests or just the tokenization options?
  2. Would we version the API? If so, how would it be reflected in resource URLs?
kirlat commented 4 years ago

I've built a Docker image successfully, but was not able to run a Docker container 🙁. The docker run -p 5000:5000 tokenizer command returned the standard_init_linux.go:211: exec user process caused "no such file or directory" error. I'm not sure I will be able to solve it. Probably it was due to incompatibilities between the build process on Linux and Windows.

But I've checked the source code and, to the best of my understanding of Python, I was not able to notice any issues. Of course, I might be missing the finer details that are not so obvious at first look. The code seems to be well structured and organized, and at the moment I don't see any areas where it could be improved.

balmas commented 4 years ago

I've built a Docker image successfully, but was not able to run a Docker container. The docker run -p 5000:5000 tokenizer command returned the standard_init_linux.go:211: exec user process caused "no such file or directory" error. I'm not sure I will be able to solve it. Probably it was due to incompatibilities between the build process on Linux and Windows.

Shoot, I wonder if it's the call to run.sh in the ENTRYPOINT definition. I made the Dockerfile quickly just so that you could both run it, but I didn't test that it was platform-independent. Will try to look into that ASAP.
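For what it's worth, a common cause of the standard_init_linux.go:211 "no such file or directory" error when an image is built on Windows is CRLF line endings in the entrypoint script. A hedged sketch of one possible mitigation (the paths and file names are assumptions, not the repo's actual Dockerfile):

```dockerfile
# Hypothetical fragment: normalize line endings in run.sh at build time and
# use the exec form of ENTRYPOINT so no shell re-parsing is involved.
COPY run.sh /app/run.sh
RUN sed -i 's/\r$//' /app/run.sh && chmod +x /app/run.sh
ENTRYPOINT ["/bin/sh", "/app/run.sh"]
```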

balmas commented 4 years ago

I made a slight update to the Dockerfile to hopefully fix the problem running it on Windows.

kirlat commented 4 years ago

I made a slight update to the Dockerfile to hopefully fix the problem running it on Windows.

Thanks for the update! After the pull, it went a little further. When I ran it, it paused for a second or two as it was doing something in the background, and then spit out a different error message:

C:\uds\projects\alpheios\tokenizer>docker run -p 5000:5000 tokenizer
Usage: run.py [OPTIONS] COMMAND [ARGS]...
Try 'run.py --help' for help.

Error: No such command 'server'.

I can't understand what it is, but maybe you will have an idea? It seems that it was starting something (the server?) this time.

irina060981 commented 4 years ago

I was able to start up the Docker container with Python 3.7 and Docker Desktop for Windows 2.3.0.5

kirlat commented 4 years ago

I was able to start up the Docker container with Python 3.7 and Docker Desktop for Windows 2.3.0.5

Great! I have Python 3.8.1 and Docker 2.3.0.5; that seems very similar to yours, so there is probably something else.

irina060981 commented 4 years ago

I made a simple test (with Postman) - the plain-text endpoint with the following text:

In nova fert "animus-mutatas" dicere formas
corpora. Di , coeptis (nam vos mutastis-et illas)

Got the result

{
  "segments": [
    {
      "index": 0,
      "tokens": [
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 0,
          "index": 0,
          "line_break_before": false,
          "punct": false,
          "text": "In"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 1,
          "index": 1,
          "line_break_before": false,
          "punct": false,
          "text": "nova"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 2,
          "index": 2,
          "line_break_before": false,
          "punct": false,
          "text": "fert"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 3,
          "index": 3,
          "line_break_before": false,
          "punct": true,
          "text": "\""
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 4,
          "index": 4,
          "line_break_before": false,
          "punct": false,
          "text": "animus"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 5,
          "index": 5,
          "line_break_before": false,
          "punct": true,
          "text": "-"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 6,
          "index": 6,
          "line_break_before": false,
          "punct": false,
          "text": "mutatas"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 7,
          "index": 7,
          "line_break_before": false,
          "punct": true,
          "text": "\""
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 8,
          "index": 8,
          "line_break_before": false,
          "punct": false,
          "text": "dicere"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 9,
          "index": 9,
          "line_break_before": false,
          "punct": false,
          "text": "formas"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 11,
          "index": 10,
          "line_break_before": false,
          "punct": false,
          "text": "corpora"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 12,
          "index": 11,
          "line_break_before": false,
          "punct": true,
          "text": "."
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 13,
          "index": 12,
          "line_break_before": false,
          "punct": false,
          "text": "Di"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 14,
          "index": 13,
          "line_break_before": false,
          "punct": true,
          "text": ","
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 15,
          "index": 14,
          "line_break_before": false,
          "punct": false,
          "text": "coeptis"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 16,
          "index": 15,
          "line_break_before": false,
          "punct": true,
          "text": "("
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 17,
          "index": 16,
          "line_break_before": false,
          "punct": false,
          "text": "nam"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 18,
          "index": 17,
          "line_break_before": false,
          "punct": false,
          "text": "vos"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 19,
          "index": 18,
          "line_break_before": false,
          "punct": false,
          "text": "mutastis"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 20,
          "index": 19,
          "line_break_before": false,
          "punct": true,
          "text": "-"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 21,
          "index": 20,
          "line_break_before": false,
          "punct": false,
          "text": "et"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 22,
          "index": 21,
          "line_break_before": false,
          "punct": false,
          "text": "illas"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 23,
          "index": 22,
          "line_break_before": false,
          "punct": true,
          "text": ")"
        }
      ]
    }
  ]
}
irina060981 commented 4 years ago

Some questions about the result:

  1. There are no line_break_before: true tokens, but I passed 2 rows

[screenshot]

  2. What are the meanings of docIndex and index? From 1-6 they are equal, but later they are not.

  3. What about the words with "-" (in my example, animus-mutatas and mutastis-et)? I believe each should be one word? I think we won't be able to decide this on the client: "-" could be part of a word or part of the sentence.

  4. For plain text, do we always have only one segment?

irina060981 commented 4 years ago

Great! I have Python 3.8.1 and Docker 2.3.0.5; that seems very similar to yours, so there is probably something else.

@kirlat, I have a full environment for working with Flask on my local machine, and as I can see here, we have Flask:

[screenshot]

Maybe you are missing something that Flask needs? (It would be strange with Docker, but I believe it's possible.)

irina060981 commented 4 years ago

About TEI:

  1. Do we have a test with more than one segment? I tried the text from tests/fixtures/tei/multisegout.txt in both plain-text and TEI formats, but I didn't get more than one segment; the same for tei/caesarciv.xml - only one segment.
kirlat commented 4 years ago

@kirlat, I have a full environment for working with Flask on my local machine, and as I can see here, we have Flask

It might be that I'm missing something in my configuration, but I was thinking that all elements needed to run the service are within the Docker container and the outside environment is of no importance, as long as Docker is installed. @balmas, do we need anything outside the container in order to run the service?

irina060981 commented 4 years ago

@balmas, also we don't pass ltr/rtl - so is this parameter needed only by the alignment editor?

balmas commented 4 years ago

@kirlat, I have a full environment for working with Flask on my local machine, and as I can see here, we have Flask

It might be that I'm missing something in my configuration, but I was thinking that all elements needed to run the service are within the Docker container and the outside environment is of no importance, as long as Docker is installed. @balmas, do we need anything outside the container in order to run the service?

It should not be necessary to have anything outside the container. Did you try with the latest Dockerfile?

irina060981 commented 4 years ago

Also, I didn't fully understand how citations are defined (by the way, the citation is not detected without the first blank line).

text:


META|CITE_citation1 In nova
META|CITE_citation2 corpora.

result

{
  "segments": [
    {
      "index": 0,
      "tokens": [
        {
          "alpheios_data_cite": "citation1",
          "alpheios_data_tb_word": "",
          "docIndex": 2,
          "index": 0,
          "line_break_before": false,
          "punct": false,
          "text": "In"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 3,
          "index": 1,
          "line_break_before": false,
          "punct": false,
          "text": "nova"
        },
        {
          "alpheios_data_cite": "citation2",
          "alpheios_data_tb_word": "",
          "docIndex": 6,
          "index": 2,
          "line_break_before": false,
          "punct": false,
          "text": "corpora"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 7,
          "index": 3,
          "line_break_before": false,
          "punct": true,
          "text": "."
        }
      ]
    }
  ]
}

Is it correct that the citation is applied only to the first word? And again - there are no line breaks here.

irina060981 commented 4 years ago

And about the code: I didn't fully understand how you define the config for segmentation.

Step by step:

    segments = processor.tokenize(
        text=text,
        lang=config['lang'],
        segmentOn=config['segments'],
        segmentStart=config['segstart'],
        segmentMetadataTemplate=segmentMetadataTemplate
    )

Processor is an outboard spaCy class. For text, the config is defined from the request:

config = schema.load(request.args)

For TEI:

config = schema.load(request.args)
config['segments'] = 'doubleline'

Could you explain a little bit more how this works? What arguments should be passed here?

I believe it would be good to have some descriptions for the available request params.

balmas commented 4 years ago

Some questions about the result:

  1. There are no line_break_before: true tokens, but I passed 2 rows

[screenshot]

It's possible a change is needed in the service to have it interpret the POSTed data as binary to retain the new lines in plain text. I found that with curl, the newlines were flattened unless I used the --data-binary argument to pass the data with newlines intact. I was thinking that was a curl issue, but if you are having it with Postman too, maybe I need to do something on the service side. I will enter an issue and look into it.
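As a concrete illustration of the curl behavior mentioned above (the URL and file name are illustrative): curl's --data strips carriage returns and newlines when reading a file with @, while --data-binary posts the file verbatim.

```shell
# Newlines in text.txt are stripped before sending:
curl -X POST --data @text.txt "http://localhost:5000/tokenize/text?lang=lat"

# Newlines are preserved:
curl -X POST --data-binary @text.txt "http://localhost:5000/tokenize/text?lang=lat"
```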

  2. What are the meanings of docIndex and index? From 1-6 they are equal, but later they are not.

docIndex is the index of the token across all segments, and index is the index within the segment. However, as I exclude certain tokens (e.g. metadata-only tokens and newlines) from the segments, they won't always match up. To be honest, I'm not sure of the value of returning the docIndex. I am doing it mainly to assist with reproducibility or with later alignment of tokens with the original source document.
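The relationship described above can be sketched as follows (an illustrative example, not the service's actual code): docIndex counts every token in the source document, while index counts only the tokens kept in the segment, so the two drift apart once tokens such as newlines are excluded.

```python
# Illustrative token stream; "\n" stands for a token the service excludes.
doc_tokens = ["In", "nova", "\n", "corpora", "."]

segment_tokens = []
seg_index = 0
for doc_index, text in enumerate(doc_tokens):
    if text == "\n":  # excluded: consumes a docIndex but no segment index
        continue
    segment_tokens.append({"docIndex": doc_index, "index": seg_index, "text": text})
    seg_index += 1

# "corpora" ends up with docIndex 3 but index 2, mirroring the gap seen
# in the JSON output earlier in this thread.
```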

  3. What about the words with "-" (in my example, animus-mutatas and mutastis-et)? I believe each should be one word? I think we won't be able to decide this on the client: "-" could be part of a word or part of the sentence.

Yes, this needs to be handled still by the language-specific models.

  4. For plain text, do we always have only one segment?

See the answer to 1 above: plain text submitted with newlines and default arguments SHOULD be interpreted as segments=singleline and thus as one segment per line, but if the newlines are being flattened in the plain-text input, it doesn't work right.
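The intended segmentation behavior might be sketched like this (an illustrative stand-in, not the service's implementation): singleline starts a new segment at every newline, doubleline only at blank lines.

```python
def split_segments(text, segments="singleline"):
    """Split plain text into segment strings per the 'segments' option."""
    if segments == "doubleline":
        parts = text.split("\n\n")   # blank line separates segments
    else:
        parts = text.splitlines()    # every line is its own segment
    return [p.strip() for p in parts if p.strip()]
```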

balmas commented 4 years ago

About TEI:

  1. Do we have a test with more than one segment? I tried the text from tests/fixtures/tei/multisegout.txt in both plain-text and TEI formats, but I didn't get more than one segment; the same for tei/caesarciv.xml - only one segment.

test_tokenize_segline in test_tokenizer.py shows the use of withlines.xml for multi-segment results with TEI.

test_tokenize_text_defaults in test_tokenizer.py shows the use of lineseg.txt for multi-segment results with plain text. But for text, per the previous comment, I may have more work to do so that newlines are not stripped from the plain-text input in real-life use.

balmas commented 4 years ago

Also, I didn't fully understand how citations are defined (by the way, the citation is not detected without the first blank line).

text:


META|CITE_citation1 In nova
META|CITE_citation2 corpora.

result

{
  "segments": [
    {
      "index": 0,
      "tokens": [
        {
          "alpheios_data_cite": "citation1",
          "alpheios_data_tb_word": "",
          "docIndex": 2,
          "index": 0,
          "line_break_before": false,
          "punct": false,
          "text": "In"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 3,
          "index": 1,
          "line_break_before": false,
          "punct": false,
          "text": "nova"
        },
        {
          "alpheios_data_cite": "citation2",
          "alpheios_data_tb_word": "",
          "docIndex": 6,
          "index": 2,
          "line_break_before": false,
          "punct": false,
          "text": "corpora"
        },
        {
          "alpheios_data_cite": "",
          "alpheios_data_tb_word": "",
          "docIndex": 7,
          "index": 3,
          "line_break_before": false,
          "punct": true,
          "text": "."
        }
      ]
    }
  ]
}

Is it correct that the citation is applied only to the first word? And again - there are no line breaks here.

Hmm, here the citation should be defined on the segment, but I think it's not working as expected because of the newline problem.

Metadata at the beginning of a segment should apply to the segment.

Metadata before a word should apply to the word.

The tests that show the expected behavior are in tests/lib/spacy/test_processor.py: test_tokenize_linesegcite shows a citation at the beginning of a segment, and test_tokenize_linesegcustomtb shows treebank metadata on both the segment and the word.
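The META|CITE_ convention used in the examples above could be parsed along these lines (an illustrative sketch, not the service's parser): a "META|CITE_<value>" marker attaches a citation to whatever follows it.

```python
def split_citation(line):
    """Return (citation, remaining_text) for one line of input.

    A line like "META|CITE_citation1 In nova" yields the citation value
    plus the rest of the line; a plain line yields an empty citation.
    """
    prefix = "META|CITE_"
    if line.startswith(prefix):
        citation, _, rest = line[len(prefix):].partition(" ")
        return citation, rest
    return "", line
```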

balmas commented 4 years ago

Could you explain a little bit more how this works? What arguments should be passed here?

I believe it would be good to have some descriptions for available request params

The request params do have descriptions in the OpenAPI schema, which is available at the base endpoint for the service:

{
  "paths": {
    "/tokenize/tei": {
      "post": {
        "description": "Tokenize a TEI XML text",
        "parameters": [
          {
            "in": "query",
            "name": "linebreaks",
            "required": false,
            "description": "Comma-separated list of elements to line-break after for display.",
            "schema": {
              "type": "string",
              "default": "p,div,seg,l,ab"
            }
          },
          {
            "in": "query",
            "name": "ignore",
            "required": false,
            "description": "Comma-separated list of elements whose contents should be ignored.",
            "schema": {
              "type": "string",
              "default": "label,ref,milestone,orig,abbr,head,title,teiHeader,del,g,bibl,front,back,speaker"
            }
          },
          {
            "in": "query",
            "name": "lang",
            "required": true,
            "description": "Language code of text to be tokenized.",
            "schema": {
              "type": "string"
            }
          },
          {
            "in": "query",
            "name": "segstart",
            "required": false,
            "description": "Starting segment index.",
            "schema": {
              "type": "integer",
              "format": "int32",
              "default": 0
            }
          },
          {
            "in": "query",
            "name": "segments",
            "required": false,
            "description": "Comma-separated list of elements which identify segments.",
            "schema": {
              "type": "string",
              "default": "body"
            }
          },
          {
            "in": "query",
            "name": "tbseg",
            "required": false,
            "description": "True means 'alpheios_data_tb_sent' metadata to be set from segment index",
            "schema": {
              "type": "boolean",
              "default": false
            }
          }
        ],
        "responses": {
          "201": {
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/TokenizeResponse"
                }
              }
            }
          }
        }
      }
    },
    "/tokenize/text": {
      "post": {
        "description": "Tokenize a plain text document.",
        "parameters": [
          {
            "in": "query",
            "name": "segstart",
            "required": false,
            "description": "Starting segment index.",
            "schema": {
              "type": "integer",
              "format": "int32",
              "default": 0
            }
          },
          {
            "in": "query",
            "name": "segments",
            "required": false,
            "description": "Segment indicator.",
            "schema": {
              "type": "string",
              "default": "singleline",
              "enum": [
                "singleline",
                "doubline"
              ]
            }
          },
          {
            "in": "query",
            "name": "lang",
            "required": true,
            "description": "Language code of text to be tokenized.",
            "schema": {
              "type": "string"
            }
          },
          {
            "in": "query",
            "name": "tbseg",
            "required": false,
            "description": "True means 'alpheios_data_tb_sent' metadata to be set from segment index.",
            "schema": {
              "type": "boolean",
              "default": false
            }
          }
        ],
        "responses": {
          "201": {
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/TokenizeResponse"
                }
              }
            }
          }
        }
      }
    }
  },
  "info": {
    "title": "Alpheios Tokenizer Service",
    "version": "1.0.0"
  },
  "openapi": "3.0.2",
  "components": {
    "schemas": {
      "Token": {
        "type": "object",
        "properties": {
          "punct": {
            "type": "boolean",
            "description": "Indicates if the Token is Punctuation."
          },
          "alpheios_data_cite": {
            "type": "string",
            "description": "Metadata field for Alpheios Reading Tools - provides Citatable Identifier."
          },
          "text": {
            "type": "string",
            "description": "Text contents of the Token."
          },
          "docIndex": {
            "type": "integer",
            "format": "int32",
            "description": "Index of the Token in the Document."
          },
          "index": {
            "type": "integer",
            "format": "int32",
            "description": "Index of the Token in the parent Segment."
          },
          "line_break_before": {
            "type": "boolean",
            "description": "Indicates if the Token should have a Line Break displayed before it."
          },
          "alpheios_data_tb_word": {
            "type": "string",
            "description": "Metadata field for Alpheios Reading Tools - provides Treebank Word Identifier."
          }
        },
        "required": [
          "docIndex",
          "index",
          "line_break_before",
          "punct",
          "text"
        ]
      },
      "Segment": {
        "type": "object",
        "properties": {
          "alpheios_data_tb_sent": {
            "type": "string",
            "description": "Metadata field for Alpheios Reading Tools - provides Treebank Sentence Identifier."
          },
          "index": {
            "type": "integer",
            "format": "int32",
            "description": "Index of Segment in the Document."
          },
          "alpheios_data_cite": {
            "type": "string",
            "description": "Metadata field for Alpheios Reading Tools - provides Citatable Identifier."
          },
          "tokens": {
            "type": "array",
            "description": "List of Tokens in the Segment.",
            "items": {
              "$ref": "#/components/schemas/Token"
            }
          }
        },
        "required": [
          "index",
          "tokens"
        ]
      },
      "TokenizeResponse": {
        "type": "object",
        "properties": {
          "metadata": {
            "type": "object",
            "description": "Text-level metadata dictionary."
          },
          "segments": {
            "type": "array",
            "description": "List of Segments.",
            "items": {
              "$ref": "#/components/schemas/Segment"
            }
          }
        },
        "required": [
          "metadata",
          "segments"
        ]
      }
    }
  }
}

(If you load this schema into an OpenAPI client, such as the editor at https://editor.swagger.io/, you'll get a nice view of the API, although I see I do have a couple of errors :-) )
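A minimal sketch of consuming the TokenizeResponse shape defined in the schema above; the field names come from the schema, but the response data here is illustrative.

```python
# An abbreviated TokenizeResponse, shaped per the schema above.
response = {
    "metadata": {},
    "segments": [
        {"index": 0, "tokens": [
            {"docIndex": 0, "index": 0, "line_break_before": False,
             "punct": False, "text": "In"},
            {"docIndex": 1, "index": 1, "line_break_before": False,
             "punct": False, "text": "nova"},
            {"docIndex": 2, "index": 2, "line_break_before": False,
             "punct": True, "text": ","},
        ]},
    ],
}

# Collect the word tokens (skipping punctuation) from every segment.
words = [t["text"] for seg in response["segments"] for t in seg["tokens"]
         if not t["punct"]]
```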

[screenshot from 2020-09-17 09-48-26]

balmas commented 4 years ago

@balmas, also we don't pass ltr/rtl - so is this parameter needed only by the alignment editor?

It's a good question. I think the language model should handle it appropriately but it could be that we will need it. I will add an issue for myself about that.

irina060981 commented 4 years ago

@balmas, I tried http://localhost:5000/tokenize/text?lang=lat with binary text: [screenshot]

Anyway, the citation is not applied correctly without the first empty line: [screenshot]

If I add an empty line, then citation1 appears, but I also get an empty segment: [screenshot]

irina060981 commented 4 years ago

Also, I have a question: if the tokenization service works correctly only with binary data, how would we use it inside the application? Would window.fetch and axios send the data correctly? Or should we have some conversion steps here?

balmas commented 4 years ago

Also, I have a question: if the tokenization service works correctly only with binary data, how would we use it inside the application? Would window.fetch and axios send the data correctly? Or should we have some conversion steps here?

It's my belief that this is a curl problem specifically, because curl strips the newlines. I could be wrong about that; it will be easy enough to test.

balmas commented 4 years ago

Anyway, the citation is not applied correctly without the first empty line: [screenshot]

If I add an empty line, then citation1 appears, but I also get an empty segment: [screenshot]

Ok, well I will look at this. It was working fine in my tests. Will move to an issue.

balmas commented 4 years ago

@irina060981 you can begin to use the tokenizer service for your development now. It's currently deployed at https://tools.alpheios.net/tokenizer/.

We still need to work out how the alignment editor will get the base url for the service in production. I think it should probably come from a call to the config service.