Sure, why not. :) You probably want to grab both TF and pytorch versions.
At the NLPL repository, we have a standardized metadata structure for all the models we host. @fginter Can you please check below whether I filled the fields correctly for the cased FinBERT (I put comments where the meaning of a field is not immediately obvious)? Also, would you like the models to be stored as TF checkpoints or rather in the HDF5 format?
{
"algorithm": { # The description of the training algorithm
"id": 7, # this will be the NLPL persistent id for the BERT architecture
"name": "BERT",
"url": "https://github.com/google-research/bert",
"version": null # Or do you may be want to tell users a specific version of BERT that you used?
},
"contents": [ # This is a list of files in the archive
{
"filename": "bert_config.json",
"format": "json"
},
{
"filename": "bert_model.ckpt.index",
"format": "data"
},
{
"filename": "bert_model.ckpt.meta",
"format": "data"
},
{
"filename": "bert_model.ckpt.data-00000-of-00001",
"format": "data"
},
{
"filename": "meta.json",
"format": "json"
},
{
"filename": "vocab.txt",
"format": "text"
}
],
"corpus": [ # The description of the training corpus
{
"NER": false, # No named entity recognition was done on the corpus
"case preserved": true,
"description": "Finnish Web Corpus", # Or what would be the correct human-readable name for this corpus?
"id": 107, # This will be the NLPL persistent id for this corpus
"language": "fin",
"lemmatized": false, # No lemmatization was performed, right?
"public": true, # Is the corpus publicly available?
"stop words removal": null, # No stop words were removed, right?
"tagger": null, # No PoS tagging
"tagset": null,
"tokens": 3000000000, # Do you have the exact number of tokens?
"tool": null, # Any specific tool used to create the corpus?
"url": null # Any public source of the corpus?
}
],
"dimensions": 768, # Hidden layer size
"documentation": "https://github.com/TurkuNLP/FinBERT",
"external_id": "Cased Finnish BERT Base (FinBERT)",
"id": 197, # This will be the NLPL persistent id for this particular model
"iterations": null, # For how many epochs you trained (how many passes over the corpus)?
"maintainers": [
{
"email": "figint@utu.fi",
"name": "Filip Ginter"
}
]
}
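For illustration, a minimal sketch (not part of the NLPL tooling; the directory name finbert-cased/ is a placeholder) that checks an unpacked archive against the "contents" list in meta.json:

# Sketch: verify that every file listed under "contents" in meta.json
# is present in the unpacked model archive.
import json
from pathlib import Path

model_dir = Path("finbert-cased")  # placeholder directory name
meta = json.loads((model_dir / "meta.json").read_text(encoding="utf-8"))

for entry in meta["contents"]:
    present = (model_dir / entry["filename"]).exists()
    print(f"{entry['filename']:45s} [{entry['format']}] {'ok' if present else 'MISSING'}")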
Hi,
this seems to be all correct, but adding the pytorch version (link) as well would be nice. It uses the same vocab and config files as the TF checkpoints.
Sorry about the late reply.
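To illustrate the shared-files point, here is a hedged sketch of loading the PyTorch weights with the transformers library while reusing bert_config.json and vocab.txt from the TF archive; the directory finbert-torch/ (containing pytorch_model.bin) is an assumption about the download layout:

# Sketch only: reuse the TF archive's config and vocab with the PyTorch weights.
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig.from_json_file("bert_config.json")  # same config as the TF checkpoint
tokenizer = BertTokenizer(vocab_file="vocab.txt")        # same vocab as the TF checkpoint
# Passing the config explicitly means the directory only needs to hold pytorch_model.bin.
model = BertModel.from_pretrained("finbert-torch/", config=config)

inputs = tokenizer("Turku on kaupunki.", return_tensors="pt")
print(model(**inputs).last_hidden_state.shape)  # (1, seq_len, 768)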
Hi @haamis We have already published the TF versions (see the models with the ids 197 and 198 here). Do you think it makes sense to append the pytorch versions to the existing archives, or would it be better to release them as separate models with different identifiers?
if it is really the same models (same parameters), i would be in favor of publishing both serializations with the same repository identifier (a bit like we do for textual and binary formats of the non-contextualized models). even though 197 and 198 are published already (a week or so ago), we can allow ourselves to monotonically extend the contents of those archives, i think.
They are indeed the same parameters, just converted to the format the Hugging Face transformers library uses.
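(Not necessarily the exact procedure used, but for reference, transformers can read the original TF checkpoint directly and re-save it in its own format; the paths below are placeholders, and both torch and tensorflow need to be installed:)

# Sketch of a TF-to-pytorch conversion with transformers; paths are placeholders.
from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file("bert_config.json")
# from_tf=True lets from_pretrained() read a TF checkpoint index file.
model = BertForPreTraining.from_pretrained("bert_model.ckpt.index", from_tf=True, config=config)
model.save_pretrained("finbert-torch/")  # writes pytorch_model.bin and config.json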
@haamis Do you have the uncased pytorch model as well?
@akutuzov http://dl.turkunlp.org/finbert/torch-transformers/ ...and now also distributed as part of huggingface https://github.com/huggingface/transformers/blob/master/src/transformers/configuration_bert.py#L47
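Once registered with transformers, the model can also be pulled by name; a sketch assuming the identifier TurkuNLP/bert-base-finnish-cased-v1 (check the FinBERT README for the exact published name):

# Sketch: load FinBERT via transformers by name. The identifier is an assumption;
# see the FinBERT README for the exact published name.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = AutoModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")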
Done, both archives are now updated in the NLPL repository. This issue can be closed, I guess.
i am wondering whether you could be interested in having the FinBERT model(s) hosted as part of the repository for word embeddings maintained by the NLPL consortium? among other things, this would provide a stable download URL and a well-defined place on the filesystem on various HPC systems (including Puhti and Taito). the NLPL repository focuses on languages of Northern Europe, and we (at oslo) have recently started to ingest contextualized embeddings (using ELMo, in our case) for Norwegian (and a few more languages). i expect @akutuzov would be happy to assist with ingestion into the NLPL repository. for general background, please see:
http://wiki.nlpl.eu/index.php/Vectors/home