HazyResearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail
http://hazyresearch.stanford.edu/bootleg
Apache License 2.0

Feasibility of adding new types to disambiguate #33

Closed NLCas8 closed 3 years ago

NLCas8 commented 3 years ago

Hi there,

I read your paper on Bootleg, and I must say I was quite impressed with the results you managed to achieve on Named Entity Disambiguation to Wikidata terms.

I'm just trying to understand how it all ties together, and I have a few questions I was hoping to get answered. While looking for which entity types are supported for NED, I ran into this file: data/sample_emb_data/type_vocab.json. Are these all the types that are supported for NED?

The reason for asking is that I was considering using Bootleg to recognize and disambiguate all entities that are subclasses/instances of computer science terms or technical terms in running text. These two types, however, are not in the vocabulary. Do you think it would be feasible to extend the vocabulary with these kinds of types? Would this require additional training?

Thank you for your time.

Kind regards, Cas

lorr1 commented 3 years ago

Hello Cas,

So that file is just a small sample of our types to help people understand the format. If you download our full embedding data here, you can see all of the types we use. We do have Q66747126 and Q12812139 as types in our system. There should be a README in the folder that explains the files.
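
For reference, here's a minimal sketch of how you could inspect the type vocabulary once downloaded. It assumes type_vocab.json is a flat JSON object mapping type identifiers to integer IDs; the README in the folder has the authoritative schema:

```python
import json

# Hedged sketch: assumes type_vocab.json is a flat JSON object mapping
# type identifiers (Wikidata QIDs) to integer type IDs; see the README
# shipped with the embedding data for the exact schema.
with open("type_vocab.json") as f:
    type_vocab = json.load(f)

print(f"{len(type_vocab)} types in vocabulary")
for qid in ("Q66747126", "Q12812139"):  # the two types mentioned above
    print(qid, "present" if qid in type_vocab else "missing")
```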

In terms of extending to new types, we can certainly add new types to the mapping, but we'd have to retrain the type embeddings themselves.

NLCas8 commented 3 years ago

Thank you for your quick response, much appreciated!

Those are a lot more types than I expected. I was wondering about the disk, CPU, and GPU requirements: would they be much lower when limiting the number of different entity types to disambiguate, say 10-100 instead of over 23,000? Or would it not make much of a difference? My thinking is that the number of entities in the knowledge graph to disambiguate against could then be significantly reduced, and hence less powerful hardware might be needed. I'm guessing things like irrelevant entity embeddings would need to be filtered out in that case.
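
To make the reasoning concrete, here's a rough back-of-the-envelope sketch; all numbers, shapes, and names are made up for illustration and are not Bootleg's actual storage layout:

```python
import numpy as np

# Illustrative scaling argument: an entity embedding matrix holds
# (num_entities x dim) float32 values, so keeping only the rows for a
# relevant entity subset shrinks disk and memory roughly linearly.
num_entities, dim = 5_000_000, 256      # assumed full-Wikipedia scale
kept = 200_000                          # entities left after type filtering
full_gb = num_entities * dim * 4 / 1e9  # 4 bytes per float32
kept_gb = kept * dim * 4 / 1e9
print(f"full: {full_gb:.1f} GB -> filtered: {kept_gb:.2f} GB")

# The filtering itself would just be a row selection over the matrix
# (shown on a small stand-in array so the example runs quickly):
emb = np.random.rand(1_000, dim).astype(np.float32)
keep_rows = np.array([10, 42, 137])     # IDs of the entities to keep
filtered = emb[keep_rows]               # (3, dim) subset
```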

Another question relates to the fact that most of the example input texts are questions written in natural language. How do you think the disambiguation would perform on input texts with less context? For example, a sentence listing somewhat related terms like:

SQL, Python, R, Microsoft Excel and SPSS are examples of tools.

lorr1 commented 3 years ago

Hello Cas,

So I guess it depends on what your goal is with limiting the types. If, for example, there is only a subset of entities you care about (e.g., books), then we can do some entity subselection. This would reduce the size of the model (close to the Mini model sizes) and also shrink some of the data files, reducing the disk space and CPU requirements. It's actually an interesting question how the model would perform when trained over all entities and then given only a smaller population, such as books. I'd love to try it out and see what happens. Can you explain more about your use case? I'd be happy to get a filter script ready for you so you can try a few different things. Would you want to filter only on types, or perhaps on entity titles or other features?
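
To give a feel for it, here is a minimal sketch of the kind of predicate such a filter script could apply; the function and field names are illustrative, not an existing Bootleg API:

```python
# Illustrative filter predicate: keep an entity if any of its types is in
# an allow-list, or if its title matches a user-supplied predicate. The
# dict layout here is hypothetical, not Bootleg's on-disk format.
def keep_entity(entity, allowed_types, title_predicate=None):
    if allowed_types and set(entity["types"]) & allowed_types:
        return True
    if title_predicate is not None and title_predicate(entity["title"]):
        return True
    return False

allowed = {"Q571"}  # Q571 is Wikidata's 'book' item
entities = [
    {"title": "Moby-Dick", "types": ["Q571"]},
    {"title": "Python", "types": ["Q9143"]},  # Q9143: programming language
]
subset = [e for e in entities if keep_entity(e, allowed)]
print([e["title"] for e in subset])  # ['Moby-Dick']
```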

Our model should do well on short sentences. We trained it over Wikipedia sentences, so it's used to short contexts. The example data file in the tutorial is from the Natural Questions dataset, so it is all Google search questions, but it certainly doesn't have to be. Your example is a great one for type information being useful. If our model can pick up that SQL is a programming construct/language, it can use that for the other mentions.
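
As a hedged sketch of what running that sentence could look like (the import path and output keys follow our end-to-end tutorial and may differ by version, so treat this as illustrative):

```python
# Assumes the BootlegAnnotator interface from the end-to-end tutorial;
# exact import path, defaults, and output keys may differ by version.
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()  # loads the pretrained model and metadata
preds = ann.label_mentions(
    "SQL, Python, R, Microsoft Excel and SPSS are examples of tools."
)
print(preds)  # expected: detected mentions with their QIDs and titles
```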

NLCas8 commented 3 years ago

Hi,

Maybe I missed it, but I was actually not able to find the requirements for the Mini model. I did find them for the Default model: 130 GB of disk space, 12 GB of GPU memory, and 40 GB of CPU memory, as listed in install.rst and quickstart.rst.

My specific use case is recognizing and disambiguating all mentions in a text that are somehow related to IT technology, in the broad sense of the word. That would mean all entities that are subclasses of, e.g., software, and thus also software suite > office suite > Microsoft Office. I was planning to use a dataset of raw IT job descriptions for fine-tuning and testing. To generalize: if it were possible to select which Wikidata types you want to retain, including all their subtypes, that would be awesome, I think. That way you would not have to give up the performance of the Default model by switching to the Mini model, while still being able to disambiguate all relevant types for your subdomain.
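
For illustration, here is a rough sketch of how such a subtype closure could be pulled from Wikidata's public SPARQL endpoint; Q7397 is Wikidata's 'software' item, and wdt:P279* walks the subclass-of relation transitively (the closure can be large, so the query may need batching or a timeout in practice):

```python
import requests

# Collect Q7397 ('software') plus all of its transitive subclasses via
# Wikidata's public SPARQL endpoint; the result set could then seed the
# list of types to retain when filtering the model.
query = "SELECT ?cls WHERE { ?cls wdt:P279* wd:Q7397 . }"
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "bootleg-type-filter-sketch/0.1"},
)
resp.raise_for_status()
subtypes = {
    b["cls"]["value"].rsplit("/", 1)[-1]  # strip the URI prefix, keep the QID
    for b in resp.json()["results"]["bindings"]
}
print(len(subtypes), "software subclasses found")
```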

If you need help with the script, please let me know, I am willing to help with this. I am just not quite familiar yet with the codebase.

lorr1 commented 3 years ago

Hey Cas,

We haven't put the requirements of the Mini model in our instructions yet. That's a great idea. We'll add that for our next release.

I'll get to work on the filtering script. Ironically, I need it for something else I'm working on. I'll get a version 1 working so that we can reduce the type and entity set for a model (possibly with adding new types/entities, too). We'll likely need to iterate a bit on the best interface to make the script flexible and easy for you to use. I'll ping you when it's ready for testing.

NLCas8 commented 3 years ago

Hi,

I would be happy to help with testing!

Something else that came to mind just now: have you considered hosting Bootleg on Huggingface.co? It may be a nice way for people to find out about Bootleg and what it is capable of :)

NLCas8 commented 3 years ago

Hi there,

Just wondering, how is progress going? If there is something I can do to support you, please let me know!

lorr1 commented 3 years ago

Hey Cas,

Thanks for checking in. I have been working on a Bootleg API that should allow for modifying metadata (e.g., type mappings) and adding/removing entities. I have a V1 notebook ready that has some of the core features implemented (I should have entity removal ready in about 24 hours). I'd appreciate your feedback on it: what you think, what features are missing, what the pain points are, etc. Would you be willing to take a look? If so, perhaps I can send it to you via email along with instructions for which branch to be on, etc.?

NLCas8 commented 3 years ago

Hi,

That's great to hear! Sure, I'd be happy to take a look and do some initial testing. Once it's ready, you may send it to redacted.

lorr1 commented 3 years ago

Added full API support in #48