HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Using candidates for prediction (Fonduer Prediction Pipeline) #263

Closed atulgupta9 closed 5 years ago

atulgupta9 commented 5 years ago

Scenario:

For my use case I have a set of financial documents.

The entire document set is divided into train, dev, and test. The documents are parsed, and the mentions and candidates are extracted with some rules.

The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, as per the normal Fonduer pipeline demonstrated in the hardware tutorial.

Problems & Questions

  1. Is the Fonduer prediction pipeline production-ready? How can we fine-tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

With my initial analysis and usage following the hardware tutorial, I could not obtain good results.

  2. Can we separate the training and test pipelines?

In the current scenario, with a new document that I feed in for prediction, the entire corpus will have to be parsed to extract the mentions and candidates and store the feature keys.

Please correct me if that is not the case, and help me with a snippet showcasing the separation.

HiromuHota commented 5 years ago

I'd leave the 1st question to @SenWu or @lukehsiao.

For the 2nd question, technically yes. We can separate the training and test pipelines, but it is not straightforward to do so as of today, as I described in #259.

The entire corpus will have to be parsed to extract the mentions and candidates and store the feature keys.

No, you don't have to parse the entire corpus again in the test phase, assuming the entire corpus includes train, dev, and test. To do so, use the latest (not yet released) commit of Fonduer, which includes #258, and follow the instructions below.

In the training phase, save the feature keys in a file like below:

import pickle

from fonduer.features import Featurizer

featurizer = Featurizer(session, candidate_classes)
featurizer.apply(split=0, train=True, parallelism=PARALLEL)

# Persist the feature keys generated during training so the test phase can reuse them.
key_names = [key.name for key in featurizer.get_keys()]
with open('feature_keys.pkl', 'wb') as f:
    pickle.dump(key_names, f)

In the test phase, load the feature keys from the file to the database as below:

import pickle

featurizer = Featurizer(session, candidate_classes)

# Restore the training-time feature keys into this session's database.
with open('feature_keys.pkl', 'rb') as f:
    key_names = pickle.load(f)
featurizer.drop_keys(key_names)
featurizer.upsert_keys(key_names)
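
A hedged sketch of how the restored keys would then be used on the test split (the split number and variable names are assumptions, following the hardware tutorial layout):

# Featurize the test split without train=True so no new keys are generated,
# then pull feature matrices whose columns line up with the restored training keys.
featurizer.apply(split=2, parallelism=PARALLEL)
F_test = featurizer.get_feature_matrices(test_cands)
print("Test feature matrix shape: {}".format(F_test[0].shape))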

You may have more questions about the 2nd one. I'm happy to help you as much as I can.

atulgupta9 commented 5 years ago

Thanks @HiromuHota. I will definitely try this out. I do have more queries on the prediction capabilities of Fonduer. Please help me get some clarity on these.

  1. How well does Fonduer fare when we have documents that differ structurally?

  2. Is the main role of the model to predict whether the candidate is a true candidate or not?

  3. How can we scale this to multiple candidates? Does that mean creating separate models for each of those?

senwu commented 5 years ago

Hi @atulgupta9,

Thanks for your interest in our research!

Let me try to answer your questions; I hope it helps you understand Fonduer better.

Is the fonduer prediction pipeline production ready? How can we fine tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

We currently provide two prediction models in Fonduer (logistic regression and LSTM). You can tune these two models fairly easily, and we provide a simple interface for that (see here). Of course, you can also customize your own prediction model based on your needs (for example, if you want to use BERT to extract features, here are some references).
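
For illustration, a hedged sketch of what such tuning might look like through the train() interface (variable names are placeholders; the call signature matches the one used later in this thread):

from fonduer.learning import SparseLogisticRegression

# Sweep a couple of hyperparameters and keep whichever model scores best on the dev split.
for lr in (1e-2, 1e-3, 1e-4):
    model = SparseLogisticRegression()  # or LSTM()
    model.train((train_cands[0], F_train[0]), train_marginals, n_epochs=200, lr=lr)
    # ...score the dev candidates here and keep the best-performing model...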

In order to achieve better quality, there are many factors you might want to consider: (1) Do you have good candidates? E.g., do the candidates cover most of the ground truth (in other words, good recall)? (2) Do you have good labels for the training candidates (do those labels correlate well with the ground truth)? High-quality label data is very important for learning a good prediction model. (3) Is the class balance good? Usually, we expect the data to have a good class balance, e.g., 50% positive and 50% negative. (4) Is the model a good fit for your data? This depends heavily on your problem. Our current model tries to incorporate multi-modality features into the learning model to capture more signals. From our experience and the feedback from our collaborators, those signals are quite useful when extracting information from richly formatted data.

We recently published another paper which shares some best practices; you can find it here.

How well does Fonduer fare when we have documents that differ structurally?

One of the design goals of Fonduer is to address the data variety in richly formatted data. To address that, we propose a unified data model to unify data with different structures (like HTML). We keep improving our data model to handle more cases (we now support plain text, CSV, TSV, HTML, PDF, etc.). Please let us know if you have any suggestions or cases where it doesn't work.

Is the main role of the model to predict whether the candidate is a true candidate or not?

Yes, the prediction model tries to predict the true candidates based on your label data and the useful signals in the data.

How can we scale this to multiple candidates, does that mean creating separate models for each of those?

I assume you mean multiple relations (correct me if I am wrong). Right now, Fonduer supports candidate extraction for multiple relations, but the learning part only supports a single relation. We will soon release a new machine learning framework to support learning multiple relations simultaneously.
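
For context, the extraction side already accepts several candidate classes at once; a minimal sketch (all relation, mention, and throttler names here are hypothetical placeholders):

from fonduer.candidates import CandidateExtractor
from fonduer.candidates.models import candidate_subclass

# Two relations defined over previously extracted mention classes (names hypothetical).
rel_a = candidate_subclass("rel_a", [mention_a])
rel_b = candidate_subclass("rel_b", [mention_b, mention_c])

# One extractor populates candidates for both relations; learning is still run per relation.
candidate_extractor = CandidateExtractor(session, [rel_a, rel_b], throttlers=[throttler_a, throttler_b])
candidate_extractor.apply(train_docs, split=0, parallelism=PARALLEL)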

atulgupta9 commented 5 years ago

Thanks @SenWu for a speedy response. It will surely help me proceed forward.

Can we please keep this thread open? I will try to post some snippets to let you understand the use case we are trying to tackle with Fonduer and get your inputs on our approach.

atulgupta9 commented 5 years ago

Use Case Discussion

We have sourced financial documents from the internet; these documents have no definite structure or repeating pattern. The basic need is to extract and build details about the organization (CEO, Board of Directors, Revenue Generated, Net Income/Loss, etc.). The required information may be spread across the document or may be confined to a few pages which we cannot pinpoint.

We will be confining this discussion to the extraction and prediction of revenue terms. Below are some snaps of the documents we are handling; each snap is from a different document. [document snapshots omitted]

Although the above images reflect a specific structure, we have about 299 documents, and many of them contain the information in images or running text rather than tables.

The entire set is divided as follows:

Train Set: 180 docs
Dev Set: 60 docs
Test Set: 59 docs

Code

Extracting the mentions

For extracting the revenue mentions, we have captured revenue-related tags in a CSV and are using them as follows:

from fonduer.candidates import MentionExtractor, MentionSentences
from fonduer.candidates.matchers import LambdaFunctionMatcher
from fonduer.candidates.models import mention_subclass

revenue_mention = mention_subclass("revenue_mention")

# Revenue-related tags collected in a CSV (loaded into df beforehand).
revenue_tags = set(df['revenue tag'].str.lower().tolist())

def filter_revenue_with_tags(mention):
    # Keep a sentence mention if its text contains any of the revenue tags.
    for tag in revenue_tags:
        if str(tag).lower().strip() in str(mention.sentence.text).lower():
            return True
    return False

filter1 = LambdaFunctionMatcher(func=filter_revenue_with_tags, longest_match_only=True)
sentence_mentions = MentionSentences()
mention_extractor = MentionExtractor(session, [revenue_mention], [sentence_mentions], [filter1])
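
For completeness, the extractor is then applied to the parsed corpus before counting mentions; a minimal sketch following the standard Fonduer pattern (docs is assumed to be the full list of parsed documents):

# Mentions are extracted over the whole corpus; splits only come into play for candidates.
mention_extractor.apply(docs, parallelism=PARALLEL)
print("Mentions: {}".format(session.query(revenue_mention).count()))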

After extraction we got: Mentions: 293205

Getting the candidates

For candidate extraction, we have further refined those tags and limited the candidates using throttlers.

from fonduer.candidates import CandidateExtractor
from fonduer.candidates.models import candidate_subclass

revenue_cand = candidate_subclass("revenue_cand", [revenue_mention])

def filter_revenue_with_keyword(c):
    # Throttler: keep a candidate only if its span contains a revenue keyword and a $ amount.
    keywords = ['revenues for the fiscal year', 'revenue for the fiscal year', 'revenue for the year', 'revenues for the year', 'net earning', 'net loss', 'revenue of', 'revenues of', 'operating revenue', 'net sale', 'record revenue', 'gross revenue', 'net revenue', 'revenue increased', 'revenues increased', 'service revenue', 'present value', 'sale revenue', 'sales revenue', 'total oil and gas sale', 'consolidated revenue', 'cost of revenue', 'sale']
    span_text = str(c[0][0].get_span()).lower()
    for keyword in keywords:
        if keyword in span_text and '$' in span_text:
            return True
    return False

candidate_extractor_revenue = CandidateExtractor(session, [revenue_cand], throttlers=[filter_revenue_with_keyword])

for i, docs in enumerate([train_docs, dev_docs, test_docs]):
    candidate_extractor_revenue.apply(docs, split=i, parallelism=PARALLEL)
    print("Number of Candidates in split={}: {}".format(i, session.query(revenue_cand).filter(revenue_cand.split == i).count()))

train_cands_rev = candidate_extractor_revenue.get_candidates(split=0)
dev_cands_rev = candidate_extractor_revenue.get_candidates(split=1)
test_cands_rev = candidate_extractor_revenue.get_candidates(split=2)

So basically we are looking for sentences that contain those keywords and a $ amount. The numbers of candidates obtained were as follows:

Train Candidates : 2887
Dev Candidates: 974
Test Candidates: 1005

Featurizer phase

from fonduer.features import Featurizer

featurizer_rev = Featurizer(session, [revenue_cand])

%time featurizer_rev.apply(split=0, train=True, parallelism=PARALLEL)
%time F_train_cands_rev = featurizer_rev.get_feature_matrices(train_cands_rev)
print("Train Candidates shape: {}".format(F_train_cands_rev[0].shape))

%time featurizer_rev.apply(split=1, parallelism=PARALLEL)
%time F_dev_cands_rev = featurizer_rev.get_feature_matrices(dev_cands_rev)
print("Dev Candidates shape: {}".format(F_dev_cands_rev[0].shape))

%time featurizer_rev.apply(split=2, parallelism=PARALLEL)
%time F_test_cands_rev = featurizer_rev.get_feature_matrices(test_cands_rev)
print("Test Candidates shape: {}".format(F_test_cands_rev[0].shape))

We got 58531 features for these candidates.

Gold labels generation

We created our own function to load and store the gold labels, as directed in the hardware tutorial. Does this really have any significance in the entire pipeline?

load_section_heading_gold_labels(session, [revenue_cand], gold_file, annotator_name='gold')

Candidate Labelling

In order to label such varied data, we decided to do this process manually. So the candidate sentences were taken out and manually labelled. At the end of this we had a dictionary of sentences, which we used to mark the true and false candidates.

# Label values assumed to follow the MeTaL convention from the hardware tutorial:
# 0 = abstain, 1 = negative (FALSE), 2 = positive (TRUE).
ABSTAIN, FALSE, TRUE = 0, 1, 2

def is_revenue(c):
    # Vote using the manually built dictionary of candidate spans -> True/False.
    span = c.get_mentions()[0][0].get_span()
    if span in labelling_dict:
        return TRUE if labelling_dict.get(span) else FALSE
    return ABSTAIN

We noticed most of the files contained two or three true candidates, and most candidates were false. There is a huge disparity in their numbers. Should this pose a problem? Is abstaining from voting a solution, as per the 'Automating the Generation of Hardware Component Knowledge Bases' research paper? What criteria could be used for abstaining from voting (everything is manually labelled here)?

Then we ran the generative model to get the train marginals, though I doubt this is required here since we never abstain from voting. I am really unclear why this step is needed, but since the learning model needed the marginals, we used it.

from fonduer.supervision import Labeler
from metal import analysis  # lf_summary comes from snorkel-metal (import path assumed)
from metal.label_model import LabelModel  # generative label model (import path assumed)

labeler_rev = Labeler(session, [revenue_cand])

%time labeler_rev.apply(split=0, lfs=[[is_revenue]], train=True, parallelism=PARALLEL)
%time L_train_rev = labeler_rev.get_label_matrices(train_cands_rev)

# Compare the weak labels against the loaded gold labels.
L_gold_train_rev = labeler_rev.get_gold_labels(train_cands_rev, annotator='gold')
analysis.lf_summary(L_train_rev[0], lf_names=labeler_rev.get_keys(), Y=L_gold_train_rev[0].todense().reshape(-1,).tolist()[0])

# Train the generative model and compute per-candidate marginals.
gen_model = LabelModel(k=2)
%time gen_model.train_model(L_train_rev[0], n_epochs=300, print_every=100)
train_marginals_rev = gen_model.predict_proba(L_train_rev[0])
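
Since every candidate here carries a hard manual label, the marginals coming out of the generative model should sit close to 0 or 1; a quick numpy-only check (the column order is assumed to follow the MeTaL convention of classes 1..k):

import numpy as np

pos_prob = train_marginals_rev[:, TRUE - 1]  # probability of the positive class
print("mean P(TRUE): {:.3f}".format(pos_prob.mean()))
print("candidates with P(TRUE) > 0.5: {}".format(int((pos_prob > 0.5).sum())))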

Learning Phase

The model used was Sparse Logistic Regression.

from fonduer.learning import SparseLogisticRegression  # LSTM is also available here

disc_model = SparseLogisticRegression()
# disc_model = LSTM()
%time disc_model.train((train_cands_rev[0], F_train_cands_rev[0]), train_marginals_rev, n_epochs=100, lr=0.001)
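
A hedged sketch of the prediction step that would follow, assuming the predict() interface from the hardware tutorial (the b threshold and pos_label arguments are assumptions about that API):

import numpy as np

# Score the test candidates with the trained model and keep those predicted TRUE.
test_score = disc_model.predict((test_cands_rev[0], F_test_cands_rev[0]), b=0.5, pos_label=TRUE)
true_preds = [test_cands_rev[0][i] for i in np.where(test_score == TRUE)[0]]
print("Predicted true revenue candidates: {}".format(len(true_preds)))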

The results were not very impressive. From our first attempt, with very few candidates: revenue_pred_299_docs.pdf

In our second attempt,

Expectations & Queries

  1. Can such a use case be handled through Fonduer?
  2. What could be the bottleneck? Is the number of documents sufficient for what we want?
  3. Please suggest anything you would like us to change in our approach.
  4. Please answer the queries raised within each phase.

Sorry for such a long write-up. Please take your time to go through this and help me. Thanks.

atulgupta9 commented 5 years ago

@SenWu @HiromuHota, can we use the generated feature size as an estimate of how the model will behave? If so, what is an appropriate range within which we can say the predictions will be good?

senwu commented 5 years ago

Hi @atulgupta9,

Thanks for the description of your use case, and I am sorry for the late response (a lot was due this week).

Your problem is a very good example/use case for Fonduer. Let me try to answer your questions here:

We noticed most of the files contained two or three true candidates and most were false candidates. There is a huge disparity in their no. Should this pose a problem?

No, this is not a problem; it is actually a very common case when extracting knowledge from richly formatted data. One general thing you should care about is how you generate your candidates, since it's super easy to generate many negative candidates and miss positive ones.

Is abstaining from voting a solution as per 'Automating the Generation of Hardware Component Knowledge Bases' research paper? What criteria could be used for abstaining from voting (All are manually labeled here)?

This is a good question and very important for users who want to provide weak supervision in our framework. Abstaining means the rule/pattern is not applicable to this specific candidate, i.e., it is not an indicator either way, so you don't want it to vote (the system will ignore this candidate when evaluating this rule/pattern).

Can such a use case be handled through fonduer?

Your problem is a very good example/use case for Fonduer.

What could be the bottleneck?

As I mentioned before, there are several parts you want to consider: (1) How do you generate the mentions/candidates? Your first attempt at mention/candidate extraction is good, and I think a second attempt can improve it a lot by adding missing ones and filtering out mistakes. (2) How do you label them? By checking your weak labels, you get an estimate of the quality of your labels. (3) Can the model capture the signals? In our applications, the model incorporates a lot of signals from different modalities and is a good baseline to have. You might want to add more features for your application to improve it, and Fonduer supports that as well.

Is the number of documents sufficient to cater to what we want?

Yes, this is a pretty good example and starting point. It will be more powerful if you can add more documents and let the model learn more.

Please suggest, if there is anything you would like us to change in our approach.

I think for each phase, you can do a sanity check to make sure it matches your expectations.
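
As a concrete example of such a sanity check, a small sketch using only calls that already appear earlier in this thread (candidate counts per split plus the share of positives under the manual labelling; names follow the code above):

# Candidate counts per split (0 = train, 1 = dev, 2 = test).
for split in (0, 1, 2):
    n = session.query(revenue_cand).filter(revenue_cand.split == split).count()
    print("split {}: {} candidates".format(split, n))

# Fraction of training candidates the manual labelling marks as TRUE (class balance check).
votes = [is_revenue(c) for c in train_cands_rev[0]]
print("positive fraction: {:.3f}".format(votes.count(TRUE) / len(votes)))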

Can we use the generated feature size as an estimate to predict how the model would behave? If so, what is an appropriate range within which we can state that the predictions will be good?

This is a good question, but I don't have a good answer for it. Fonduer generates all the multi-modality features from the documents based on its feature library. One thing you can check is whether the generated features are good indicators or not.

Sen

atulgupta9 commented 5 years ago

Hi @SenWu

Thanks for all the help. Just want to have your quick thought on this.

I am using two sessions (initialized using Fonduer Meta): one for training, the other for prediction. All the processes in the training session run really quickly, but using the other session for prediction takes a considerable amount of time even to process one file. Can this be resolved?

Should the previous sessions be terminated before initiating a new one? I guess it's something to do with SQLAlchemy.

senwu commented 5 years ago

Hi @atulgupta9,

Sorry for the late response! I think that might be a SQLAlchemy issue with inserting info into the database. One potential solution is to reduce the parallelism.

Sen

atulgupta9 commented 5 years ago

Hi @SenWu @HiromuHota

Guys, I am trying to build API endpoints for automatically calling the Fonduer built-in functions.

I have two separate API endpoints: one for mention/candidate extraction and one for training.

These routes may be called any number of times under different projects, so you can assume that every time the API is called, the mentions/candidates and the DB we refer to will be different.

Now, I have these variables:

candidate_<candidate_name> = candidate_subclass("candidate_<candidate_name>", [mention_<mention_name>])

Depending on the number of candidates the user has specified, we will have that many variables.

Now, for training the model, I need these candidates for featurization and labelling. My initial thought was that a simple query to the underlying candidate_ table would work.

But then I realized that I will have to store either the candidate_ variable defined above or the candidate_extractor that was defined to get the candidates.

I tried pickling the candidate_ variable but ended up with an error like this https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed

Can you guys help me find a way to do this? An easy solution I thought of was to keep a dictionary in memory, but we would lose it if we restart the application.

Any help would be appreciated.

Thanks Atul

HiromuHota commented 5 years ago

As I mentioned in https://github.com/HazyResearch/fonduer/issues/259#issuecomment-494463893, it is hard to pickle a dynamically created class (e.g., mention/candidate subclasses in Fonduer). So I ended up carrying around the Python files, where mention/candidate subclasses are defined, rather than pickled files. Hope it helps.
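
For example, a small shared module (file and subclass names here are hypothetical) that both API endpoints import, so the dynamically created classes are always defined the same way instead of being pickled:

# schema.py -- imported by both the extraction and the training endpoints
from fonduer.candidates.models import candidate_subclass, mention_subclass

# Define the mention and candidate subclasses in one place.
mention_revenue = mention_subclass("mention_revenue")
candidate_revenue = candidate_subclass("candidate_revenue", [mention_revenue])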

atulgupta9 commented 5 years ago

Hi @HiromuHota @SenWu

I am getting this error when I try to redefine an existing relation.

ERROR/MainProcess] Candidate subclass candidate_Alphatest_relation_12 already exists in memory with incompatible specification: ([<class 'fonduer.candidates.models.mention.mention_Alphatest15'>, <class 'fonduer.candidates.models.mention.mention_Alphatest16'>], 'candidate__alphatest_relation_12', 2, [True, False])

Is there any solution for this?

HiromuHota commented 5 years ago

Currently, candidate subclasses as well as mention subclasses can only be defined once. If you want to redefine them, you have to restart Python and define them again.

Having said that, the ability to redefine candidate/mention subclasses might be useful, especially during development. @SenWu any comment?

senwu commented 5 years ago

Unfortunately, we don't support modifying candidate subclasses at runtime, but the current system does support creating new candidate subclasses.

yzj19870824 commented 5 years ago

There's no upsert_keys function in version 0.5.0. May I ask what the equivalent function is called?

HiromuHota commented 5 years ago

Featurizer#upsert_keys was added in 0.7.0. Labeler#upsert_keys has also been added but is not released yet. See https://fonduer.readthedocs.io/en/latest/dev/changelog.html