HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Removing several documents from the database by name? #278

Closed lauratolosi closed 5 years ago

lauratolosi commented 5 years ago

I want to test a Fonduer model on new files (PDFs) in a separate pipeline. I need to ensure that a file is not already in the database, because parsing a document that already exists fails with an error. The solution I see is to delete the test documents and all related elements from the database before testing, if they exist.

Can anyone please help with a procedure that deletes documents by name?

I have tried to delete from the document table:

#delete from document where id = 141250;
ERROR:  update or delete on table "document" violates foreign key constraint 
"table_document_id_fkey" on table "table"
DETAIL:  Key (id)=(141250) is still referenced from table "table".

Below is the detailed description of the document table. The foreign keys from tables such as table, caption, paragraph, and sentence are not declared with ON DELETE CASCADE:

#\d+ document
 ...
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_name_key" UNIQUE CONSTRAINT, btree (name)
Foreign-key constraints:
    "document_id_fkey" FOREIGN KEY (id) REFERENCES context(id) ON DELETE CASCADE
Referenced by:
    TABLE "caption" CONSTRAINT "caption_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)
    TABLE "cell" CONSTRAINT "cell_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)
    TABLE "figure" CONSTRAINT "figure_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE CASCADE
    TABLE "model" CONSTRAINT "model_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE CASCADE
    TABLE "model_power" CONSTRAINT "model_power_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE CASCADE
    TABLE "paragraph" CONSTRAINT "paragraph_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)
    TABLE "power" CONSTRAINT "power_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE CASCADE
    TABLE "section" CONSTRAINT "section_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE CASCADE
    TABLE "sentence" CONSTRAINT "sentence_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)
    TABLE ""table"" CONSTRAINT "table_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)

I also tried deleting from the context table, but I get the same conflict.
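
To illustrate why the delete above fails and what a children-first delete would look like, here is a minimal sketch on a toy SQLite schema. It is not Fonduer's real schema and has not been run against a Fonduer database. The helper names `delete_document` and `build_toy_db`, the list of child tables, and the sample rows are all hypothetical. In the real database, `document.id` also references `context(id)`, so further cleanup would likely be needed.

```python
import sqlite3


def delete_document(conn, name, child_tables):
    """Delete one document by name, removing referencing rows first.

    Hypothetical helper: `child_tables` must list every table whose
    document_id foreign key lacks ON DELETE CASCADE.
    """
    row = conn.execute(
        "SELECT id FROM document WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False  # nothing to delete
    doc_id = row[0]
    with conn:  # commit on success, roll back on any error
        for table in child_tables:
            conn.execute(
                f'DELETE FROM "{table}" WHERE document_id = ?', (doc_id,)
            )
        conn.execute("DELETE FROM document WHERE id = ?", (doc_id,))
    return True


def build_toy_db():
    """Create a tiny in-memory stand-in for the schema (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute(
        "CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    for table in ("sentence", "paragraph"):
        conn.execute(
            f'CREATE TABLE "{table}" ('
            "id INTEGER PRIMARY KEY, "
            "document_id INTEGER REFERENCES document(id))"
        )
    conn.execute("INSERT INTO document VALUES (1, 'report')")
    conn.execute("INSERT INTO sentence VALUES (10, 1)")
    conn.execute("INSERT INTO paragraph VALUES (20, 1)")
    conn.commit()
    return conn
```

On the toy database, deleting the `document` row directly raises `sqlite3.IntegrityError`, while `delete_document(conn, "report", ["sentence", "paragraph"])` succeeds because the referencing rows are removed in the same transaction before the parent row.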

lukehsiao commented 5 years ago

Hi @lauratolosi, can you describe your use scenario a little more?

It sounds like you ran the Fonduer pipeline with document set A. Then you wanted to test something with document set B (which are new documents, not overlapping with A)?

I'm not sure I understand the scenario in which you want to evaluate new documents but would not need to go through parsing and everything else again.

We do not currently have a way to delete a document directly. But, I'd assume that what you would want to do is to clean up your datasets (e.g. ensure no duplicates or overlaps in those sets) rather than trying to delete documents from the database just to reparse/add them. You might try checking out fdupes to help.

lauratolosi commented 5 years ago

Hi @lukehsiao, here are the details:

I need to use Fonduer in a real-world application, where a Fonduer trained model is saved on some server and users can upload new documents to get information from them, based on model predictions.

When a new document is uploaded, my understanding is that it needs to be parsed, in order for Mentions and Candidates to be extracted and evaluated by the model.

But a user can run the test many times with the same documents, which results in errors with Fonduer. I myself, when developing the system, was testing many times with the same new documents. I needed a way to remove those documents, handle the "Document already exists" error, and start fresh.

I found a way around this by creating a copy of the original database every time a test on new documents starts. Maybe it is the easiest solution. Or maybe you have a better suggestion?
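
One way the database-copy workaround could be expressed on PostgreSQL is with `CREATE DATABASE ... TEMPLATE ...`. The sketch below only builds the statement. The function names and the database names in the usage note are hypothetical, and actually running the statement (for example through psql or a database driver) is left out.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def copy_database_sql(source: str, target: str) -> str:
    """Build a statement that creates `target` as a copy of `source`.

    Hypothetical helper. PostgreSQL refuses the copy while other sessions
    are connected to the template database, and CREATE DATABASE cannot be
    run inside a transaction block.
    """
    return (
        f"CREATE DATABASE {quote_ident(target)} "
        f"TEMPLATE {quote_ident(source)};"
    )
```

For example, `copy_database_sql("fonduer_base", "fonduer_run_1")` yields `CREATE DATABASE "fonduer_run_1" TEMPLATE "fonduer_base";`. Dropping the scratch database after each test run keeps the original untouched.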

Thanks for looking into this!

senwu commented 5 years ago

Hi @lauratolosi,

There is an easy solution. Fonduer requires document names to be unique (the document_name_key constraint in your output), so a document is identified by its name. There are two options:

  1. Every time you want to test on a document, give it a temporary, unique document name and parse it under that name.
  2. You can use a hash of the document's contents to check whether you've already parsed it. If yes, don't parse the new one and return the existing one. If no, create a temporary name for the document and parse it.
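
The second option can be sketched as follows. This is only an illustration: `ParsedDocumentRegistry`, `content_hash`, the in-memory dictionary, and the name format are assumptions, not part of Fonduer. How the chosen name reaches Fonduer (for instance, by saving the upload under that file name before parsing) is also left as an assumption.

```python
import hashlib
import uuid


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class ParsedDocumentRegistry:
    """Maps content hashes to the document names already parsed.

    Hypothetical helper: a real deployment would persist this mapping
    (e.g. in its own table) rather than keep it in memory.
    """

    def __init__(self):
        self._by_hash = {}

    def name_for(self, data: bytes, original_name: str):
        """Return (document_name, needs_parsing) for an uploaded file."""
        digest = content_hash(data)
        if digest in self._by_hash:
            # Same bytes seen before: reuse the existing document.
            return self._by_hash[digest], False
        # New content: derive a unique temporary name to avoid name clashes.
        doc_name = f"{original_name}-{uuid.uuid4().hex[:8]}"
        self._by_hash[digest] = doc_name
        return doc_name, True
```

With this, uploading the same bytes twice returns the same name and `needs_parsing=False` the second time, while a changed file with the same original name gets a fresh name and is parsed again.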

Hope this can help you!

Sen

senwu commented 5 years ago

Closing for now. Please reopen if it's still a problem.