TypeError: expected string or buffer

shivamani-ans commented 6 years ago

Hi, I have uploaded 7 documents and one of the documents is like below. "The American Civil War was fought in the United States from 1861 to 1865. The result of a long-standing controversy over slavery, war broke out in April 1861, when Confederates attacked Fort Sumter in South Carolina, shortly after President Abraham Lincoln was inaugurated. The nationalists of the Union proclaimed loyalty to the U.S. Constitution. They faced secessionists of the Confederate States of America, who advocated for states' rights to expand slavery.

Among the 34 U.S. states in February 1861, seven Southern slave states individually declared their secession from the U.S. to form the Confederate States of America, or the South. The Confederacy grew to include eleven slave states. The Confederacy was never diplomatically recognized by the United States government, nor was it recognized by any foreign country (although Britain and France granted it belligerent status). The states that remained loyal, including the border states where slavery was legal, were known as the Union or the North. The North and South quickly raised volunteer and conscription armies that fought mostly in the South over four years. The Union finally won the war when General Robert E. Lee surrendered to General Ulysses S. Grant at the Battle of Appomattox Court House followed by a series of surrenders by Confederate generals throughout the southern states. Four years of intense combat left 620,000 to 750,000 soldiers dead, a higher number than the number of American military deaths in all other wars combined. Much of the South's infrastructure was destroyed, especially the transportation systems, railroads, mills and houses. The Confederacy collapsed, slavery was abolished, and 4 million slaves were freed. The Reconstruction Era (1863–1877) overlapped and followed the war, with the process of restoring national unity, strengthening the national government, and granting civil rights to freed slaves throughout the country. The Civil War is the most studied and written about episode in American history.

In the 1860 presidential election, Republicans, led by Abraham Lincoln, supported banning slavery in all the U.S. territories. The Southern states viewed this as a violation of their constitutional rights and as the first step in a grander Republican plan to eventually abolish slavery. The three pro-Union candidates together received an overwhelming 82% majority of the votes cast nationally: Republican Lincoln's votes centered in the north, Democrat Stephen A. Douglas' votes were distributed nationally and Constitutional Unionist John Bell's votes centered in Tennessee, Kentucky, and Virginia. The Republican Party, dominant in the North, secured a plurality of the popular votes and a majority of the electoral votes nationally, so Lincoln was constitutionally elected president. He was the first Republican Party candidate to win the presidency. However, before his inauguration, seven slave states with cotton-based economies declared secession and formed the Confederacy. The first six to declare secession had the highest proportions of slaves in their populations, a total of 49 percent. The first seven with state legislatures to resolve for secession included split majorities for unionists Douglas and Bell in Georgia with 51% and Louisiana with 55%. Alabama had voted 46% for those unionists, Mississippi with 40%, Florida with 38%, Texas with 25%, and South Carolina cast Electoral College votes without a popular vote for president. Of these, only Texas held a referendum on secession." copied from wiki - American Civil War. I have built TFIDF successfully but while querying using python3.6 scripts/pipeline/interactive.py --retriever-model /home/shiva/DrQA/data/sample-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz

getting below error

process("Ulysses S. Grant") 01/08/2018 08:46:58 PM: [ Processing 1 queries... ] 01/08/2018 08:46:58 PM: [ Retrieving top 5 docs... ] Traceback (most recent call last): File "", line 1, in File "scripts/pipeline/interactive.py", line 81, in process question, candidates, top_n, n_docs, return_context=True File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 184, in process top_n, n_docs, return_context File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 217, in process_batch for split in splits: File "/home/shiva/DrQA/drqa/pipeline/drqa.py", line 147, in _split_doc for split in regex.split(r'\n+', doc): File "/usr/local/lib/python3.6/site-packages/regex.py", line 319, in split return _compile(pattern, flags, kwargs).split(string, maxsplit, concurrent) TypeError: expected string or buffer

could you please suggest whether i missed any thing or any limitation in code.

ajfisch commented 6 years ago

You'll also need to supply the --doc-db path to the db you built the tfidf model on. The error messages here are very bad, but what I believe is happening is that your TFIDF model returned a doc id that doesn't exist in the database, which by default is Wikipedia (indexed by page titles).

shivamani-ans commented 6 years ago

Hi Adam, Thank you for your reply.

I have mentioned --doc-db path while building tfidf model and related document is existed in database (here i am using custom database sample.db which consists of 7 documents).

I am getting issue for only above mentioned document remaining document querying are working perfectly.

Do i need to do any pre-processing ?

ajfisch commented 6 years ago

Sorry, just to confirm:

scripts/pipeline/interactive.py --retriever-model /home/shiva/DrQA/data/sample-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz --doc-db sample.db

is what you are running? (Your original paste is missing the --doc-db parameter for the interactive pipeline).

ajfisch commented 6 years ago

Another thing to check is if the doc id returned by the interactive retriever matches the one in the db (you can open the db with sqlite3 and select * from documents where id = <insert id>.

shivamani-ans commented 6 years ago

I forgot to inform my input is .docx file and getting data with below code def getText(filename): doc = docx.Document(filename) fullText = [] for para in doc.paragraphs: fullText.append(para.text) return '\n'.join(fullText)

and inserting to doc['text'] and filename in fullpath to doc['id']

I have verified doc id returned from interactive retriever python3.6 scripts/retriever/interactive.py --model /home/shiva/DrQA/data/sample-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz and verified it with sqlite3 database documents table record it is matching.

shivamani-ans commented 6 years ago

Hi Adam,

python3.6 scripts/pipeline/interactive.py --retriever-model /home/shiva/DrQA/data/sample-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz --doc-db /home/shiva/sample.db --tokenizer corenlp

is working for me thank you

But some times results are weird e.g. i have given process("Ulysses S. Grant") but retrieved results are related to "Franklin" that is Meanwhile, Sherman maneuvered from Chattanooga to Atlanta, defeating Confederate Generals Joseph E. Johnston and John Bell Hood along the way. The fall of Atlanta on September 2, 1864, guaranteed the reelection of Lincoln as president. Hood left the Atlanta area to swing around and menace Sherman's supply lines and invade Tennessee in the Franklin-Nashville Campaign. Union Maj. Gen. John Schofield defeated Hood at the Battle of Franklin, and George H. Thomas dealt Hood a massive defeat at the Battle of Nashville, effectively destroying Hood's army.

How we need to train reader to get better results ? Do we need to build question answer like SQUAD ? Can give me suggestion or direction to get better results for particular document question answering.

Thank you

ajfisch commented 6 years ago

The retriever works on the document level, and processes all paragraphs of the document regardless of whether they have a TF-IDF score with the query. Is that paragraph part of a larger document about Grant perhaps?

It is also trained for more specific factoid questions. Like, "Who surrendered to Ulysses S. Grant?".

shivamani-ans commented 6 years ago

Yes correct, paragraph part of a larger document is about Grant that is for sure.

I would more likely to know about natural language processing there how it is understanding intent from entities mentioned in Question ? Is it really happening there ?

When we compare DrQA solution to IBM Watson Discovery what differences we can find in DrQA solution? I am glad that DrQA answers factoid question on training where it is not available in Watson Discovery but how about understanding intent from entities using NLP ?

Could please explain if possible ?

ajfisch commented 6 years ago

DrQA isn't trained for intent classification, and is not meant to be as general a package as say, IBM Watson Discovery. The provided pre-trained models are intended for natural language style factoid questions.

The main idea behind DrQA (as noted in the README and paper) is trying to approach the problem of "Machine Reading at Scale" --- that is, combining fine-grained neural reading comprehension methods with information retrieval in order to leverage large corpora.

shivamani-ans commented 6 years ago

Hi Adam,

Sorry for late reply. Thank you for your response. I have one last question I have inserted a document related to chemistry text book (atomic structure) if have process any query it is providing me result from the document even though i didn't trained document reader.

Can you please explain necessity of training document reader with pre-defined question and answer in json format, i mean will there any difference with/with out training document reader.

thank you

ajfisch commented 6 years ago

If you downloaded the pre-trained models we provided, then DrQA will use the default reader model trained on the SQuAD, WebQuestion, WikiMovies, and CuratedTREC datasets. This model might be ok. A reason to provide your own training data and fine tune (or re-train from scratch) a reader model would be if your target data (both wording of the documents and the questions) is from a distinctly different domain. In this situation, the pre-trained model might not transfer that well.

You can see more related discussion in section 5.3 of the paper accompanying this repository -- namely the benefits we saw from using distant supervision to improve the generalization of the model to more QA domains (i.e {SQuAD} --> {SQuAD, WikiMovies, WebQuestion, CuratedTREC}).

shivamani-ans commented 6 years ago

Hi Adam,

from above explanation i understood that different domain questions will not be answered properly we are not trained with that different domain questions of a document.

Is my understanding is correct.

Can you tell me is it mandatory to train existing model or build model from scratch with specific domain questions. Because actually what we are trying to do is upload a document i.e. human brain and ask a question and we have not trained with human brain questions but it is trained with some other documents related to chemistry concepts, so do we get related/relevant answer for question asked.

It is always not possible to build questions for uploaded document domain by user.

Can you please explain how trained model without upload domain knowledge will work in above situation.

facebookresearch / DrQA

TypeError: expected string or buffer #77