shivamani-ans closed this 6 years ago
Hi @shivamani-ans,
Sorry for the late response! I believe you have the same problem as #58.
As I pointed out there, we chose our IDF weighting as:
IDF = max(0, log((N - Nt + 0.5) / (Nt + 0.5)))
For N = 2 docs this will always be 0, as any token will always appear in ≥ 50% of the corpus.
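To make the arithmetic concrete, here is a minimal sketch of that formula in plain Python. The helper name `idf` is only for illustration and is not a function from the repo:

```python
import math

def idf(num_docs, doc_freq):
    """IDF = max(0, log((N - Nt + 0.5) / (Nt + 0.5)))."""
    return max(0.0, math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5)))

# With N = 2, a token appears in either 1 or 2 docs; both cases clamp to 0.
two_doc_weights = [idf(2, nt) for nt in (1, 2)]

# With N = 3, a token that appears in only 1 doc gets a positive weight.
three_doc_weight = idf(3, 1)
```

With Nt = 1 and N = 2 the ratio is exactly 1, so the log is 0; with Nt = 2 the log is negative and gets clamped to 0.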
I assume you are just testing and will be using more than 2 docs eventually; in that case, just test with a few more docs. For example:
{"id": "1", "text": "The American Civil War was fought in the United States from 1861 to 1865. The result of a long-standing controversy over slavery, war broke out in April 1861, when Confederates attacked Fort Sumter in South Carolina, shortly after President Abraham Lincoln was inaugurated. The nationalists of the Union proclaimed loyalty to the U.S. Constitution. They faced secessionists of the Confederate States, who advocated for states' rights to expand slavery."}
{"id": "2", "text": "Among the 34 U.S. states in February 1861, seven Southern slave states individually declared their secession from the U.S. to form the Confederate States of America, or the South. The Confederacy grew to include eleven slave states. The Confederacy was never diplomatically recognized by the United States government, nor was it recognized by any foreign country (although the United Kingdom and France granted it belligerent status). The states that remained loyal to the U.S. (including the border states where slavery was legal) were known as the Union or the North."}
{"id": "3", "text": "More stuff about Civil War"}
Otherwise you can change the IDF calculation as described in #58.
prep_wikipedia.py is used to define the preprocess function used in build_db.py. build_db.py expects the input file to have lines of JSON, each one defining a document with an id and a text field.
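If you want to sanity-check an input file against that format before building the database, something along these lines might help. This is only a sketch: `check_json_lines` is a hypothetical helper, not part of the repo.

```python
import json

def check_json_lines(lines):
    """Parse JSON lines; raise ValueError if a doc lacks `id` or `text`."""
    docs = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        doc = json.loads(line)
        missing = {"id", "text"} - set(doc)
        if missing:
            raise ValueError(f"line {lineno} is missing {sorted(missing)}")
        docs.append(doc)
    return docs

sample = [
    '{"id": "1", "text": "The American Civil War was fought from 1861 to 1865."}',
    '{"id": "3", "text": "More stuff about Civil War"}',
]
docs = check_json_lines(sample)
```

For a real file you could pass the open file object instead of the `sample` list, since iterating over a file yields one line at a time.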
If the JSON is formatted differently and requires some preprocessing, the script uses the preprocess function to transform it. Here we are working with the output of the WikiExtractor script, which formats things a bit differently, so that's why I used it.
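As a rough illustration of the idea, a preprocess function could look something like the sketch below. It assumes, purely for illustration, that each input record carries `title` and `text` keys; the actual function in prep_wikipedia.py may differ, and returning `None` to skip a record is this sketch's own convention.

```python
import json

def preprocess(article):
    """Map one input record to the {"id", "text"} shape, or None to skip it.

    Assumption (not taken from the repo): the record has `title` and `text` keys.
    """
    if not article.get("text"):
        return None  # nothing worth indexing
    return {"id": article["title"], "text": article["text"]}

raw_line = '{"title": "American Civil War", "text": "Fought from 1861 to 1865."}'
doc = preprocess(json.loads(raw_line))
```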
To be honest, it's really not that useful and was only used for convenience. Someone could equivalently do:
Thank you, Adam, for the response above. I am happy with the explanation.
Hi,
I have created a single file with 2 documents, in the format and with the content shown below:
{"id": "1", "text": "The American Civil War was fought in the United States from 1861 to 1865. The result of a long-standing controversy over slavery, war broke out in April 1861, when Confederates attacked Fort Sumter in South Carolina, shortly after President Abraham Lincoln was inaugurated. The nationalists of the Union proclaimed loyalty to the U.S. Constitution. They faced secessionists of the Confederate States, who advocated for states' rights to expand slavery."}
{"id": "2", "text": "Among the 34 U.S. states in February 1861, seven Southern slave states individually declared their secession from the U.S. to form the Confederate States of America, or the South. The Confederacy grew to include eleven slave states. The Confederacy was never diplomatically recognized by the United States government, nor was it recognized by any foreign country (although the United Kingdom and France granted it belligerent status). The states that remained loyal to the U.S. (including the border states where slavery was legal) were known as the Union or the North."}
I then executed build_db.py and build_tfidf.py. Both ran successfully and the .npy was created, but executing a query returns no results. The statement I executed was process('American Civil War', k=5).
I verified the .npy by extracting it and observed that the size of data.npy is 0.
Can anyone suggest what steps I have missed?
I would also like to understand the usage of prep_wikipedia.py (how to prepare a file with wiki articles).