Alina-enni / lingdiggers

Project for the Building NLP Applications course

Step 5 #13

Closed Alina-enni closed 2 years ago

Alina-enni commented 2 years ago
  1. This task is important. You need to index some "real" documents from a text file. When your program starts, it should read the document contents from a file and index them. After that, the user should be able to type queries and retrieve the matching documents. Initially, you can use our example data sets: one contains 100 articles extracted from English Wikipedia, and the other contains 1000 articles extracted from English Wikipedia (with topics mostly starting with the letter A). When you read these files, you need to produce a list of strings such that each entire article (document) is in one string. You can locate the boundaries between two articles from the tag, which always occurs on a line of its own in the file. The text is UTF-8 encoded.
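The reading step described above could be sketched roughly as follows. This is only a minimal illustration, not the course's reference solution: the function names `split_articles` and `read_articles` are hypothetical, and because the exact boundary tag is not shown in the task text, it is passed in as the `boundary_tag` parameter rather than hard-coded.

```python
from pathlib import Path


def split_articles(text, boundary_tag):
    """Split raw file contents into a list with one string per article.

    Assumption: a line consisting solely of `boundary_tag` marks the
    boundary between two articles. The tag itself is not kept.
    """
    documents = []
    current = []
    for line in text.splitlines():
        if line.strip() == boundary_tag:
            doc = "\n".join(current).strip()
            if doc:  # skip empty chunks, e.g. two boundary lines in a row
                documents.append(doc)
            current = []
        else:
            current.append(line)
    # Keep any trailing text that is not followed by a boundary line.
    tail = "\n".join(current).strip()
    if tail:
        documents.append(tail)
    return documents


def read_articles(path, boundary_tag):
    """Read a UTF-8 encoded file and return its articles as a list of strings."""
    text = Path(path).read_text(encoding="utf-8")
    return split_articles(text, boundary_tag)
```

The resulting list can then be handed to whatever indexing code the program already uses, since each element holds one complete article.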

miglamigla commented 2 years ago

I added some code to open the text file and split it into articles, but the articles are not very clean. Should we do something about that?

swilli6 commented 2 years ago

I added HTML tag removal, but it also removes the document title, which was previously included within an HTML tag.

Alina-enni commented 2 years ago

@swilli6 Are the titles needed?

swilli6 commented 2 years ago

> @swilli6 Are the titles needed?

@Alina-enni Not necessarily, I think, but they would have been nice to have. Without titles, it is difficult to find a specific article quickly if needed.
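
One possible way to keep the titles, sketched here under assumptions: capture the title before stripping the tags. The thread does not show how the title is stored, so this example assumes it sits in a `name="..."` attribute of a tag; both that attribute name and the helper `strip_tags_keep_title` are hypothetical and would need adjusting to the real data.

```python
import re

# Matches any HTML-like tag, e.g. <b> or </b>.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: the title is stored in a name="..." attribute of some tag.
TITLE_RE = re.compile(r'<[^>]*\bname="([^"]*)"[^>]*>')


def strip_tags_keep_title(document):
    """Return (title, text) for one article string.

    The title is read from the first tag carrying a name attribute
    (None if there is no such tag); all tags are then removed.
    """
    match = TITLE_RE.search(document)
    title = match.group(1) if match else None
    text = TAG_RE.sub("", document).strip()
    return title, text
```

Storing the titles in a list parallel to the document list would let search results show a title next to each match, which addresses the concern about finding an article quickly.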