Week 1 task 2 - Githubissues

Alina-enni / lingdiggers

Project for the Building NLP Applications course


Week 1 task 2 #8

Closed: Alina-enni closed this issue 2 years ago

Alina-enni commented 2 years ago

We need to write a program that extracts text from a web page.

We can use the Beautiful Soup library to extract specific parts of the page, such as headings or paragraphs. Alternatively, we can extract the page's plain text and then use regular expressions or another technique to pull out the parts we want.
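
As a rough sketch of the Beautiful Soup approach, something like the following could work. The sample HTML and the function name `extract_headings_and_paragraphs` are made up for illustration, and the built-in "html.parser" backend is assumed. For the real task the HTML would come from the web page we choose.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for HTML fetched from a real site.
SAMPLE_HTML = """
<html><head><title>Example news</title></head>
<body>
  <h1>Main headline</h1>
  <p>First paragraph with a <a href="/more">link</a>.</p>
  <h2>Second story</h2>
  <p>Another paragraph.</p>
</body></html>
"""


def extract_headings_and_paragraphs(html):
    """Return (headings, paragraphs) as lists of plain-text strings."""
    soup = BeautifulSoup(html, "html.parser")

    def clean(tag):
        # get_text() drops the markup; split/join collapses extra whitespace.
        return " ".join(tag.get_text().split())

    headings = [clean(tag) for tag in soup.find_all(["h1", "h2", "h3"])]
    paragraphs = [clean(tag) for tag in soup.find_all("p")]
    return headings, paragraphs


headings, paragraphs = extract_headings_and_paragraphs(SAMPLE_HTML)
print("Headings found on the page:", headings)
print("Paragraphs found on the page:", paragraphs)
```

Here `find_all` accepts either one tag name or a list of names, so the heading levels can be adjusted to whatever the target page uses.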

The program needs to print the text as plain text, without any HTML markup (that is, without the HTML tags). We should also add some extra print statements that explain what information we are displaying and what it means.
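
For the regular-expression route, here is a minimal sketch using only the standard library. The helper names `strip_markup` and `report` and the sample string are hypothetical. Stripping tags with regular expressions is fragile on messy real-world HTML, so this is only meant to show the idea of removing the markup and labelling what gets printed.

```python
import html
import re


def strip_markup(raw_html):
    """Return the text of raw_html with tags removed and whitespace collapsed."""
    # Drop script and style blocks entirely; their contents are not page text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Replace every remaining tag with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Turn entities such as &amp; back into ordinary characters.
    text = html.unescape(text)
    return " ".join(text.split())


def report(raw_html):
    """Return labelled lines describing what was extracted from the page."""
    match = re.search(r"(?is)<title[^>]*>(.*?)</title>", raw_html)
    title = strip_markup(match.group(1)) if match else "(no title found)"
    # Remove the <head> section so the title is not repeated in the body text.
    body = strip_markup(re.sub(r"(?is)<head\b.*?</head>", " ", raw_html))
    return [
        "Page title (from the <title> tag): " + title,
        "Page text with all HTML tags removed: " + body,
    ]


# Hypothetical sample input, standing in for a downloaded page.
SAMPLE = (
    "<html><head><title>Daily &amp; News</title>"
    "<style>p{color:red}</style></head>"
    "<body><h1>Headline</h1><p>Some <b>bold</b> text.</p></body></html>"
)

for line in report(SAMPLE):
    print(line)
```

Each printed line starts with a short label saying what the text is, which covers the "extra print statements" part of the task.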

miglamigla commented 2 years ago

Hi, just to clarify - do we need to extract only the title and text of this specific article (I think I've found a way to extract just those) or do we also need the titles of links to all the other articles/videos next to it? @swilli6 @Alina-enni

swilli6 commented 2 years ago

I like what you've done with the code so far, but I think the HTML page is supposed to be one whose content changes, like a news website. Also, at the moment the code produces tokenized content, which I don't think was the goal? I'll poke around with the program and see what I can do.

swilli6 commented 2 years ago

@Alina-enni @miglamigla I made some changes, can you run the code and see what you think? If you're happy with this, we can close the issue and call it done :)

Alina-enni commented 2 years ago

@swilli6 @miglamigla I think it looks great now! I got completely stuck at the part where I was supposed to strip the unnecessary parts of the text, because I'm not good with regular expressions. I also got confused by the instructions, because I thought we might have to extract a longer text with headings and paragraphs ... But I think what we have now is good.