Closed by Alina-enni 2 years ago
Hi, just to clarify - do we need to extract only the title and text of this specific article (I think I've found a way to extract just those) or do we also need the titles of links to all the other articles/videos next to it? @swilli6 @Alina-enni
I like what you've done with the code so far, but I think the HTML page is supposed to be a changing one, like a news website. Also, at the moment the code produces tokenized content which I don't think was the goal? I'll poke around with the program and see what I can do.
@Alina-enni @miglamigla I made some changes, can you run the code and see what you think? If you're happy with this, we can close the issue and call it done :)
@swilli6 @miglamigla I think it looks great now! I got completely stuck at the part where I was supposed to strip the unnecessary parts of the text because I'm not good with regular expressions. I also got confused by the instructions because I thought we might have to extract a longer text with headings and paragraphs ... But I think what we have now is good.
We need to write a program that extracts text from a web page.
We can use features from the Beautiful Soup library to extract portions such as headings or paragraphs. We can also extract plain text and use regular expressions or some other techniques to retrieve text.
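As a rough illustration of the Beautiful Soup route, here is a minimal sketch. It assumes the `beautifulsoup4` package is installed and works on a small hard-coded page. `SAMPLE_HTML` and `extract_title_and_paragraphs` are made-up names for this example, and a real news page would probably need more specific selectors than plain `h1` and `p`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real program would fetch a live page instead.
SAMPLE_HTML = """
<html><body>
  <h1>Example headline</h1>
  <p>First paragraph of the article.</p>
  <p>Second paragraph with a <a href="/other">link</a>.</p>
</body></html>
"""


def extract_title_and_paragraphs(html):
    """Return the first <h1> text and a list of <p> texts, without tags."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when the tag is missing, so guard against that.
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading else ""

    # get_text() drops the markup; split/join collapses stray whitespace.
    paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    return title, paragraphs


title, paragraphs = extract_title_and_paragraphs(SAMPLE_HTML)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

Keeping the extraction in one function means the same code could later be pointed at a downloaded page instead of the sample string.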
The program needs to print the text as plain text without any HTML markup (meaning the HTML tags). We should also add some extra print statements that explain what information we are displaying and what it means.
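One possible shape for the output step is sketched below. It uses only the standard library (`re` and `html`) rather than Beautiful Soup, so it is an example of the regular expression route and not necessarily what the final program does. `strip_markup`, `format_report`, the sample strings, and the label wording are all assumptions made for this example. Regex-based stripping is crude and can leave odd spacing around inline tags.

```python
import html
import re


def strip_markup(raw_html):
    """Remove script/style blocks and all tags, then collapse whitespace."""
    # Drop <script>...</script> and <style>...</style> including their contents.
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Replace every remaining tag with a space so adjacent words stay separate.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    # Turn entities such as &amp; back into plain characters.
    text = html.unescape(no_tags)
    return re.sub(r"\s+", " ", text).strip()


def format_report(title, body):
    """Build the labelled plain-text output that the program prints."""
    return "\n".join([
        "Article title (text of the page heading):",
        title,
        "",
        "Article text (page content with all HTML tags removed):",
        body,
    ])


# Hypothetical input fragments standing in for parts of a real page.
sample_title = strip_markup("<h1>News &amp; views</h1>")
sample_body = strip_markup(
    "<style>p {color: red}</style><p>Hello <b>world</b> again.</p>"
)
report = format_report(sample_title, sample_body)
print(report)
```

Building the report as a string before printing makes it easy to check that no markup slipped through.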