DataKind-BLR / PrathamBooks-Sprint-2018

Code and documentation for the collaboration with PrathamBooks during Sprint' 2018
MIT License

For stories, text content needs to be extracted from html page content #1

Closed arnabbiswas1 closed 6 years ago

arnabbiswas1 commented 6 years ago

In stories_pages.csv, the "page_content" column holds the html content of each page. The actual text of the story needs to be extracted from that html content.

For example, for the following html content, the story text we are interested in is "A fawn was racing in the forest.":

"<p class=""wysiwyg-text-align-left""><span class=""text-font-largest"">A fawn was racing in the forest.

<p class=""wysiwyg-text-align-left"">

<p class=""wysiwyg-text-align-left"">

"

End result should be:

  1. Modified stories_pages.csv. (Do NOT commit data to GitHub. You may list the data file name in .gitignore so that it does not get committed.)

  2. The script used to generate the data (MUST be committed to GitHub).
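
One possible shape for such a script, as a minimal sketch rather than the approach the project actually adopted. It uses only the Python standard library, assumes the CSV is UTF-8 encoded, and assumes the column is named "page_content" as described above. The names `extract_text` and `clean_stories_pages`, and the output file name in the usage note, are hypothetical.

```python
import csv
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an html fragment and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(page_content):
    """Return the text of an html fragment with runs of whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(page_content or "")
    collector.close()  # flush any text still buffered by the parser
    # Join text nodes with a space so adjacent paragraphs do not run together,
    # then collapse newlines and repeated spaces.
    return " ".join(" ".join(collector.chunks).split())


def clean_stories_pages(src_path, dst_path, column="page_content"):
    """Hypothetical helper: copy the CSV, replacing `column` with its plain text."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row[column] = extract_text(row[column])
            writer.writerow(row)
```

It could be run as `clean_stories_pages("stories_pages.csv", "stories_pages_clean.csv")`, with both file names listed in .gitignore. Because text nodes are joined with a space, a word split by an inline tag in the middle would come out with a space inside it; if that occurs in the data, a dedicated html parsing library may be a better fit.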

siddjain24 commented 6 years ago

Hello @arnabbiswas1, I can take this up.

arnabbiswas1 commented 6 years ago

@siddjain24 Here you go!