dnif-archive / DigiVigi

GNU General Public License v3.0

Need a web scraper to capture tabular data #4

Closed PRASHANT-SAWANT closed 6 years ago

PRASHANT-SAWANT commented 6 years ago

This is with reference to PROCESS 1 - Stage 3 in the README file.

Does anyone have a web scraper that can extract tabular data from HTML pages and save it to a CSV file? The header of this CSV file also needs to be customized so that every header name starts with '$', as in '$headername'. This is a mandatory requirement.

Also, please add comments to your code so it's easier to understand. Thanks. :+1:
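As a rough illustration of the request above, here is a minimal sketch that reads one HTML table and writes it as CSV with `$`-prefixed header names. It uses only the Python standard library (`html.parser` and `csv`), so it is not the Beautiful Soup code discussed later in this thread. The names `TableParser`, `table_to_csv`, and `SAMPLE_HTML` are hypothetical, and it assumes the first table row holds the header and that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html, out):
    """Write the table found in `html` to the file-like object `out`."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        raise ValueError("no table rows found")
    # Assumption: the first row is the header row.
    header, *body = parser.rows
    writer = csv.writer(out)
    # Mandatory requirement from the issue: header names start with '$'.
    writer.writerow(["$" + name for name in header])
    writer.writerows(body)


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

buffer = io.StringIO()
table_to_csv(SAMPLE_HTML, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string buffer, pass a file opened with `open(path, "w", newline="")`, which is how the `csv` module documentation recommends opening files for writing.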

PRASHANT-SAWANT commented 6 years ago

@aakratisahu @Sharbanibasu23 @shreyaskulkarni412 There are a few ways of doing this. I found that there are already libraries available that make scraping quite easy:

  1. Beautiful Soup
  2. Scrapy
  3. Selenium

I have already gotten my hands dirty with Beautiful Soup and will add it to the repository soon. But you all can try out whichever library you are most comfortable with. :smile:

shreyaskulkarni412 commented 6 years ago

@PRASHANT-SAWANT we understand that we can get data from any website using any of these three Python libraries. But how do we retrieve it dynamically? That is, as the data is updated on the website, our script should pick it up in real time. Is there a specific way to do this in Python?
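One common approach to the question above is polling: fetch the page at a fixed interval and re-process it only when its content has changed. The sketch below is a minimal illustration under that assumption, not something decided in this thread. `poll_for_changes`, `fetch`, and `on_change` are hypothetical names, and the demo uses an in-memory stand-in for the website instead of a real request.

```python
import hashlib
import time


def poll_for_changes(fetch, on_change, interval_seconds=60.0, max_polls=None):
    """Call on_change(content) whenever fetch() returns new content.

    fetch:      callable returning the page content as a string
    on_change:  callable invoked with the content each time it changes
    max_polls:  stop after this many polls (None means run forever)
    """
    last_digest = None
    polls = 0
    while max_polls is None or polls < max_polls:
        content = fetch()
        # Compare digests so the previous page does not have to be stored.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest != last_digest:
            on_change(content)
            last_digest = digest
        polls += 1
        # Sleep only if another poll is going to follow.
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)


# Demo with a fake page that changes on the third fetch.
pages = iter(["<p>v1</p>", "<p>v1</p>", "<p>v2</p>"])
seen = []
poll_for_changes(lambda: next(pages), seen.append,
                 interval_seconds=0, max_polls=3)
print(seen)
```

In real use, `fetch` could wrap something like `urllib.request.urlopen(url).read().decode()`, and `on_change` could re-run the scraper. Pages that embed timestamps or other per-request content would look changed on every poll, so hashing only the extracted table would likely be more robust.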

PRASHANT-SAWANT commented 6 years ago

Hey @shreyaskulkarni412, currently we're looking for code that can capture a snapshot of data from any website. Retrieving that data dynamically will come in PROCESS 2.

So, do you have a solution for the current stage? That is, capture a static snapshot of the data in a form that can be used in the DNIF console and analyzed further. The code should produce this output. For example, I wrote code that captures a snapshot of data from a website and stores it in a CSV file, but that CSV file wouldn't upload to the event store, so something is wrong in that code. I'll push it to the repository only after fixing that issue and testing it fully.

PRASHANT-SAWANT commented 6 years ago

Any idea how to prepend captured data that is held as a text string? I want to prepend it to a CSV file. The captured data is already in the right format.

So, here's my doubt: I suspect I'm running into a fundamental limitation here, namely that one generally can't prepend data to an existing flat file without rewriting the entire file. Surely this holds regardless of language. It would also mean messing with the headers.

But I'm thinking the row order (ascending or descending) shouldn't matter when it comes to analyzing the data in DNIF.

Still, it's worth asking. Any ideas on how to prepend to the CSV without disturbing the headers?
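In line with the guess above that prepending means rewriting the file, one possible approach is to read the existing CSV, then write the header, the new rows, and the old rows to a temporary file and swap it into place. This is a minimal sketch under that assumption. `prepend_rows` and the demo file name `events.csv` are hypothetical, and it assumes the file is small enough to hold in memory and that its first row is the header.

```python
import csv
import os
import tempfile


def prepend_rows(path, new_rows):
    """Insert new_rows directly below the header row of the CSV at path."""
    # Read the header and the existing data rows.
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        old_rows = list(reader)

    # Write to a temporary file in the same directory, so the final
    # os.replace() swaps the file in a single step.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".csv")
    with os.fdopen(fd, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)      # header stays on the first line
        writer.writerows(new_rows)   # new data goes right below it
        writer.writerows(old_rows)   # existing data follows
    os.replace(tmp_path, path)


# Demo on a throwaway file with made-up rows.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "events.csv")
with open(demo_path, "w", newline="") as f:
    csv.writer(f).writerows([["$Name", "$Score"], ["alpha", "1"]])

prepend_rows(demo_path, [["beta", "2"]])

with open(demo_path, newline="") as f:
    result = list(csv.reader(f))
print(result)
```

If row order really does not matter for the analysis, appending with `open(path, "a", newline="")` avoids the rewrite entirely and leaves the header untouched.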

PRASHANT-SAWANT commented 6 years ago

Hey, I finally built working code to capture static data. The CSV file it writes works well in the DNIF event store too.

I'll add the code file to the repository ASAP. I wrote it using the BeautifulSoup library.

So let's move on to PROCESS 2 - Stage 3. Refer to the README to see what I'm talking about :sweat_smile: @shreyaskulkarni412 @aakratisahu @Sharbanibasu23