ClimateMisinformation / Scrapers

Web scrapers
Remove boilerplate text and metadata from scraped sources #19

Open ricjhill opened 3 years ago

ricjhill commented 3 years ago

The scraped text contains some parts that are not useful content for analysing patterns of language. Should these be removed? For example:

135 of 313 Labels Share this...FacebookTwitter

Read the complete document: www.carbon-sense.com/wp-content/uploads/2008/05/alexander-2008.pdf [PDF, 266KB].

      __ATA.cmd.push(function() {
          __ATA.initDynamicSlot({
              id: 'atatags-1460517861-5fc7e51d7aed8',
              location: 120,
              formFactor: '001',
              label: {
                  text: 'Advertisements',
              },
              creative: {
                  reportAd: {
                      text: 'Report this ad',
                  },
                  privacySettings: {
                      text: 'Privacy settings',
                  }
              }
          });
      });
  Share this:PrintEmailTwitterFacebookPinterestLinkedInRedditLike this:Like Loading...
ricjhill commented 3 years ago
> <!--
> google_ad_client = "ca-pub-3545577860068042";
> /* neu test */
> google_ad_slot = "6412247007";
> google_ad_width = 200;
> google_ad_height = 200;
> //-->
ricjhill commented 3 years ago
  jQuery(document).ready(function(){
      jQuery('#dd_43fe30d37a49f1713b8a3a44662e0bc2').on('change', function() {
        jQuery('#amount_43fe30d37a49f1713b8a3a44662e0bc2').val(this.value);
      });
  });
ricjhill commented 3 years ago

ATA.cmd.push(function() { ATA.initDynamicSlot({ id: 'atatags-1460517861-5fc7ea6b2c1e4', location: 120, formFactor: '001', label: { text: 'Advertisements', }, creative: { reportAd: { text: 'Report this ad', }, privacySettings: { text: 'Privacy settings', } } }); }); Share this:PrintEmailTwitterFacebookPinterestLinkedInRedditLike this:Like Loading...

ricjhill commented 3 years ago

ATA.cmd.push(function() { ATA.initDynamicSlot({ id: 'atatags-1460517861-5fc7ea4e14bff', location: 120, formFactor: '001', label: { text: 'Advertisements', }, creative: { reportAd: { text: 'Report this ad', }, privacySettings: { text: 'Privacy settings', } } }); }); Share this:PrintEmailTwitterFacebookPinterestLinkedInRedditLike this:Like Loading...

ricjhill commented 3 years ago
  jQuery(document).ready(function(){
      jQuery('#dd_16efd3924a8804ec558ac63db78e3d5e').on('change', function() {
        jQuery('#amount_16efd3924a8804ec558ac63db78e3d5e').val(this.value);
      });
  });

Donate - choose an amount5101520501002505001000 Share this...FacebookTwitter
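
One possible way to drop the kinds of fragments listed above is sketched below. It is only an illustration, not the project's actual scraper code: the names `VisibleTextExtractor`, `WIDGET_PATTERNS`, `extract_visible_text` and `strip_widget_text` are hypothetical, and the sketch assumes the raw HTML is still available at scrape time. It skips `<script>` and `<style>` bodies and HTML comments while parsing, then removes the share-widget strings seen in the samples with regular expressions.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping script/style bodies (hypothetical helper).

    HTML comments are ignored because handle_comment is left as the
    default no-op.
    """

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


# Patterns taken from the samples in this issue; other sites will need more.
WIDGET_PATTERNS = [
    re.compile(r"Share this\.\.\.(?:\s*(?:Facebook|Twitter))+"),
    re.compile(
        r"Share this:(?:\s*(?:Print|Email|Twitter|Facebook|Pinterest|LinkedIn|Reddit))+"
    ),
    re.compile(r"Like this:\s*Like\s*Loading\.\.\."),
]


def extract_visible_text(html):
    """Return the text of an HTML page without script, style or comment content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def strip_widget_text(text):
    """Remove share/like widget strings from already-extracted text."""
    for pattern in WIDGET_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    page = (
        "<p>Article body.</p>"
        "<script>__ATA.cmd.push(function() {});</script>"
        "<!-- google_ad_client = 'x'; -->"
        "<p>Share this...FacebookTwitter</p>"
    )
    print(strip_widget_text(extract_visible_text(page)))
```

The regular expressions only cover the widget strings quoted in this thread. Fragments such as the "Donate - choose an amount..." line would need their own rule, or a more general content-extraction step.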