dharmafly / noodle

A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST.
https://noodle.dharmafly.com/

mapped queries not working #102

Closed: khchan closed this issue 8 years ago

khchan commented 9 years ago

Sending the following sample query to noodle doesn't produce the expected mapped results. Instead, it returns the full HTML of the GitHub page.

Query:

{
    "url": "https://github.com/chrisnewtn",
    "type": "html",
    "map": {
        "person": {
            "selector": "span[itemprop=name]",
            "extract": "text"
        },
        "repos": {
            "selector": "li span.repo",
            "extract": "text"
        }
    }
}

AaronAcerboni commented 9 years ago

Hi @khchan

I am having difficulty reproducing your bug. Do you know whether you made any requests before receiving the full GitHub page HTML? It may be an issue with the cache.

This is the code I am using to try and uncover your problem.

var noodle = require('noodle'),
    fs     = require('fs');

// Same query as in the report: map the profile name and the repo names.
noodle.query({
    "url": "https://github.com/chrisnewtn",
    "type": "html",
    "map": {
        "person": {
            "selector": "span[itemprop=name]",
            "extract": "text"
        },
        "repos": {
            "selector": "li span.repo",
            "extract": "text"
        }
    }
})
.then(function (fetched) {
    // Write the results to disk so they can be inspected.
    var str = JSON.stringify(fetched.results);
    fs.writeFileSync('output.json', str, 'utf8');
});

The output that follows is correct:

[
   {
      "results":{
         "person":[
            "Chris Newton"
         ],
         "repos":[
            "cmd-async-slides",
            "jquery-async-uploader",
            "simplechat",
            "sitestatus",
            "backbone.iobind",
            "routemaster",
            "asyncjs.github.com",
            "selleckt",
            "expectations",
            "noodle"
         ]
      },
      "created":"2014-11-20T20:20:16.569Z"
   }
]
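
To tell the two outcomes apart programmatically, one option is to check each entry of the fetched results against the keys of the map. The sketch below is only an illustration: `isMappedResult` is a hypothetical helper, not part of noodle, and it assumes that a correctly mapped entry holds one array per map key (as in the output above). The shape of the failing response is a guess, since the report only says the full HTML came back.

```javascript
// Hypothetical helper (not part of noodle): returns true when one entry of
// the fetched results has the mapped shape shown above, i.e. `results` is an
// object holding an array for every key in the query's map.
function isMappedResult(entry, map) {
    if (!entry || typeof entry.results !== 'object' || entry.results === null) {
        return false;
    }
    return Object.keys(map).every(function (key) {
        return Array.isArray(entry.results[key]);
    });
}

var map = {
    person: { selector: 'span[itemprop=name]', extract: 'text' },
    repos:  { selector: 'li span.repo', extract: 'text' }
};

// Shape taken from the correct output above (repo list shortened).
var mapped = {
    results: { person: ['Chris Newton'], repos: ['selleckt', 'noodle'] },
    created: '2014-11-20T20:20:16.569Z'
};

// Assumed shape of the failing case: raw HTML instead of mapped keys.
var unmapped = { results: '<html>...</html>' };

console.log(isMappedResult(mapped, map));
console.log(isMappedResult(unmapped, map));
```

Running a check like this on each element of `fetched.results` before writing `output.json` would show whether the map was applied or the page came back unprocessed.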