ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

JSON format #17

Closed noamross closed 9 years ago

noamross commented 10 years ago

Currently, the output is a list of separate objects, each holding a single key-value pair:

  {
    "publisher": "PeerJ Inc."
  },
  {
    "journal": "PeerJ"
  },
  {
    "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
  },
  {
    "authors": "Lynn M. Pique"
  },
  {
    "authors": "Marie-Luise Brennan"
  },
  {
    "authors": "Colin J. Davidson"
  }

Would it make more sense to format output like this?

  "publisher": {"PeerJ Inc."}
  "journal": "PeerJ"
  "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
  "authors": {"Lynn M. Pique", "Marie-Luise Brennan", "Colin J. Davidson"}

I'm referring to the structure here, not the compressed spacing. In the latter format, we have a single set of keys and values. The former format adds an unnecessary layer that makes it harder to get things out. Does output.title refer to anything in the former structure? It seems you would need to index into the list first, e.g. output(3).title. In the latter format, output.title works directly, and output.authors should return a vector of all the authors, which can be subset as output.authors(1).
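To make the access difference concrete, here is a rough Python sketch using the values from the example above. The variable names (`current`, `proposed`) are only for illustration, and the authors are written as a list so that the snippet is valid Python; this is not quickscrape's own code.

```python
# Current output: a list of single-key objects (values copied from the example).
current = [
    {"publisher": "PeerJ Inc."},
    {"journal": "PeerJ"},
    {"title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes "
              "in individuals with congenital hearing loss"},
    {"authors": "Lynn M. Pique"},
    {"authors": "Marie-Luise Brennan"},
    {"authors": "Colin J. Davidson"},
]

# Proposed output: one object keyed by field name.
proposed = {
    "journal": "PeerJ",
    "authors": ["Lynn M. Pique", "Marie-Luise Brennan", "Colin J. Davidson"],
}

# With the current structure, every lookup means scanning the whole list.
journal_current = [item["journal"] for item in current if "journal" in item]
authors_current = [item["authors"] for item in current if "authors" in item]

# With the proposed structure, a lookup is a plain key access.
journal_proposed = proposed["journal"]
first_author = proposed["authors"][0]
```

The scan returns a list even for fields that only occur once, so the caller still has to unwrap it; the keyed form avoids that step.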

blahah commented 10 years ago

FYI, the multi-element values will have to be arrays:

"publisher": ["PeerJ Inc."],
"journal": "PeerJ",
"title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss",
"authors": ["Lynn M. Pique", "Marie-Luise Brennan", "Colin J. Davidson"]

I'm also planning to implement this as the default behaviour in thresher. I think multiple elements matching a single selector should be collected as an array, which makes data analysis more intuitive.
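For anyone post-processing the old output in the meantime, the collection step could look something like the Python sketch below. `collect_by_key` is a hypothetical helper, not thresher's implementation, and it assumes every value is wrapped in a list, including single matches; thresher's actual behaviour may differ.

```python
from collections import defaultdict


def collect_by_key(pairs):
    """Merge a list of single-key dicts into one dict of lists.

    Assumption: every key maps to a list, even when it occurs only once.
    """
    grouped = defaultdict(list)
    for pair in pairs:
        for key, value in pair.items():
            # Repeated keys accumulate in order of appearance.
            grouped[key].append(value)
    return dict(grouped)


# Old-style output, shortened from the example in this thread.
flat = [
    {"journal": "PeerJ"},
    {"authors": "Lynn M. Pique"},
    {"authors": "Marie-Luise Brennan"},
    {"authors": "Colin J. Davidson"},
]

grouped = collect_by_key(flat)
```

Wrapping everything in a list keeps the value type predictable for each key, at the cost of an extra `[0]` for fields that are known to be single-valued.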

blahah commented 10 years ago

Implemented (along with various other features) in thresher. quickscrape will bring in these changes imminently.