Janpot / microdata-node

Cheerio based microdata parser
MIT License
57 stars 7 forks source link

Update parse.js to default to using 'content' attribute value for undefined elements #10

Closed Not-Jayden closed 4 years ago

Not-Jayden commented 4 years ago

In most cases microdata elements (especially div's) will use the content attribute to define the content which is usually formatted in a much more usable way that the text.

I think it makes sense to use the content attribute first, then text if it's undefined.

Janpot commented 4 years ago

microdata-node intends to be spec compliant.

Not-Jayden commented 4 years ago

Hmm ok. The W3C spec differs in that it suggests any element with a content attribute should default to the content attribute value, even though that's not consistent with the HTML standards.

Was just finding that it was making a lot of the data coming from pages I was crawling being parsed in an unusable format. Google's structured data testing tool also seems to default to the content attribute value as well.

Up to you which spec you'd like to follow though, can just keep using the fork otherwise :)

Janpot commented 4 years ago

Ok, can you fix the build and add a test?

Janpot commented 4 years ago

@Jaydenaus Ok, I'm taking some time to revisit this library after not touching it for 2 years and I realised that the .parse method you're updating here has been deprecated for a while. There should have been a message logged for this. Anyway, I'm preparing version two and now removing this function altogether. You should use the .toJson method instead. It looks like this one supports the content attribute for any element already.

I'm closing this PR. Feel free to submit a new issue if the problem persists.