brick / schema

Schema.org library for PHP
MIT License
50 stars 6 forks source link

Question if PR will be accepted #3

Open MarGul opened 3 years ago

MarGul commented 3 years ago

Hi guys.

Not sure how I can request a feature before I build it to see if it will be accepted. At ThingConverter you are doing the right thing when ignoring a type of https://schema.org./product with a lowercase first letter.

Unfortunately I'm crawling a site that do not respect the full schema.org contract and are using a lowercase first letter. If I change the preg_match in ThingsConverter to preg_match('/https?\:\/\/schema.org\/([a-zA-z0-9]+)$/', $schemaOrgId, $matches) === 1 it all works because now it doesn't have to be an uppercase first letter.

Would this be OK otherwise I will have to extend the package by creating my own SchemaReader and own ThingsConverter which would be fine but would you accept this change?

Thanks for a great package!

BenMorel commented 3 years ago

Hi, it depends on whether the thing names are case-sensitive, in which case I'd be happy to accept a PR to change this. Can you find a reference somewhere?

MarGul commented 3 years ago

Thanks for your quick response @BenMorel . I should have done this research before but it looks like Schema is case-sensitive so we shouldn't change this. This is from Google Note: Schema.org markup can be used on web pages written in any language. The markup, like HTML, is in English. Schema.org values are case-sensitive..

What do you think is the best way for me to solve this? It will be hard but maybe I can parse the HTML before I send it to the ThingsConverter and look for all https://schema.org/product and change it to https://schema.org/Product. Just a little bit concerned if I have to do this for every corner-case that I find.

BenMorel commented 3 years ago

Schema.org values are case-sensitive.

I'm still not sure what they understand by values. I'd rather try to find this information on https://schema.org/ if it exists!

Otherwise, TBH I'm not sure what value case-sensitivity brings, I mean, nobody will ever create a schema.org/product to compete with schema.org/Product, so even if the spec says that it should be case-sensitive, I wouldn't be against a boolean configuration somewhere to make it lenient and therefore case-insensitive.

But let's first be 100% sure that it's supposed to be case-sensitive, before introducing a setting.

Also please test the output of the Google Structured Data Testing Tool and Yandex Structured data validator to see how they behave with a lowercase product.

Report your findings here, and we'll decide what to do!

MarGul commented 3 years ago

I agree. Don't see why we should completely dismiss the item if it's a little bit misspelled. Maybe go with just checking https://schema.org like you say would be sufficient. I think so.

The Yandex tool was down for me but with Google Structured Data Testing Tool it does recognize https://schema.org/product as a Thing. This is the URL that I'm using https://www.telia.se/privat/telefoni/telefoner/produkt/sony-xperia-5-ll.

Personally I don't think that this needs a setting because it's pretty clear that they have tried to put in structured data and Google are accepting it as well.

BenMorel commented 3 years ago

I can confirm that Google handles it case-insensitively.

I could not find a better source for case sensitivity, but this link says:

While Schema.org vocabulary is case sensitive (and you can enable this option), Google is not as strict – so this isn’t required for Google rich result features and their understanding of structured data.

I think it could make sense to make case sensitivity an option, still.

MarGul commented 3 years ago

I agree that a config wouldn't hurt but I think default should be case-insensitively. Where do you think a config like this should reside?

Do you then find it sufficient to "only" build up the preg_match string depending on this config at ThingsConverter like

$pattern = $caseInsensitively ? '[a-zA-z0-9]' : '[A-Za-z0-9]';

if (preg_match("/https?\:\/\/schema.org\/([{$pattern}]+)$/", $schemaOrgId, $matches) === 1) {
    //
}
jhard commented 1 year ago

This would probably have to go into ObjectFactory, since only there it will be known what properties exist. The regexp in ThingsConverter already matches both, the order in the character group does not matter. In ObjectFactory, it could be something like this to determine the types in the build function:

        $types = array_map(function(string $type) {
           if(isset($this->propertiesByType[$type])) {
              return $type;
           }
           foreach(array_keys($this->propertiesByType) as $potentialType) {
              if(strtolower($potentialType) === strtolower($type)) {
                 return $potentialType;
              }
           }
           return null;
        }, $types);

        $types = array_filter(array_values($types));

Which isn't super efficient, but it'll get the job done. I guess the best way to make it configurable would be to have the config live as a property on SchemaReader and then pass it as a parameter into ThingConverter, which would pass it into ObjectFactory, wouldn't it?