OP-Engineering / link-preview-js

⛓ Extract information from web links: title, description, images, videos, etc. [via Open Graph]; runs on mobile and Node.
MIT License
770 stars 124 forks

Receiving incorrect response for a specific website #124

Closed InaaraKalani closed 2 years ago

InaaraKalani commented 2 years ago

https://www.britishcouncil.pk/exam/school/your-world

The response I am receiving for this is: { url: 'https://www.britishcouncil.pk/exam/school/your-world', title: 'Access Denied', siteName: undefined, description: undefined, mediaType: 'website', contentType: 'text/html', images: [], videos: [], favicons: [ 'https://www.britishcouncil.pk/favicon.ico' ] }

kayode0x commented 2 years ago

From the README

This library acts as if the user would visit the page, sites might re-direct you to sign-up pages, consent screens, etc. You can try to change the user-agent header (try with google-bot or with Twitterbot), but you need to work around these issues yourself.
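For reference, a minimal sketch of what "change the user-agent header" looks like with link-preview-js, which accepts request headers through the `headers` option of `getLinkPreview`. The `botOptions` helper name and the timeout value are just illustrative choices:

```javascript
// Hedged sketch: build an options object that impersonates a crawler.
// `botOptions` is a hypothetical helper, not part of the library.
const botOptions = (agent = "googlebot") => ({
  headers: { "user-agent": agent },
  timeout: 5000, // some sites stall indefinitely for unknown clients
});

// Usage (after `npm i link-preview-js`):
// const { getLinkPreview } = require("link-preview-js");
// const data = await getLinkPreview(url, botOptions("Twitterbot"));
```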

kayode0x commented 2 years ago

An easy workaround to get a better final response is to check the result and modify it if needed.

const previewCheck = (result = {}) => {
    // guard against a missing title before lowercasing
    const title = (result.title || "").toLowerCase();
    const domain = getDomainFromURL(result.url); // e.g. "https://www.britishcouncil.pk"

    // common title prefixes that signal blocked or broken pages.
    const cases = [
        "access denied",
        "attention required", "forbidden",
        "invalid", "not found",
        "unauthorized", "just a moment",
        "please wait", "processing",
        "server error", "unavailable",
        "403 forbidden",
        // add more if needed.
    ];

    for (const prefix of cases) { // `case` is a reserved word, so use another name
        // if it has a normal title, skip.
        if (!title.startsWith(prefix)) continue;

        // here, the result was bad, so you can modify it.
        result.title = domain;
        result.siteName = domain;

        // you can also do other kinds of stuff here like add a placeholder image / favicon.
        break; // no need to check the remaining prefixes
    }

    return result;
};

const response = previewCheck(result);
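The snippet above assumes a `getDomainFromURL` helper that isn't shown. A minimal sketch of one, using the standard WHATWG `URL` API:

```javascript
// Hypothetical helper assumed by the snippet above: extract the origin
// (scheme + host) from a URL string.
const getDomainFromURL = (url) => {
  try {
    return new URL(url).origin; // e.g. "https://www.britishcouncil.pk"
  } catch {
    return url; // fall back to the raw string on malformed input
  }
};
```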
InaaraKalani commented 2 years ago

From the README

This library acts as if the user would visit the page, sites might re-direct you to sign-up pages, consent screens, etc. You can try to change the user-agent header (try with google-bot or with Twitterbot), but you need to work around these issues yourself.

I tried google-bot and Twitterbot; they both timed out. Are there any other user-agent headers? Could you share a list, or a link to one, if possible?
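One option, sketched below under assumptions: loop over a few well-known crawler user-agent strings (the full Googlebot, Twitterbot, and facebookexternalhit strings are publicly documented by those services) until one of them returns a usable title. `fetchPreview` is meant to be link-preview-js' `getLinkPreview`, injected as a parameter; the "access denied" check and the timeout are illustrative:

```javascript
// Well-known crawler user-agent strings to try in order.
const CRAWLER_AGENTS = [
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "Twitterbot/1.0",
  "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
];

async function previewWithFallback(url, fetchPreview) {
  for (const agent of CRAWLER_AGENTS) {
    try {
      const data = await fetchPreview(url, {
        headers: { "user-agent": agent },
        timeout: 5000,
      });
      // treat an "Access Denied"-style title as a miss and keep trying
      if (data.title && !/access denied/i.test(data.title)) return data;
    } catch {
      // timeout or block: fall through to the next agent
    }
  }
  throw new Error("every user-agent was blocked");
}

// Usage: previewWithFallback(url, require("link-preview-js").getLinkPreview)
```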

InaaraKalani commented 2 years ago

An easy workaround to get a better final response is to check the result and modify it if needed.

const previewCheck = (result = {}) => {
    // guard against a missing title before lowercasing
    const title = (result.title || "").toLowerCase();
    const domain = getDomainFromURL(result.url); // e.g. "https://www.britishcouncil.pk"

    // common title prefixes that signal blocked or broken pages.
    const cases = [
        "access denied",
        "attention required", "forbidden",
        "invalid", "not found",
        "unauthorized", "just a moment",
        "please wait", "processing",
        "server error", "unavailable",
        "403 forbidden",
        // add more if needed.
    ];

    for (const prefix of cases) { // `case` is a reserved word, so use another name
        // if it has a normal title, skip.
        if (!title.startsWith(prefix)) continue;

        // here, the result was bad, so you can modify it.
        result.title = domain;
        result.siteName = domain;

        // you can also do other kinds of stuff here like add a placeholder image / favicon.
        break; // no need to check the remaining prefixes
    }

    return result;
};

const response = previewCheck(result);

Thank you. This does help with modifying the result, but I would rather have the website's actual data. Some clients (like Slack and Facebook) display the proper link preview for this URL, so I know it's possible. I just need to figure out the right crawler, I guess.

ospfranco commented 2 years ago

Closing this. There is nothing wrong with the library; further questions should be asked on StackOverflow.

Just as a tip: Slack/Facebook/Google might work because their servers could be whitelisted for crawling data; anti-crawl mitigations are complex nowadays.