Closed InaaraKalani closed 2 years ago
This library acts as if the user would visit the page, sites might re-direct you to sign-up pages, consent screens, etc. You can try to change the user-agent header (try with google-bot or with Twitterbot), but you need to work around these issues yourself.
An easy workaround to get a better final response is to check the result and modify it if needed.
const previewCheck = (result = {}) => {
  // title can be undefined when the fetch was blocked, so fall back to an empty string.
  const title = (result.title || "").toLowerCase();
  // getDomainFromURL is your own helper; it is not defined in this snippet.
  const domain = getDomainFromURL(result.url); // https://www.britishcouncil.pk
  // common cases for unauthorized access.
  const cases = [
    "access denied",
    "attention required", "forbidden",
    "invalid", "not found",
    "unauthorized", "just a moment",
    "please wait", "processing",
    "server error", "unavailable",
    "403 forbidden",
    // add more if needed.
  ];
  for (const badTitle of cases) {
    // if it has a normal title, skip.
    if (!title.startsWith(badTitle)) continue;
    // here, the result was bad, you can modify it.
    result.title = domain;
    result.siteName = domain;
    // you can also do other kinds of stuff here like add a placeholder image / favicon.
    break; // one match is enough.
  }
  return result;
};
const response = previewCheck(result);
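The snippet above calls getDomainFromURL without showing it. A minimal sketch of such a helper, assuming it should return the scheme plus host (as the example value in the comment suggests), could rely on the standard URL class. The function body here is an assumption, not code from the thread:

```javascript
// Hypothetical helper: the thread never shows how getDomainFromURL is implemented.
// This version returns scheme + host, e.g. "https://www.britishcouncil.pk".
function getDomainFromURL(url) {
  try {
    return new URL(url).origin;
  } catch {
    // Not a parseable absolute URL (including undefined): return an empty string.
    return "";
  }
}
```

Returning an empty string instead of throwing keeps previewCheck usable even when the result object carries no url.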
From the README
This library acts as if the user would visit the page, sites might re-direct you to sign-up pages, consent screens, etc. You can try to change the user-agent header (try with google-bot or with Twitterbot), but you need to work around these issues yourself.
I tried google-bot and Twitterbot, and both requests timed out. Are there any other user-agent headers I could try? Can you share a list, or a link to one, if possible?
An easy workaround to get a better final response is to check the result and modify it if needed.
Thank you. This does help with modifying the result, but I would rather have the website's actual data. Some services (like Slack and Facebook) display the proper link preview for this page, so I know it's possible. I guess I just need to figure out the right crawler.
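One way to experiment with different crawlers is to try several user-agent strings in turn and keep the first result that does not look like a block page. The sketch below assumes a hypothetical fetchPreview(url, options) function that returns a promise of a result object with a title field; swap in whatever preview call you actually use. The user-agent strings are the commonly documented ones for each service and should be checked against each service's own documentation. There is no guarantee any of them gets through, since sites may also check the requesting IP address.

```javascript
// Commonly documented crawler user-agent strings (verify before relying on them).
const CRAWLER_USER_AGENTS = [
  "Googlebot/2.1 (+http://www.google.com/bot.html)",
  "Twitterbot/1.0",
  "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
  "Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)",
];

// Titles that usually indicate a block page rather than real content.
const BLOCKED_TITLES = ["access denied", "attention required", "forbidden", "just a moment"];

// A result counts as blocked if it has no title or its title starts with a known block phrase.
const looksBlocked = (result) => {
  const title = ((result && result.title) || "").toLowerCase();
  return title === "" || BLOCKED_TITLES.some((t) => title.startsWith(t));
};

// fetchPreview is a hypothetical (url, options) => Promise<result> supplied by the caller.
async function previewWithFallback(url, fetchPreview, userAgents = CRAWLER_USER_AGENTS) {
  let lastResult;
  for (const userAgent of userAgents) {
    try {
      const result = await fetchPreview(url, { headers: { "user-agent": userAgent } });
      if (!looksBlocked(result)) return result;
      lastResult = result; // remember the blocked result in case nothing better turns up
    } catch (err) {
      // Timeout or network error: move on to the next user agent.
    }
  }
  return lastResult; // undefined if every attempt threw
}
```

If every attempt is blocked, the last blocked result is returned, so it can still be passed through a check like previewCheck to substitute the domain for the title.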
Closing this. There is nothing wrong with the library; further questions should be asked on Stack Overflow.
Just as a tip: Slack/Facebook/Google might work because their servers could be whitelisted for crawling data. Anti-crawl mitigations nowadays are complex.
The response I am receiving for this is:
{
  url: 'https://www.britishcouncil.pk/exam/school/your-world',
  title: 'Access Denied',
  siteName: undefined,
  description: undefined,
  mediaType: 'website',
  contentType: 'text/html',
  images: [],
  videos: [],
  favicons: [ 'https://www.britishcouncil.pk/favicon.ico' ]
}