TryGhost / migrate

MIT License

Sometimes substack pages have malformed image exports that break migrator #962

Closed randyau closed 8 months ago

randyau commented 8 months ago

I ran the CLI migrate tool on my Substack export, and about half of the posts ended up with broken feature images. The same thing happens with the beta migrator tool in Labs, presumably because it uses the same code.

migrate substack -v 0.36.2

The feature_image value in ghost-import.json contains a remote URL on an S3 host instead of the expected local scraped copy. Following that link yields an "Access Denied" error.

"title": "We might not see leap seconds after 2035 🤯",
...
"feature_image": "https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg",
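For anyone wanting to check how many posts in their own export are affected, here is a minimal Python sketch that lists every feature_image value still pointing at an S3 host. It is not part of the migrate tool. The function names and the `bad_suffix` default are hypothetical, and the walker deliberately makes no assumption about the exact layout of ghost-import.json: it simply visits every object that has a feature_image key.

```python
import json
from urllib.parse import urlparse


def iter_feature_images(node):
    """Yield (title, feature_image) for every JSON object with a feature_image key."""
    if isinstance(node, dict):
        if "feature_image" in node:
            yield node.get("title"), node["feature_image"]
        for value in node.values():
            yield from iter_feature_images(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_feature_images(item)


def suspicious_feature_images(data, bad_suffix="s3.amazonaws.com"):
    """Return (title, url) pairs whose host ends with bad_suffix (an assumed marker)."""
    hits = []
    for title, url in iter_feature_images(data):
        # Relative or placeholder paths have no hostname and are skipped.
        host = urlparse(url).hostname if isinstance(url, str) else None
        if host and host.endswith(bad_suffix):
            hits.append((title, url))
    return hits


def report(path):
    """Print every suspicious feature image found in the JSON file at path."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    for title, url in suspicious_feature_images(data):
        print(f"{title}: {url}")
```

Calling `report("ghost-import.json")` from the export directory prints one line per affected post.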

I went to the originating post in the HTML exported from Substack (exported 2023-12-28), and the top image that should have been converted to the feature_image is the img tag below. Its src points at the "bucketeer" AWS host, which is the broken URL being imported, and its data-attrs references the same broken URL. I'm not sure which of the two the migration tool is pulling. The tag also provides a srcset of substackcdn.com URLs that are actually downloadable, which is why the HTML still displays in a browser.

<img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg"
width="1200"
height="800.390625"
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:683,&quot;width&quot;:1024,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:188889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null}"
class="sizing-large"
alt=""
srcset="https://substackcdn.com/image/fetch/w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg 424w,
https://substackcdn.com/image/fetch/w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg 848w,
https://substackcdn.com/image/fetch/w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg 1272w,
https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7c193e59-fad6-49df-b659-b16976e1ce59_1024x683.jpeg 1456w"
sizes="100vw"
fetchpriority="high">
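Since the srcset entries are downloadable while src is not, one possible workaround is to prefer the widest srcset candidate and fall back to src only when no srcset is present. The Python sketch below illustrates that idea; it is an assumption about a reasonable approach, not a description of how the migrate tool was actually fixed, and all names in it are hypothetical. The srcset parsing is simplified: it relies on URLs containing no whitespace, which matters here because the Substack CDN URLs contain commas and cannot be split on `,`.

```python
import re
from html.parser import HTMLParser


class FirstImageParser(HTMLParser):
    """Collects the attributes of the first <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def widest_srcset_url(srcset):
    """Return the URL with the largest width descriptor, or None if there is none."""
    # Each candidate is "<url> <width>w"; URLs may contain commas but no whitespace.
    candidates = re.findall(r"(\S+)\s+(\d+)w", srcset or "")
    if not candidates:
        return None
    url, _width = max(candidates, key=lambda c: int(c[1]))
    # Drop a separator comma if the URL directly followed the previous candidate.
    return url.lstrip(",")


def pick_image_url(html):
    """Prefer the widest srcset URL of the first <img>; fall back to its src."""
    parser = FirstImageParser()
    parser.feed(html)
    attrs = parser.attrs or {}
    return widest_srcset_url(attrs.get("srcset")) or attrs.get("src")
```

For the tag above, this would select the `w_1456` substackcdn.com URL rather than the bucketeer one.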

The problem seems to affect all my posts prior to around January 2023, but it's not clear why there's a difference at all.

An example of a broken HTML file from the export is attached here: buggy_html.zip

PaulAdamDavis commented 8 months ago

Hi @randyau,

Thanks for the detailed report and sample file! 🙌

I've had a quick look and can see a solution, which I'll get implemented and released to the CLI tools and beta migrator soon. I'll update this issue when that's done.

PaulAdamDavis commented 8 months ago

This is now fixed and released in @tryghost/migrate@0.37.0 & @tryghost/mg-substack@0.4.0, and in the self-service migration tools.