dongKenny / artveeScraper

Scrapes every image on artvee.com and collects the metadata in a json from a converted csv; the final json and images are uploaded to an aws s3 bucket.

img_dl_page IndexError #4

Open chapmanjacobd opened 3 weeks ago

chapmanjacobd commented 3 weeks ago

It seems the website was updated after this script was written; scrape_images() now fails with an IndexError:

---> 49     img_dl_page = requests.get("https://artvee.com/" + img_source[img_index].get("data-url"))
     50     img_soup = BeautifulSoup(img_dl_page.content, "html.parser")
     52     img_link = img_soup.find(
     53         "a",
     54         {
     55             "class": "prem-link gr btn dis snax-action snax-action-add-to-collection snax-action-add-to-collection-downloads"
     56         },
     57     ).get("href")

IndexError: list index out of range
> /home/xk/bin/artveeScraper.py(49)scrape_images()
     47     """
     48
---> 49     img_dl_page = requests.get("https://artvee.com/" + img_source[img_index].get("data-url"))
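
For anyone patching the script locally: the IndexError means img_source has fewer entries than img_index expects, which is consistent with the page markup having changed. Below is a minimal sketch of a guard, assuming the elements only need a dict-style .get() (as BeautifulSoup tags provide). safe_data_url is a hypothetical helper, not part of artveeScraper.

```python
from typing import Optional, Sequence


def safe_data_url(img_source: Sequence, img_index: int,
                  base: str = "https://artvee.com/") -> Optional[str]:
    """Return the full download-page URL, or None if it cannot be built.

    Hypothetical helper: img_source is assumed to be a list of objects
    with a dict-style .get(), such as BeautifulSoup tags.
    """
    # Guard against the list being shorter than expected (the IndexError above).
    if not 0 <= img_index < len(img_source):
        return None
    # .get() returns None when the attribute is missing, so check before concatenating.
    data_url = img_source[img_index].get("data-url")
    if not data_url:
        return None
    return base + data_url
```

A caller can then skip and log the item instead of crashing. The later img_soup.find(...) call returns None when no matching link exists, so its .get("href") would need the same kind of check.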

chapmanjacobd commented 3 weeks ago

If other people need this, you can replicate some of the script's functionality with a tool that I wrote (library, run as lb):

for cat in (cb) 
    lb linksdb artvee.db --path-include /dl/ --stop-pages-no-new 1 -c $cat https://artvee.com/c/$cat/page/1/?orderby=title_asc -v
end
library v2.8.063
['/home/xk/.local/bin/lb', 'linksdb', '/home/xk/lb/artvee.db', '--path-include', '/dl/', '--stop-pages-no-new', '1', 'https://artvee.com/c/figurative/page/1/?orderby=title_asc', '-v']
{'stop_pages_no_new': 1, 'path_include': ['/dl/'], 'paths': ['https://artvee.com/c/figurative/page/1/?orderby=title_asc']}
Extra playlists data {'hostname': 'artvee.com', 'time_created': 1718597247, 'time_modified': 0}
Loading page https://artvee.com/c/figurative/page/1/?orderby=title_asc
Page 1 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/2/?orderby=title_asc
Page 2 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/3/?orderby=title_asc
Page 3 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/4/?orderby=title_asc
Page 4 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/5/?orderby=title_asc
Page 5 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/6/?orderby=title_asc
Page 6 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/7/?orderby=title_asc
Page 7 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/8/?orderby=title_asc
Page 8 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/9/?orderby=title_asc
Page 9 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/10/?orderby=title_asc
Page 10 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/11/?orderby=title_asc
Retrying (Retry(total=7, connect=8, read=7, redirect=4, status=4)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /c/figurative/page/11/?orderby=title_asc
Page 11 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/12/?orderby=title_asc
Page 12 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/13/?orderby=title_asc
Page 13 link scan: 30 new [0 known]
Loading page https://artvee.com/c/figurative/page/14/?orderby=title_asc
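
For readers who want the same crawl pattern without the tool, here is a rough sketch of the loop the log above suggests: walk the numbered pages and stop once a page yields no new links. It assumes that --stop-pages-no-new 1 means "stop after one consecutive page with nothing new", which I have not checked against the tool's source. crawl_until_no_new and fetch_links are hypothetical names.

```python
from typing import Callable, Iterable, List, Set


def crawl_until_no_new(fetch_links: Callable[[int], Iterable[str]],
                       stop_pages_no_new: int = 1,
                       max_pages: int = 10_000) -> List[str]:
    """Collect links page by page until enough consecutive pages add nothing new.

    fetch_links(page) is a caller-supplied function (hypothetical) that
    returns the links found on the given 1-based page number.
    """
    seen: Set[str] = set()
    ordered: List[str] = []
    pages_without_new = 0
    for page in range(1, max_pages + 1):
        new_count = 0
        for url in fetch_links(page):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
                new_count += 1
        # Reset the counter on any page with new links, otherwise count it.
        pages_without_new = 0 if new_count else pages_without_new + 1
        if pages_without_new >= stop_pages_no_new:
            break
    return ordered
```

In practice fetch_links would request https://artvee.com/c/$cat/page/N/?orderby=title_asc and keep only hrefs containing /dl/. The max_pages cap is only a safety net against looping forever.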

Then you have a list of works in artvee.db which you can download like this:

732 0.9s xk:/ (main|✔) 🃁 lb media lb/artvee.db -l 2 -pf
https://artvee.com/dl/1-plafond-du-tombeau-danna-n-81-2-plafond-du-tombeau-de-thotnofer-n-80/
https://artvee.com/dl/11-heures-du-soir-portrait-from-les-dix-huit-heures-dune-parisienne/
732 0.9s xk:/ (main|✔) 🍪 lb links (lb media lb/artvee.db -l 2 -pf) --path-include https://mdl.artvee.com/sdl/
https://mdl.artvee.com/sdl/101112absdl.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=MZIIT36VLAXUDXH6Q7YL/20240617/nyc3/s3/aws4_request&X-Amz-Date=20240617T041334Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86400&X-Amz-Signature=033296d7348de3da3d0c36b067b2bbb20c8cdf72e83b072d0c2352e072566c4e
https://mdl.artvee.com/sdl/214755fgsdl.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=MZIIT36VLAXUDXH6Q7YL/20240617/nyc3/s3/aws4_request&X-Amz-Date=20240617T041334Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86400&X-Amz-Signature=51c0204de2d39db11e56fed931b2cb2219a23ad20f956a9f6fb2c2e24660a1c3
733 3s xk:/ (main|✔) 🃂 lb links (lb media lb/artvee.db -l 2 -pf) --path-include https://mdl.artvee.com/sdl/ --download
733 5.4s xk:/ (main|?1) 🥨 ls mdl.artvee.com/sdl/
Permissions Size User Date Modified Git Name
.rw-r--r--@ 1.5M xk   27 Oct  2022   -N 101112absdl.jpg
.rw-r--r--@ 2.7M xk   27 Oct  2022   -N 214755fgsdl.jpg
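
Note that the mdl.artvee.com links are presigned URLs: X-Amz-Date is the signing time and X-Amz-Expires is the lifetime in seconds, so they should be downloaded soon after they are listed. A small sketch for checking when such a link expires, using only the standard library. presigned_expiry is a hypothetical helper name, and the example URL is shortened to the two parameters that matter here.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4-style presigned URL expires."""
    query = parse_qs(urlsplit(url).query)
    # X-Amz-Date uses the compact ISO 8601 form, e.g. 20240617T041334Z.
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity period in seconds.
    lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    return signed_at + lifetime


example = ("https://mdl.artvee.com/sdl/101112absdl.jpg"
           "?X-Amz-Date=20240617T041334Z&X-Amz-Expires=86400")
print(presigned_expiry(example).isoformat())  # 2024-06-18T04:13:34+00:00
```

With X-Amz-Expires=86400 the links above are valid for 24 hours after they were generated.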

But this is obviously missing metadata like title and category. Maybe I can add a new subcommand that keeps track of linked downloads, maybe called linkdl... hmm.

Edit: I added some functionality to the dl subcommand: https://github.com/chapmanjacobd/library/commit/b435c64f7df43beb5c1c7736f0250bcc71305a10
If you don't want AVIF as the output format, just remove --process-image:

lb dl --fs artvee.db --links --path-include https://mdl.artvee.com/sdl/ --process-image -l 2

The above command downloads 2 images. To download all images, use -l inf. The title, category, and source page are then stored alongside each file in the media table:

sqlite artvee.db 'select * from media where path like "%mdl.artvee.com/%"'
[{"id": 36897, "time_created": 1718601086, "time_modified": 1667856899, "time_deleted": 0, "path": "/home/xk/github/xk/lb/mdl.artvee.com/sdl/243300fgsdl.avif", "category": "figurative", "title": "Girl with Blue Headscarf", "size": 154760, "duration": 0, "time_downloaded": 1718601081, "fps": 0, "type": "image/avif", "webpath": "https://artvee.com/dl/girl-with-blue-headscarf/"},
 {"id": 37018, "time_created": 1718601105, "time_modified": 1666850711, "time_deleted": 0, "path": "/home/xk/github/xk/lb/mdl.artvee.com/sdl/228959fgsdl.avif", "category": "figurative", "title": "Femme Pensive", "size": 324079, "duration": 0, "time_downloaded": 1718601081, "fps": 0, "type": "image/avif", "webpath": "https://artvee.com/dl/femme-pensive/"}]
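
If you would rather read that metadata from Python than from the command line, here is a minimal sketch using the standard sqlite3 module. It assumes only the media table and the title, category, path and webpath columns visible in the output above. artvee_metadata is a hypothetical helper name.

```python
import sqlite3
from typing import List, Tuple


def artvee_metadata(conn: sqlite3.Connection) -> List[Tuple[str, str, str, str]]:
    """Return (title, category, path, webpath) for each downloaded artvee image."""
    rows = conn.execute(
        "SELECT title, category, path, webpath FROM media "
        "WHERE path LIKE ? ORDER BY title",
        # Same filter as the query above, passed as a bound parameter.
        ("%mdl.artvee.com/%",),
    )
    return rows.fetchall()
```

Call it with sqlite3.connect("artvee.db") to get one tuple per downloaded file, which is enough to rebuild a title and category listing.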